Highly Skilled Web Data Extractor

2 weeks ago


Allahabad, Uttar Pradesh, India beBeeSoftwareEngineer Full time ₹ 1,50,00,000 - ₹ 2,50,00,000
Web Scraping Engineer

We're building a high-throughput product data ingestion pipeline across hundreds of domains. You'll be responsible for the crawling/extraction layer end-to-end: HTTP-first crawling with a Playwright fallback, per-domain learned selectors, and reliable PDF handling (datasheets/specs).

This role covers both crawling (discovering and fetching pages via sitemaps and robots.txt) and scraping (extracting structured specs, images, and PDFs into our schema). Key responsibilities include designing an HTTP-first crawler, implementing sitemap diffing and conditional GETs, building a lightweight classifier to auto-route HTTP vs Playwright, enforcing per-domain throttles and backoff, adding URL normalization/canonicalization and de-duplication, handling PDF discovery and download, applying resource budgets to Playwright browser automation, integrating third-party APIs, owning automation and orchestration for scheduled runs, creating per-domain selectors, shipping observability, maintaining allow/deny paths, and adhering to robots.txt and Terms of Service.
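As a rough illustration of the HTTP vs Playwright routing mentioned above, a minimal Python sketch might look like the following. It uses the signals named in this posting (HTML length, JSON-LD presence, data-product markers); the threshold `MIN_HTML_CHARS`, the regular expressions, and the function names `needs_js` and `route` are assumptions made for illustration and do not describe the actual pipeline.

```python
import re

# Hypothetical threshold: static HTML shorter than this is treated as a JS shell.
MIN_HTML_CHARS = 2000

# Detects an embedded JSON-LD <script> block.
JSON_LD_RE = re.compile(
    r'<script[^>]+type=["\']application/ld\+json["\']', re.IGNORECASE
)
# Detects attributes such as data-product or data-product-id (assumed marker names).
PRODUCT_MARKER_RE = re.compile(r'\bdata-product[\w-]*\s*=', re.IGNORECASE)


def needs_js(html: str) -> bool:
    """Guess whether a plain HTTP response still needs browser rendering."""
    has_json_ld = bool(JSON_LD_RE.search(html))
    has_markers = bool(PRODUCT_MARKER_RE.search(html))
    if has_json_ld or has_markers:
        # Structured product data is already present in the static HTML.
        return False
    # No structured signals: fall back to a simple length check.
    return len(html) < MIN_HTML_CHARS


def route(html: str) -> str:
    """Return the fetcher that should handle this page."""
    return "playwright" if needs_js(html) else "http"
```

In practice the cut-off would likely be tuned per domain, and a long page with no structured signals may still need rendering; this sketch deliberately errs toward the cheaper HTTP path.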

Must-Haves
  • 4+ years of Python experience.
  • Strong skills in Scrapy or aiohttp/asyncio, and in Playwright (or Puppeteer), in production.
  • Practical proxy management, polite anti-bot tactics, and per-domain rate limiting.
  • Hands-on experience with ETag/Last-Modified, retries, backoff, and HTTP caching.
  • Confidence with CSS/XPath, schema.org/JSON-LD, and HTML parsing.
  • APIs: consuming REST/GraphQL (auth, pagination, backoff) and building small internal services.
  • Automation/orchestration: Airflow/Temporal/Celery (or equivalent schedulers/queues) for scheduled runs and monitoring.
  • PDF handling (requests/HEAD, hashing, size limits) and file integrity checks.
  • Queues (Redis/Kafka), Docker, and Linux basics.
  • Clear, pragmatic communication and strong ownership.
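For candidates wondering what "ETag/Last-Modified, retries, backoff" means day to day, here is a small Python sketch under stated assumptions. The header names `If-None-Match` and `If-Modified-Since` are standard HTTP; everything else (the function names, the 1-second base, the 60-second cap, the five-attempt limit, and the choice of full jitter) is hypothetical and chosen only for illustration.

```python
import random
from typing import Callable, Optional

# Status codes treated as "slow down and try again" in this sketch.
RETRYABLE_STATUSES = {429, 503}


def conditional_headers(etag: Optional[str], last_modified: Optional[str]) -> dict:
    """Build request headers for a conditional GET from stored validators."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def backoff_delay(
    attempt: int,
    base: float = 1.0,
    cap: float = 60.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Exponential backoff with full jitter.

    The delay is drawn uniformly between 0 and min(cap, base * 2**attempt).
    """
    return rng() * min(cap, base * (2 ** attempt))


def should_retry(status: int, attempt: int, max_attempts: int = 5) -> bool:
    """Retry only throttling/unavailable responses, up to max_attempts."""
    return status in RETRYABLE_STATUSES and attempt < max_attempts
```

A server that answers a conditional GET with 304 Not Modified lets an incremental run skip re-parsing the page entirely, which is where most of the savings come from.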

Key Responsibilities
  • Design an HTTP-first crawler (Scrapy or aiohttp) with Playwright fallback only for JS-heavy pages.
  • Implement sitemap diffing and conditional GETs (ETag/Last-Modified) for incremental runs.
  • Build a lightweight 'needs JS?' classifier (HTML length, JSON-LD presence, data-product markers) to auto-route HTTP vs Playwright.
  • Enforce per-domain throttles/backoff (2–4 concurrent/domain; auto-lower on 429/503).
  • Add URL normalization/canonicalization and de-dup (respect rel=canonical; hash PDFs).
  • Handle PDF discovery & download (HEAD first to dedupe; size/concurrency caps; SHA-256 keys).
  • Apply Playwright browser automation resource budgets (block images/fonts/analytics; kill outliers by size/CPU/time).
  • Integrate third-party APIs (REST/GraphQL) as first-class sources: handle auth (API keys/OAuth2), pagination, and rate limits; unify API + crawl outputs.
  • Own automation & orchestration for scheduled runs (Airflow/Temporal/Celery or cron), idempotent retries, and alerting.
  • Create per-domain selectors (YAML) with verification on hold-outs; re-learn only when health drops.
  • Ship observability: per-site field coverage, error rates, retries, avg page time, and PDF success.
  • Maintain allow/deny paths; adhere to robots.txt and Terms of Service.
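To make the URL normalization and SHA-256 keying items above more concrete, here is a minimal Python sketch using only the standard library. The normalization rules shown (lowercasing scheme and host, dropping default ports and fragments, sorting query parameters, removing tracking parameters) and the `TRACKING_PARAMS` list are assumptions for illustration; the real rules would be decided per domain.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical set of tracking parameters to drop before de-duplication.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"}
DEFAULT_PORTS = {("http", 80), ("https", 443)}


def normalize_url(url: str) -> str:
    """Return a canonical form of url suitable as a de-duplication key.

    Note: userinfo is dropped and IPv6 literals are not handled in this sketch.
    """
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it differs from the scheme default.
    if parts.port and (scheme, parts.port) not in DEFAULT_PORTS:
        host = f"{host}:{parts.port}"
    # Drop tracking parameters and sort the rest for a stable ordering.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    )
    path = parts.path or "/"
    # The fragment is discarded because it never reaches the server.
    return urlunsplit((scheme, host, path, urlencode(query), ""))


def pdf_key(content: bytes) -> str:
    """Content-addressed key for a downloaded PDF (hex SHA-256 digest)."""
    return hashlib.sha256(content).hexdigest()
```

For example, `normalize_url("HTTPS://Example.com:443/a/b?utm_source=x&b=2&a=1#frag")` yields `https://example.com/a/b?a=1&b=2`, so the same datasheet reached through differently decorated links maps to a single key.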
How We Work
  • Deliver in small, measurable increments.
  • Track coverage and freshness as north-star metrics.
  • Prefer simple designs that are easy to operate at scale.
Benefits

We offer competitive compensation. Please include your expected salary range in INR LPA and any variable/benefits expectations.


