Senior Web Scraping Engineer

2 days ago


India TripleChoice Inc Full time
Senior Web Scraping Engineer (Python) — India (Remote)

Employment type: Full-time (open to contract-to-hire)

Work location: Remote in India

Time overlap: Prefer 2–3 hours/day with Pacific Time (PST)

About the role

We're building a high-throughput product data ingestion pipeline across hundreds of domains. You'll own the crawling/extraction layer end‑to‑end: crawling with a Playwright fallback, per‑domain learned selectors, and reliable PDF handling (datasheets/specs). You'll also drive the automation around scheduling, retries, and monitoring so runs are hands‑off, and you'll integrate vendor/public APIs (REST/GraphQL) wherever available to complement crawling.

This role spans crawling (discovering & fetching pages via sitemaps/robots) and scraping (extracting structured specs, images, and PDFs into our schema).

What you'll do
  • Design an HTTP crawler (Scrapy or aiohttp) with Playwright fallback only for JS‑heavy pages.
  • Implement sitemap diffing and conditional GETs (ETag/Last-Modified) for incremental runs.
  • Build a lightweight "needs JS?" classifier (HTML length, JSON‑LD presence, data‑product markers) to auto‑route HTTP vs Playwright.
  • Enforce per-domain throttles/backoff (2–4 concurrent/domain; auto‑lower on 429/503).
  • Add URL normalization/canonicalization and de‑dup (respect rel=canonical; hash PDFs).
  • Handle PDF discovery & download (HEAD first to dedupe; size/concurrency caps; SHA‑256 keys).
  • Apply resource budgets to Playwright browser automation (block images/fonts/analytics; kill outliers by size/CPU/time).
  • Integrate third‑party APIs (REST/GraphQL) as first‑class sources: handle auth (API keys/OAuth2), pagination, and rate limits; unify API + crawl outputs.
  • Own automation & orchestration for scheduled runs (Airflow/Temporal/Celery or cron), idempotent retries, and alerting.
  • Create per‑domain selectors (YAML) with verification on hold‑outs; re‑learn only when health drops.
  • Ship observability: per‑site field coverage, error rates, retries, avg page time, and PDF success.
  • Maintain allow/deny paths; adhere to robots.txt and Terms of Service.
  • Containerize workers; provide runbooks/CI; collaborate with data team on schemas/normalization.
Must‑have qualifications
  • 4+ years Python, including 2+ years building production web crawlers at scale.
  • Strong with Scrapy or aiohttp and Playwright (or Puppeteer) in production.
  • Practical proxy management, polite anti‑bot tactics, and per‑domain rate limiting.
  • Hands‑on with ETag/Last-Modified, retries, backoff, and caching.
  • Confident with CSS/XPath, JSON‑LD, and HTML parsing.
  • APIs: consuming REST/GraphQL (auth, pagination, backoff) and building small internal services (FastAPI or similar).
  • Automation/Orchestration: Airflow/Temporal/Celery (or equivalent schedulers/queues) for scheduled runs and monitoring.
  • PDF handling (requests/HEAD, hashing, size limits) and file integrity checks.
  • Queues (Redis/Kafka), Docker, Linux basics; comfort with logs/metrics.
  • Clear, pragmatic communication and strong ownership.
Nice to have
  • Go or experience for high‑performance crawlers.
  • Cloud: AWS/GCP, S3, ECS/Kubernetes; IaC basics.
  • Workflow engines: Airflow/Temporal/Argo/Celery.
  • Document extraction: Textract/Tika/Camelot/Tabula.
  • Search/analytics: Elasticsearch/OpenSearch; warehousing (Snowflake/Postgres).
  • LLM‑assisted selector generation with deterministic verification (optional).
How we work
  • Ship in small, measurable increments.
  • Track coverage and freshness as north‑star metrics.
  • Prefer simple designs that are easy to operate at scale.
Compensation

Competitive; please include your expected CTC (INR LPA) and any variable/benefits expectations.

Application

Please apply with your resume and links to relevant repos or code samples. Include concise notes on:

  1. a crawler you ran at 100+ sites/day (or similar scale),
  2. how you handled rate limits/retries, and
  3. your approach to PDF discovery/dedup.
