Senior Web Scraping Engineer (Python) — India (Remote)

TripleChoice Inc · Bengaluru, Karnataka, India · Full time · Posted 2 weeks ago

Employment type: Full-time (open to contract-to-hire)

Work location: Remote in India

Time overlap: Prefer 2–3 hours/day of overlap with Pacific Time (PT)

About the role

We're building a high-throughput product data ingestion pipeline across hundreds of domains. You'll own the crawling/extraction layer end‑to‑end: HTTP-first crawling with a Playwright fallback, per‑domain learned selectors, and reliable PDF handling (datasheets/specs). You'll also drive the automation around scheduling, retries, and monitoring so runs are hands‑off, and you'll integrate vendor/public APIs (REST/GraphQL) wherever available to complement crawling.

This role spans crawling (discovering & fetching pages via sitemaps/robots) and scraping (extracting structured specs, images, and PDFs into our schema).

What you'll do
  • Design an HTTP-first crawler (Scrapy or aiohttp) with Playwright fallback only for JS‑heavy pages (illustrative sketches for this and several of the items below follow the list).
  • Implement sitemap diffing and conditional GETs (ETag/Last-Modified) for incremental runs.
  • Build a lightweight "needs JS?" classifier (HTML length, JSON‑LD presence, data‑product markers) to auto‑route HTTP vs Playwright.
  • Enforce per-domain throttles/backoff (2–4 concurrent/domain; auto‑lower on 429/503).
  • Add URL normalization/canonicalization and de‑dup (respect canonical links; hash PDFs).
  • Handle PDF discovery & download (HEAD first to dedupe; size/concurrency caps; SHA‑256 keys).
  • Apply resource budgets to Playwright browser automation (block images/fonts/analytics; kill outliers by size/CPU/time).
  • Integrate third‑party APIs (REST/GraphQL) as first‑class sources: handle auth (API keys/OAuth2), pagination, and rate limits; unify API + crawl outputs.
  • Own automation & orchestration for scheduled runs (Airflow/Temporal/Celery or cron), idempotent retries, and alerting.
  • Create per‑domain selectors (YAML) with verification on hold‑outs; re‑learn only when health drops.
  • Ship observability: per‑site field coverage, error rates, retries, avg page time, and PDF success.
  • Maintain allow/deny paths; adhere to robots.txt and Terms of Service.
  • Containerize workers; provide runbooks/CI; collaborate with data team on schemas/normalization.
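
The sketches below are illustrative only: module names, helper functions, thresholds, and marker strings are assumptions for this posting, not excerpts from our codebase. First, a minimal version of the "needs JS?" heuristic used to auto-route URLs between plain HTTP and the Playwright fallback, keying on HTML length and the presence of JSON‑LD or product markers.

```python
# Hypothetical sketch of the "needs JS?" routing heuristic; the threshold and
# marker strings are assumed values, not tuned production settings.
import re

JSONLD_RE = re.compile(r'<script[^>]+type=["\']application/ld\+json["\']', re.IGNORECASE)
PRODUCT_MARKERS = ("data-product", 'itemtype="https://schema.org/Product"')  # assumed markers

def needs_js(html: str, min_html_len: int = 20_000) -> bool:
    """True when the raw HTTP response looks too thin to extract from,
    so the URL should be re-fetched with Playwright."""
    if JSONLD_RE.search(html) or any(marker in html for marker in PRODUCT_MARKERS):
        return False                      # structured data already present in static HTML
    return len(html) < min_html_len       # suspiciously small page -> likely JS-rendered

def route(html: str) -> str:
    """Route a fetched page to the cheap HTTP path or the Playwright fallback."""
    return "playwright" if needs_js(html) else "http"
```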
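
A minimal aiohttp sketch of incremental fetching with conditional GETs (ETag/Last-Modified); the in-memory `cache` dict stands in for whatever persistent per-URL metadata store the pipeline uses.

```python
# Conditional-GET sketch: skip re-parsing pages that return 304 Not Modified.
# The in-memory `cache` is a stand-in for a persistent per-URL metadata store.
import asyncio
import aiohttp

cache: dict[str, dict] = {}   # url -> {"etag": ..., "last_modified": ...}

async def fetch_if_changed(session: aiohttp.ClientSession, url: str) -> bytes | None:
    meta = cache.get(url, {})
    headers = {}
    if meta.get("etag"):
        headers["If-None-Match"] = meta["etag"]
    if meta.get("last_modified"):
        headers["If-Modified-Since"] = meta["last_modified"]

    async with session.get(url, headers=headers) as resp:
        if resp.status == 304:            # unchanged since the last run -> nothing to do
            return None
        body = await resp.read()
        cache[url] = {
            "etag": resp.headers.get("ETag"),
            "last_modified": resp.headers.get("Last-Modified"),
        }
        return body

async def main() -> None:
    async with aiohttp.ClientSession() as session:
        body = await fetch_if_changed(session, "https://example.com/sitemap.xml")
        print("changed" if body else "not modified")

asyncio.run(main())
```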
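
A sketch of per-domain concurrency caps with backoff on 429/503. The cap of 3 (within the 2–4 range above) and the doubling delay are illustrative, and only the numeric form of Retry-After is handled here.

```python
# Per-domain politeness sketch: cap concurrent requests per domain and back off
# on 429/503. The cap of 3 and the doubling delay are illustrative defaults.
import asyncio
from collections import defaultdict
from urllib.parse import urlparse

import aiohttp

# Lazily created semaphore per domain; at most 3 requests in flight per domain.
semaphores: dict[str, asyncio.Semaphore] = defaultdict(lambda: asyncio.Semaphore(3))

async def polite_get(session: aiohttp.ClientSession, url: str, max_attempts: int = 5) -> bytes:
    domain = urlparse(url).netloc
    async with semaphores[domain]:
        delay = 1.0
        for _ in range(max_attempts):
            async with session.get(url) as resp:
                if resp.status in (429, 503):
                    # Honour a numeric Retry-After when present, else exponential backoff.
                    retry_after = resp.headers.get("Retry-After")
                    wait = float(retry_after) if retry_after and retry_after.isdigit() else delay
                    await asyncio.sleep(wait)
                    delay *= 2
                    continue
                resp.raise_for_status()
                return await resp.read()
    raise RuntimeError(f"gave up on {url} after {max_attempts} attempts")
```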
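
A sketch of the HEAD-first PDF dedupe flow keyed by SHA-256. The fingerprint set, the 50 MB cap, and the local store_pdf helper are hypothetical stand-ins for the real metadata store and blob storage.

```python
# PDF handling sketch: HEAD first to skip known files and enforce a size cap,
# then stream the download and key the stored object by its SHA-256 hash.
# `seen_fingerprints` and `store_pdf` are hypothetical stand-ins.
import hashlib
import requests

MAX_PDF_BYTES = 50 * 1024 * 1024                  # assumed size cap: 50 MB
seen_fingerprints: set[tuple] = set()             # (url, size, etag) of files already stored

def store_pdf(sha256: str, data: bytes) -> None:
    # Placeholder for the real blob store: write locally, keyed by content hash.
    with open(f"{sha256}.pdf", "wb") as fh:
        fh.write(data)

def fetch_pdf(url: str) -> str | None:
    head = requests.head(url, allow_redirects=True, timeout=30)
    size = int(head.headers.get("Content-Length", 0))
    fingerprint = (url, size, head.headers.get("ETag"))
    if fingerprint in seen_fingerprints or size > MAX_PDF_BYTES:
        return None                               # already fetched, or over the size cap

    resp = requests.get(url, stream=True, timeout=60)
    resp.raise_for_status()
    digest, chunks = hashlib.sha256(), []
    for chunk in resp.iter_content(chunk_size=1 << 20):
        digest.update(chunk)
        chunks.append(chunk)
    sha256 = digest.hexdigest()
    store_pdf(sha256, b"".join(chunks))
    seen_fingerprints.add(fingerprint)
    return sha256
```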
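
A sketch of the Playwright resource budgets: abort image/font/media/analytics requests via request routing and cap per-page time so outlier pages cannot stall a worker. The blocked-host list is an assumption.

```python
# Playwright budget sketch: abort heavy or irrelevant requests and cap page time.
# The blocked resource types and analytics hosts are illustrative assumptions.
from playwright.sync_api import sync_playwright

BLOCKED_TYPES = {"image", "font", "media"}
BLOCKED_HOSTS = ("googletagmanager.com", "google-analytics.com")

def render(url: str, timeout_ms: int = 30_000) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        def gate(route):
            request = route.request
            if request.resource_type in BLOCKED_TYPES or any(h in request.url for h in BLOCKED_HOSTS):
                route.abort()
            else:
                route.continue_()

        page.route("**/*", gate)
        page.goto(url, timeout=timeout_ms, wait_until="domcontentloaded")
        html = page.content()
        browser.close()
        return html
```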
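
A sketch of consuming a paginated vendor REST API with bearer-token auth and basic rate-limit handling, so API records can be unified with crawl output. The cursor parameter and response fields are assumptions, not any particular vendor's API.

```python
# REST pagination sketch: bearer auth, cursor pagination, retry on 429.
# `results` / `next_cursor` are assumed response fields, not a real vendor schema.
import time
import requests

def fetch_all(base_url: str, api_key: str) -> list[dict]:
    items: list[dict] = []
    cursor = None
    headers = {"Authorization": f"Bearer {api_key}"}
    while True:
        params = {"cursor": cursor} if cursor else {}
        resp = requests.get(base_url, headers=headers, params=params, timeout=30)
        if resp.status_code == 429:                # rate limited: wait, then retry the same page
            time.sleep(int(resp.headers.get("Retry-After", 5)))
            continue
        resp.raise_for_status()
        payload = resp.json()
        items.extend(payload["results"])
        cursor = payload.get("next_cursor")
        if not cursor:
            return items
```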
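
Finally, a sketch of what a per-domain YAML selector file might look like and how it could be applied with parsel (Scrapy's selector library); the YAML layout, field names, and selectors are illustrative.

```python
# Per-domain selector sketch: selectors live in YAML, extraction is a thin loop.
# The YAML layout, field names, and selectors are illustrative assumptions.
import yaml
from parsel import Selector

DOMAIN_RULES = yaml.safe_load("""
# example.com.yaml (illustrative per-domain selector file)
fields:
  title: "h1.product-title::text"
  price: "span.price::text"
  datasheet_url: "a.datasheet::attr(href)"
""")

def extract(html: str) -> dict:
    sel = Selector(text=html)
    return {field: sel.css(css).get() for field, css in DOMAIN_RULES["fields"].items()}
```

Per-domain health checks on held-out pages would then decide when these selectors need to be re-learned.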
Must‑have qualifications
  • 4+ years Python, including 2+ years building production web crawlers at scale.
  • Strong with Scrapy or aiohttp/asyncio and Playwright (or Puppeteer) in production.
  • Practical proxy management, polite anti‑bot tactics, and per‑domain rate limiting.
  • Hands‑on with ETag/Last-Modified, retries, backoff, and HTTP caching.
  • Confident with CSS/XPath, schema.org/JSON‑LD, and HTML parsing.
  • APIs: consuming REST/GraphQL (auth, pagination, backoff) and building small internal services (FastAPI or similar).
  • Automation/Orchestration: Airflow/Temporal/Celery (or equivalent schedulers/queues) for scheduled runs and monitoring.
  • PDF handling (requests/HEAD, hashing, size limits) and file integrity checks.
  • Queues (Redis/Kafka), Docker, Linux basics; comfort with logs/metrics.
  • Clear, pragmatic communication and strong ownership.
Nice to have
  • Go or Node.js experience for high‑performance crawlers.
  • Cloud: AWS/GCP, S3, ECS/Kubernetes; IaC basics.
  • Workflow engines: Airflow/Temporal/Argo/Celery.
  • Document extraction: Textract/Tika/Camelot/Tabula.
  • Search/analytics: Elasticsearch/OpenSearch; warehousing (Snowflake/Postgres).
  • LLM‑assisted selector generation with deterministic verification (optional).
How we work
  • Ship in small, measurable increments.
  • Track coverage and freshness as north‑star metrics.
  • Prefer simple designs that are easy to operate at scale.
Compensation

Competitive; please include your expected CTC (INR LPA) and any variable/benefits expectations.

Application

Please apply with your resume and links to relevant repos or code samples. Include concise notes on:

  • a crawler you ran at 100+ sites/day (or similar scale),
  • how you handled rate limits/retries, and
  • your approach to PDF discovery/dedup.

