
Senior Web Scraping Engineer
2 days ago
Employment type: Full-time (open to contract-to-hire)
Work location: Remote in India
Time overlap: Prefer 2–3 hours/day with Pacific Time (PST)
About the role
We're building a high-throughput product data ingestion pipeline across hundreds of domains. You'll own the crawling/extraction layer end‑to‑end: crawling with a Playwright fallback, per‑domain learned selectors, and reliable PDF handling (datasheets/specs). You'll also drive the automation around scheduling, retries, and monitoring so runs are hands‑off, and you'll integrate vendor/public APIs (REST/GraphQL) wherever available to complement crawling.
This role spans crawling (discovering & fetching pages via sitemaps/robots) and scraping (extracting structured specs, images, and PDFs into our schema).
What you'll do
- Design an async crawler (Scrapy or aiohttp) with a Playwright fallback only for JS‑heavy pages.
- Implement sitemap diffing and conditional GETs (ETag/Last-Modified) for incremental runs.
- Build a lightweight "needs JS?" classifier (HTML length, JSON‑LD presence, data‑product markers) to auto‑route pages to plain HTTP or Playwright.
- Enforce per-domain throttles/backoff (2–4 concurrent/domain; auto‑lower on 429/503).
- Add URL normalization/canonicalization and de‑dup (respect rel=canonical; hash PDFs).
- Handle PDF discovery & download (HEAD first to dedupe; size/concurrency caps; SHA‑256 keys).
- Apply resource budgets to Playwright browser automation (block images/fonts/analytics; kill outliers by size/CPU/time).
- Integrate third‑party APIs (REST/GraphQL) as first‑class sources: handle auth (API keys/OAuth2), pagination, and rate limits; unify API + crawl outputs.
- Own automation & orchestration for scheduled runs (Airflow/Temporal/Celery or cron), idempotent retries, and alerting.
- Create per‑domain selectors (YAML) with verification on hold‑outs; re‑learn only when health drops.
- Ship observability: per‑site field coverage, error rates, retries, avg page time, and PDF success.
- Maintain allow/deny paths; adhere to robots.txt and Terms of Service.
- Containerize workers; provide runbooks/CI; collaborate with data team on schemas/normalization.
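As an illustration of the "needs JS?" routing idea in the list above, here is a minimal sketch that scores a statically fetched page on the three signals named there (HTML length, JSON‑LD presence, data‑product markers). The function name `needs_js`, the 5,000-byte threshold, the two-of-three rule, and the `data-product*` attribute pattern are assumptions for illustration, not part of the posting.

```python
import re

# Illustrative assumptions: threshold, marker pattern, and voting rule.
MIN_HTML_BYTES = 5_000
JSON_LD_RE = re.compile(
    r'<script[^>]+type=["\']application/ld\+json["\']', re.IGNORECASE
)
PRODUCT_MARKER_RE = re.compile(r'\bdata-product[\w-]*\s*=', re.IGNORECASE)


def needs_js(html: str) -> bool:
    """Return True when static HTML looks like an empty JS shell.

    Each signal that suggests real server-rendered content counts as one
    vote; fewer than two votes routes the URL to the browser fallback.
    """
    votes = 0
    if len(html.encode("utf-8")) >= MIN_HTML_BYTES:
        votes += 1  # page body is large enough to plausibly hold content
    if JSON_LD_RE.search(html):
        votes += 1  # structured data is embedded in the static HTML
    if PRODUCT_MARKER_RE.search(html):
        votes += 1  # product attributes are present without rendering
    return votes < 2


if __name__ == "__main__":
    shell = '<html><body><div id="root"></div></body></html>'
    print(needs_js(shell))
```

In practice the thresholds would likely be tuned per domain against pages whose correct route is already known.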
- 4+ years Python, including 2+ years building production web crawlers at scale.
- Strong with Scrapy or aiohttp and Playwright (or Puppeteer) in production.
- Practical proxy management, polite anti‑bot tactics, and per‑domain rate limiting.
- Hands‑on with ETag/Last-Modified, retries, backoff, and caching.
- Confident with CSS/XPath, JSON‑LD, and HTML parsing.
- APIs: consuming REST/GraphQL (auth, pagination, backoff) and building small internal services (FastAPI or similar).
- Automation/Orchestration: Airflow/Temporal/Celery (or equivalent schedulers/queues) for scheduled runs and monitoring.
- PDF handling (requests/HEAD, hashing, size limits) and file integrity checks.
- Queues (Redis/Kafka), Docker, Linux basics; comfort with logs/metrics.
- Clear, pragmatic communication and strong ownership.
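The per‑domain rate limiting and backoff mentioned above could look roughly like the sketch below: concurrency starts at 4, drops by one on each 429/503, and retries wait an exponentially growing, jittered delay. The class name `DomainThrottle`, the floor of 1, and the delay constants are hypothetical choices, not requirements from the posting.

```python
import random
from dataclasses import dataclass


@dataclass
class DomainThrottle:
    """Tracks concurrency and retry delay for one domain (illustrative)."""

    concurrency: int = 4        # upper end of the 2-4 concurrent/domain range
    min_concurrency: int = 1    # assumed floor
    base_delay: float = 1.0     # assumed seconds before the first retry
    max_delay: float = 60.0     # assumed cap on any single wait
    failures: int = 0           # consecutive throttling responses

    def record(self, status: int) -> None:
        """Update state from an HTTP status code."""
        if status in (429, 503):
            self.failures += 1
            self.concurrency = max(self.min_concurrency, self.concurrency - 1)
        elif 200 <= status < 400:
            self.failures = 0

    def next_delay(self) -> float:
        """Seconds to wait before the next request to this domain."""
        if self.failures == 0:
            return 0.0
        ceiling = min(self.max_delay, self.base_delay * 2 ** (self.failures - 1))
        # Full jitter: spread retries so workers do not wake up in lockstep.
        return random.uniform(0.0, ceiling)


if __name__ == "__main__":
    throttle = DomainThrottle()
    for status in (200, 429, 503, 200):
        throttle.record(status)
        print(status, throttle.concurrency, round(throttle.next_delay(), 2))
```

This sketch never raises concurrency again after a domain recovers; a real crawler would probably restore it gradually.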
- Go or experience for high‑performance crawlers.
- Cloud: AWS/GCP, S3, ECS/Kubernetes; IaC basics.
- Workflow engines: Airflow/Temporal/Argo/Celery.
- Document extraction: Textract/Tika/Camelot/Tabula.
- Search/analytics: Elasticsearch/OpenSearch; warehousing (Snowflake/Postgres).
- LLM‑assisted selector generation with deterministic verification (optional).
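"Deterministic verification" of selectors, whether hand-written or LLM-suggested, can be as simple as replaying them against hold‑out pages with known values and measuring the hit rate per field. The sketch below assumes well-formed markup so that the standard library's limited XPath support is enough; the dict of selectors stands in for a per‑domain YAML file, and the function names and the 0.9 health threshold are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional, Tuple


def extract(page: str, xpath: str) -> Optional[str]:
    """Return the stripped text of the first node matching xpath, if any."""
    node = ET.fromstring(page).find(xpath)
    if node is None or node.text is None:
        return None
    return node.text.strip()


def selector_health(
    selectors: Dict[str, str],
    holdouts: List[Tuple[str, Dict[str, str]]],
) -> Dict[str, float]:
    """Fraction of hold-out pages on which each field's selector is correct."""
    health = {}
    for field, xpath in selectors.items():
        hits = sum(
            1 for page, expected in holdouts
            if extract(page, xpath) == expected.get(field)
        )
        health[field] = hits / len(holdouts)
    return health


if __name__ == "__main__":
    selectors = {
        "title": ".//h1[@class='title']",
        "price": ".//span[@class='price']",
    }
    page_a = "<html><body><h1 class='title'>Widget A</h1><span class='price'>10</span></body></html>"
    page_b = "<html><body><h1 class='title'>Widget B</h1><div class='cost'>12</div></body></html>"
    holdouts = [
        (page_a, {"title": "Widget A", "price": "10"}),
        (page_b, {"title": "Widget B", "price": "12"}),
    ]
    health = selector_health(selectors, holdouts)
    # Re-learn only the fields whose health dropped below the assumed threshold.
    print([field for field, score in health.items() if score < 0.9])
```

Real pages are rarely well-formed XML, so a production version would swap in an HTML-tolerant parser while keeping the same health calculation.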
- Ship in small, measurable increments.
- Track coverage and freshness as north‑star metrics.
- Prefer simple designs that are easy to operate at scale.
Competitive; please include your expected CTC (INR LPA) and any variable/benefits expectations.
Application
Please apply with your resume and links to relevant repos or code samples. Include concise notes on:
- a crawler you ran at 100+ sites/day (or similar scale),
- how you handled rate limits/retries, and
- your approach to PDF discovery/dedup.
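For reference, one possible shape of the PDF dedup flow described in this posting (HEAD first, size caps, SHA‑256 keys) is sketched below. It works on already-fetched headers and bytes so it stays network-free; the class name `PdfDeduper`, the 50 MB cap, and the assumption that header names arrive in canonical case are all illustrative.

```python
import hashlib
from typing import Dict, Tuple

MAX_PDF_BYTES = 50 * 1024 * 1024  # assumed size cap


class PdfDeduper:
    """Skips unchanged or oversized PDFs and keys stored files by SHA-256."""

    def __init__(self) -> None:
        self.etag_by_url: Dict[str, str] = {}
        self.url_by_hash: Dict[str, str] = {}

    def should_download(self, url: str, head_headers: Dict[str, str]) -> bool:
        """Decide from HEAD response headers whether a GET is worthwhile."""
        size = int(head_headers.get("Content-Length") or 0)
        if size > MAX_PDF_BYTES:
            return False  # over the size cap
        etag = head_headers.get("ETag")
        # Same URL with the same ETag means the file has not changed.
        return not (etag and self.etag_by_url.get(url) == etag)

    def store(
        self, url: str, head_headers: Dict[str, str], body: bytes
    ) -> Tuple[str, bool]:
        """Record a downloaded PDF; return (sha256 key, True if content is new)."""
        key = hashlib.sha256(body).hexdigest()
        etag = head_headers.get("ETag")
        if etag:
            self.etag_by_url[url] = etag
        is_new = key not in self.url_by_hash
        self.url_by_hash.setdefault(key, url)  # keep the first URL seen
        return key, is_new


if __name__ == "__main__":
    dedup = PdfDeduper()
    headers = {"ETag": '"abc"', "Content-Length": "4"}
    url = "https://example.com/a.pdf"
    if dedup.should_download(url, headers):
        print(dedup.store(url, headers, b"%PDF"))
    print(dedup.should_download(url, headers))
```

The ETag check avoids re-downloading an unchanged URL, while the content hash catches the same datasheet served from different URLs.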
-
Data Engineer
1 day ago
India Alternative Path Full time
Alternative Path is seeking skilled software developers to collaborate on client projects with an asset management firm. In this role, you will collaborate with individuals across various company departments to shape and innovate new products and features for our platform, enhancing existing ones. You will have a large degree of independence and trust, but...
-
Data Engineer
5 hours ago
India Alternative Path Full time
Alternative Path is seeking skilled software developers to collaborate on client projects with an asset management firm. In this role, you will collaborate with individuals across various company departments to shape and innovate new products and features for our platform, enhancing existing ones. You will have a large degree of independence and trust, but...
-
Data Engineer
22 hours ago
India Alternative Path Full time
Alternative Path is seeking skilled software developers to collaborate on client projects with an asset management firm. In this role, you will collaborate with individuals across various company departments to shape and innovate new products and features for our platform, enhancing existing ones. You will have a large degree of independence and trust, but...
-
Senior Web Scraping and Data Automation Expert
2 weeks ago
India beBeeDataAutomation Full time ₹ 9,00,000 - ₹ 12,00,000
Senior Web Scraping and Data Automation Expert
We are seeking a high-level web scraping/data extraction specialist for a long-term freelance project that requires advanced technical skills, reliability, and creativity. This mission involves building a strategic data system that extracts and structures data from various online sources with dynamic content,...
-
sr python developer web scraping
7 hours ago
Ahmedabad, India Actowiz Solutions Full time
Job Description Job Title: Senior Python Developer Web Scraping & Automation Company: Actowiz Solutions Location: Ahmedabad Job Type: Full-time Working Days: 5 Days a Week About Us Actowiz Solutions is a leading provider of data extraction, web scraping, and automation solutions. We empower businesses with actionable insights by delivering clean,...
-
Web Scraping Data Engineer
1 week ago
India Remote YipitData (Alternative) Full time ₹ 5,00,000 - ₹ 10,00,000 per year
About Us: YipitData is the leading market research and analytics firm for the disruptive economy and most recently raised $475M from The Carlyle Group at a valuation of over $1B. Every day, our proprietary technology analyzes billions of alternative data points to uncover actionable insights across sectors like software, AI, cloud, e-commerce, ridesharing,...
-
Urgent Search! Data Engineer
8 hours ago
India Alternative Path Full time
Alternative Path is seeking skilled software developers to collaborate on client projects with an asset management firm. In this role, you will collaborate with individuals across various company departments to shape and innovate new products and features for our platform, enhancing existing ones. You will have a large degree of independence and trust, but...
-
Web Scraping Specialist
1 week ago
India beBeeExpertise Full time ₹ 15,00,000 - ₹ 20,00,000
Job Title: Web Developer
Job Description: We are seeking a highly skilled web scraping expert to lead a long-term project with significant ambitions.
About the Mission: Our objective is to create a strategic data system requiring extraction and structuring from various online sources involving dynamic content, custom headers, request simulation, automation...
-
Urgent: Senior Web Scraping Engineer
4 days ago
India TripleChoice Inc Full time
Senior Web Scraping Engineer (Python) - India (Remote)
Employment type: Full-time (open to contract-to-hire)
Work location: Remote in India
Time overlap: Prefer 2–3 hours/day with Pacific Time (PST)
About the role
We're building a high-throughput product data ingestion pipeline across hundreds of domains. You'll own the crawling/extraction layer...
-
Junior Web Crawling Engineer
2 weeks ago
India Forage AI Full time
Job Description
We are seeking a Junior Web Crawling Engineer who will be responsible for building and maintaining web crawlers, extracting valuable insights from the web, and ensuring data quality. The ideal candidate will have strong Python programming skills and experience in web scraping frameworks, browser automation tools, and handling anti-scraping...