
Senior Web Scraping Engineer
2 weeks ago
Employment type: Full-time (open to contract-to-hire)
Work location: Remote in India
Time overlap: Prefer 2–3 hours/day with Pacific Time (PST)
About the role
We're building a high-throughput product data ingestion pipeline across hundreds of domains. You'll own the crawling/extraction layer end‑to‑end: HTTP-first crawling with a Playwright fallback, per‑domain learned selectors, and reliable PDF handling (datasheets/specs). You'll also drive the automation around scheduling, retries, and monitoring so runs are hands‑off, and you'll integrate vendor/public APIs (REST/GraphQL) wherever available to complement crawling.
This role spans crawling (discovering & fetching pages via sitemaps/robots) and scraping (extracting structured specs, images, and PDFs into our schema).
What you'll do
- Design an HTTP-first crawler (Scrapy or aiohttp) with Playwright fallback only for JS‑heavy pages.
- Implement sitemap diffing and conditional GETs (ETag/Last-Modified) for incremental runs.
- Build a lightweight "needs JS?" classifier (HTML length, JSON‑LD presence, data‑product markers) to auto‑route HTTP vs Playwright.
- Enforce per-domain throttles/backoff (2–4 concurrent/domain; auto‑lower on 429/503).
- Add URL normalization/canonicalization and de‑dup (respect ; hash PDFs).
- Handle PDF discovery & download (HEAD first to dedupe; size/concurrency caps; SHA‑256 keys).
- Apply resource budgets to Playwright browser automation (block images/fonts/analytics; kill outliers by size/CPU/time).
- Integrate third‑party APIs (REST/GraphQL) as first‑class sources: handle auth (API keys/OAuth2), pagination, and rate limits; unify API + crawl outputs.
- Own automation & orchestration for scheduled runs (Airflow/Temporal/Celery or cron), idempotent retries, and alerting.
- Create per‑domain selectors (YAML) with verification on hold‑outs; re‑learn only when health drops.
- Ship observability: per‑site field coverage, error rates, retries, avg page time, and PDF success.
- Maintain allow/deny paths; adhere to robots.txt and Terms of Service.
- Containerize workers; provide runbooks/CI; collaborate with data team on schemas/normalization.
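To make two of the items above concrete, here is a minimal sketch of the "needs JS?" routing check and the per-domain throttle that auto-lowers on 429/503. It is an illustration under stated assumptions, not our production code: the names `needs_js` and `DomainThrottle`, the 2,000-character threshold, the regexes, the floor of one in-flight request, and the delay-doubling policy are all hypothetical.

```python
import re
from dataclasses import dataclass

# Hypothetical threshold; in practice this would be tuned per domain.
MIN_HTML_CHARS = 2_000

# Signals named in the list above: JSON-LD presence and data-product markers.
JSON_LD_RE = re.compile(r'<script[^>]+type=["\']application/ld\+json["\']', re.I)
PRODUCT_MARKER_RE = re.compile(r'data-product[\w-]*\s*=', re.I)


def needs_js(html: str) -> bool:
    """Return True when the plain HTTP response looks too thin to extract from."""
    if JSON_LD_RE.search(html) or PRODUCT_MARKER_RE.search(html):
        return False  # structured product data is already in the static HTML
    return len(html) < MIN_HTML_CHARS  # short shell page: route to Playwright


@dataclass
class DomainThrottle:
    """Per-domain concurrency limit that lowers itself on 429/503."""
    limit: int = 4      # current max concurrent requests for the domain
    floor: int = 1      # assumed lower bound; never drop below one request
    delay: float = 0.0  # seconds to wait before the next request

    def record(self, status: int) -> None:
        if status in (429, 503):
            # Server is pushing back: fewer parallel requests, longer pauses.
            self.limit = max(self.floor, self.limit - 1)
            self.delay = min(60.0, max(1.0, self.delay * 2))
        elif 200 <= status < 400:
            # Healthy response: relax the delay; the limit is not raised here.
            self.delay = self.delay / 2 if self.delay > 1.0 else 0.0
```

A scheduler would call `needs_js` on the HTTP response body before deciding whether to queue a Playwright fetch, and call `DomainThrottle.record` with each response status before dispatching the next request to that domain.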
Requirements
- 4+ years Python, including 2+ years building production web crawlers at scale.
- Strong with Scrapy or aiohttp/asyncio and Playwright (or Puppeteer) in production.
- Practical proxy management, polite anti‑bot tactics, and per‑domain rate limiting.
- Hands‑on with ETag/Last-Modified, retries, backoff, and HTTP caching.
- Confident with CSS/XPath, schema.org/JSON‑LD, and HTML parsing.
- APIs: consuming REST/GraphQL (auth, pagination, backoff) and building small internal services (FastAPI or similar).
- Automation/Orchestration: Airflow/Temporal/Celery (or equivalent schedulers/queues) for scheduled runs and monitoring.
- PDF handling (requests/HEAD, hashing, size limits) and file integrity checks.
- Queues (Redis/Kafka), Docker, Linux basics; comfort with logs/metrics.
- Clear, pragmatic communication and strong ownership.
Nice to have
- Go or Node.js experience for high‑performance crawlers.
- Cloud: AWS/GCP, S3, ECS/Kubernetes; IaC basics.
- Workflow engines: Airflow/Temporal/Argo/Celery.
- Document extraction: Textract/Tika/Camelot/Tabula.
- Search/analytics: Elasticsearch/OpenSearch; warehousing (Snowflake/Postgres).
- LLM‑assisted selector generation with deterministic verification (optional).
How we work
- Ship in small, measurable increments.
- Track coverage and freshness as north‑star metrics.
- Prefer simple designs that are easy to operate at scale.
Compensation
Competitive; please include your expected CTC (INR LPA) and any variable/benefits expectations.
Application
Please apply with your resume and links to relevant repos or code samples. Include concise notes on:
- a crawler you ran at 100+ sites/day (or similar scale),
- how you handled rate limits/retries, and
- your approach to PDF discovery/dedup.