Senior Web Scraping Engineer

1 week ago


Alleppey, Kerala, India TripleChoice Inc Full time
Senior Web Scraping Engineer (Python) — India (Remote)

Employment type: Full-time (open to contract-to-hire)

Work location: Remote in India

Time overlap: Prefer 2–3 hours/day with Pacific Time (PST)

About the role

We're building a high-throughput product data ingestion pipeline across hundreds of domains. You'll own the crawling/extraction layer end‑to‑end: HTTP-first crawling with a Playwright fallback, per‑domain learned selectors, and reliable PDF handling (datasheets/specs). You'll also drive the automation around scheduling, retries, and monitoring so runs are hands‑off, and you'll integrate vendor/public APIs (REST/GraphQL) wherever available to complement crawling.

This role spans crawling (discovering & fetching pages via sitemaps/robots) and scraping (extracting structured specs, images, and PDFs into our schema).

What you'll do
  • Design an HTTP-first crawler (Scrapy or aiohttp) with Playwright fallback only for JS‑heavy pages.
  • Implement sitemap diffing and conditional GETs (ETag/Last-Modified) for incremental runs.
  • Build a lightweight "needs JS?" classifier (HTML length, JSON‑LD presence, data‑product markers) to auto‑route HTTP vs Playwright.
  • Enforce per-domain throttles/backoff (2–4 concurrent/domain; auto‑lower on 429/503).
  • Add URL normalization/canonicalization and de‑dup (respect ; hash PDFs).
  • Handle PDF discovery & download (HEAD first to dedupe; size/concurrency caps; SHA‑256 keys).
  • Apply Playwright browser automation resource budgets (block images/fonts/analytics; kill outliers by size/CPU/time).
  • Integrate third‑party APIs (REST/GraphQL) as first‑class sources: handle auth (API keys/OAuth2), pagination, and rate limits; unify API + crawl outputs.
  • Own automation & orchestration for scheduled runs (Airflow/Temporal/Celery or cron), idempotent retries, and alerting.
  • Create per‑domain selectors (YAML) with verification on hold‑outs; re‑learn only when health drops.
  • Ship observability: per‑site field coverage, error rates, retries, avg page time, and PDF success.
  • Maintain allow/deny paths; adhere to robots.txt and Terms of Service.
  • Containerize workers; provide runbooks/CI; collaborate with data team on schemas/normalization.
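
To make the "needs JS?" routing bullet above concrete, here is a minimal sketch of such a classifier. It is an illustration only: the function name `needs_js`, the `MIN_HTML_CHARS` threshold, the marker patterns, and the "at least two signals" rule are all assumptions, not a description of the actual pipeline.

```python
import re

# Hypothetical threshold; a real value would be tuned per domain.
MIN_HTML_CHARS = 5_000

# Signal: a JSON-LD script tag is present in the static HTML.
JSON_LD_RE = re.compile(
    r'<script[^>]+type=["\']application/ld\+json["\']', re.IGNORECASE
)
# Signal: product markers such as data-product* attributes (assumed naming).
PRODUCT_MARKER_RE = re.compile(r'data-product', re.IGNORECASE)


def needs_js(html: str) -> bool:
    """Return True when the static HTML looks too thin to extract from.

    Three cheap signals are checked: document length, JSON-LD presence,
    and product markers. If fewer than two hold, the page is routed to
    the browser fallback (an assumed rule, chosen for simplicity).
    """
    signals = (
        len(html) >= MIN_HTML_CHARS,
        bool(JSON_LD_RE.search(html)),
        bool(PRODUCT_MARKER_RE.search(html)),
    )
    return sum(signals) < 2


# Usage: a static product page versus an empty single-page-app shell.
static_page = (
    '<html><script type="application/ld+json">{}</script>'
    + '<div data-product-id="1">spec</div>' * 10
    + "</html>"
)
spa_shell = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'

route_static = "playwright" if needs_js(static_page) else "http"
route_shell = "playwright" if needs_js(spa_shell) else "http"
```

Because the check uses only string length and two regular expressions, it can run on every HTTP response before any browser is started.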
Must‑have qualifications
  • 4+ years Python, including 2+ years building production web crawlers at scale.
  • Strong with Scrapy or aiohttp/asyncio and Playwright (or Puppeteer) in production.
  • Practical proxy management, polite anti‑bot tactics, and per‑domain rate limiting.
  • Hands‑on with ETag/Last-Modified, retries, backoff, and HTTP caching.
  • Confident with CSS/XPath, schema.org/JSON‑LD, and HTML parsing.
  • APIs: consuming REST/GraphQL (auth, pagination, backoff) and building small internal services (FastAPI or similar).
  • Automation/Orchestration: Airflow/Temporal/Celery (or equivalent schedulers/queues) for scheduled runs and monitoring.
  • PDF handling (requests/HEAD, hashing, size limits) and file integrity checks.
  • Queues (Redis/Kafka), Docker, Linux basics; comfort with logs/metrics.
  • Clear, pragmatic communication and strong ownership.
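
As a rough picture of the per-domain rate limiting mentioned above, the sketch below tracks an allowed concurrency level per domain and lowers it on 429/503 responses. The class name `DomainThrottle`, the floor of 1, and the recovery-after-a-success-streak behaviour are assumptions made for illustration; the posting itself only specifies 2–4 concurrent requests per domain with automatic lowering.

```python
from collections import defaultdict


class DomainThrottle:
    """Track an allowed concurrency level per domain (hypothetical helper)."""

    def __init__(self, start=4, floor=1, ceiling=4, recover_after=50):
        self.floor = floor                  # assumed lower bound
        self.ceiling = ceiling              # upper end of the 2-4 range
        self.recover_after = recover_after  # assumed success streak before raising
        self._limits = defaultdict(lambda: start)
        self._ok_streak = defaultdict(int)

    def limit(self, domain: str) -> int:
        """Current number of concurrent requests allowed for this domain."""
        return self._limits[domain]

    def record(self, domain: str, status: int) -> None:
        """Update the limit for a domain from one response status code."""
        if status in (429, 503):
            # Server is pushing back: drop one slot and reset the streak.
            self._limits[domain] = max(self.floor, self._limits[domain] - 1)
            self._ok_streak[domain] = 0
        elif 200 <= status < 400:
            self._ok_streak[domain] += 1
            if self._ok_streak[domain] >= self.recover_after:
                # Sustained success: cautiously give one slot back.
                self._limits[domain] = min(self.ceiling, self._limits[domain] + 1)
                self._ok_streak[domain] = 0


# Usage: one domain gets throttled, another is unaffected.
throttle = DomainThrottle(recover_after=2)
throttle.record("a.example", 429)
after_429 = throttle.limit("a.example")
for _ in range(5):
    throttle.record("a.example", 503)
at_floor = throttle.limit("a.example")
throttle.record("a.example", 200)
throttle.record("a.example", 200)
recovered = throttle.limit("a.example")
untouched = throttle.limit("b.example")
```

A worker pool would consult `limit(domain)` before dispatching a request; waiting between retries (backoff) is a separate concern not shown here.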
Nice to have
  • Go or Node.js experience for high‑performance crawlers.
  • Cloud: AWS/GCP, S3, ECS/Kubernetes; IaC basics.
  • Workflow engines: Airflow/Temporal/Argo/Celery.
  • Document extraction: Textract/Tika/Camelot/Tabula.
  • Search/analytics: Elasticsearch/OpenSearch; warehousing (Snowflake/Postgres).
  • LLM‑assisted selector generation with deterministic verification (optional).
How we work
  • Ship in small, measurable increments.
  • Track coverage and freshness as north‑star metrics.
  • Prefer simple designs that are easy to operate at scale.
Compensation

Competitive; please include your expected CTC (INR LPA) and any variable/benefits expectations.

Application

Please apply with your resume and links to relevant repos or code samples. Include concise notes on:

  1. a crawler you ran at 100+ sites/day (or similar scale),
  2. how you handled rate limits/retries, and
  3. your approach to PDF discovery/dedup.

