Senior Web Scraping Engineer

4 weeks ago


Alleppey, Kerala, India TripleChoice Inc Full time
Senior Web Scraping Engineer (Python) — India (Remote)

Employment type: Full-time (open to contract-to-hire)

Work location: Remote in India

Time overlap: Prefer 2–3 hours/day with Pacific Time (PST)

About the role

We're building a high-throughput product data ingestion pipeline across hundreds of domains. You'll own the crawling/extraction layer end‑to‑end: HTTP-first crawling with a Playwright fallback, per‑domain learned selectors, and reliable PDF handling (datasheets/specs). You'll also drive the automation around scheduling, retries, and monitoring so runs are hands‑off, and you'll integrate vendor/public APIs (REST/GraphQL) wherever available to complement crawling.

This role spans crawling (discovering & fetching pages via sitemaps/robots) and scraping (extracting structured specs, images, and PDFs into our schema).

What you'll do
  • Design an HTTP-first crawler (Scrapy or aiohttp) with Playwright fallback only for JS‑heavy pages.
  • Implement sitemap diffing and conditional GETs (ETag/Last-Modified) for incremental runs.
  • Build a lightweight "needs JS?" classifier (HTML length, JSON‑LD presence, data‑product markers) to auto‑route HTTP vs Playwright.
  • Enforce per-domain throttles/backoff (2–4 concurrent/domain; auto‑lower on 429/503).
  • Add URL normalization/canonicalization and de‑dup (respect rel="canonical" links; hash PDFs).
  • Handle PDF discovery & download (HEAD first to dedupe; size/concurrency caps; SHA‑256 keys).
  • Apply resource budgets to Playwright browser automation (block images/fonts/analytics; kill outliers by size/CPU/time).
  • Integrate third‑party APIs (REST/GraphQL) as first‑class sources: handle auth (API keys/OAuth2), pagination, and rate limits; unify API + crawl outputs.
  • Own automation & orchestration for scheduled runs (Airflow/Temporal/Celery or cron), idempotent retries, and alerting.
  • Create per‑domain selectors (YAML) with verification on hold‑outs; re‑learn only when health drops.
  • Ship observability: per‑site field coverage, error rates, retries, avg page time, and PDF success.
  • Maintain allow/deny paths; adhere to robots.txt and Terms of Service.
  • Containerize workers; provide runbooks/CI; collaborate with data team on schemas/normalization.
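
For illustration only, the "needs JS?" routing mentioned above could start as something like the sketch below. It is not our production code: the threshold, the regular expressions, and the function name `needs_js` are hypothetical, and a real classifier would be tuned per domain. It uses the three signals named in the list (HTML length, JSON‑LD presence, data‑product markers).

```python
import re

# Hypothetical threshold and patterns; assumptions for illustration only.
MIN_HTML_CHARS = 5_000
JSON_LD_RE = re.compile(
    r'<script[^>]+type=["\']application/ld\+json["\']', re.IGNORECASE
)
PRODUCT_MARKER_RE = re.compile(r'data-product[\w-]*=', re.IGNORECASE)


def needs_js(html: str) -> bool:
    """Guess whether a page fetched over plain HTTP needs a browser render.

    Returns False when structured signals are already present in the static
    HTML, and True only when they are absent and the document is short,
    which is typical of an empty single-page-app shell.
    """
    has_json_ld = JSON_LD_RE.search(html) is not None
    has_markers = PRODUCT_MARKER_RE.search(html) is not None
    if has_json_ld or has_markers:
        return False  # static HTML already carries product data
    return len(html) < MIN_HTML_CHARS


if __name__ == "__main__":
    shell = "<html><body><div id='root'></div></body></html>"
    print(needs_js(shell))
```

A rule-based check like this is cheap enough to run on every HTTP response, so only pages that fail it would be re-queued for Playwright.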
Must‑have qualifications
  • 4+ years Python, including 2+ years building production web crawlers at scale.
  • Strong with Scrapy or aiohttp/asyncio and Playwright (or Puppeteer) in production.
  • Practical proxy management, polite anti‑bot tactics, and per‑domain rate limiting.
  • Hands‑on with ETag/Last-Modified, retries, backoff, and HTTP caching.
  • Confident with CSS/XPath, schema.org/JSON‑LD, and HTML parsing.
  • APIs: consuming REST/GraphQL (auth, pagination, backoff) and building small internal services (FastAPI or similar).
  • Automation/Orchestration: Airflow/Temporal/Celery (or equivalent schedulers/queues) for scheduled runs and monitoring.
  • PDF handling (requests/HEAD, hashing, size limits) and file integrity checks.
  • Queues (Redis/Kafka), Docker, Linux basics; comfort with logs/metrics.
  • Clear, pragmatic communication and strong ownership.
Nice to have
  • Go or Node.js experience for high‑performance crawlers.
  • Cloud: AWS/GCP, S3, ECS/Kubernetes; IaC basics.
  • Workflow engines: Airflow/Temporal/Argo/Celery.
  • Document extraction: Textract/Tika/Camelot/Tabula.
  • Search/analytics: Elasticsearch/OpenSearch; warehousing (Snowflake/Postgres).
  • LLM‑assisted selector generation with deterministic verification (optional).
How we work
  • Ship in small, measurable increments.
  • Track coverage and freshness as north‑star metrics.
  • Prefer simple designs that are easy to operate at scale.
Compensation

Competitive; please include your expected CTC (INR LPA) and any variable/benefits expectations.

Application

Please apply with your resume and links to relevant repos or code samples. Include concise notes on:

  1. a crawler you ran at 100+ sites/day (or similar scale),
  2. how you handled rate limits/retries, and
  3. your approach to PDF discovery/dedup.

