High-Performance Data Extraction Specialist

2 weeks ago


Thrissur, Kerala, India beBeeDataExtraction Full time ₹ 80,00,000 - ₹ 1,20,00,000

About the Role:

We're seeking a skilled High-Performance Data Extraction Specialist to lead our data ingestion pipeline. This high-throughput product data extraction project spans crawling (discovering and fetching pages via sitemaps and robots.txt) and scraping (extracting structured specs, images, and PDFs into our schema). You will design an HTTP-first crawler (Scrapy or aiohttp) that falls back to Playwright only for JS-heavy pages.

Key responsibilities include:

  • Designing an HTTP-first crawler with Playwright fallback for JS-heavy pages.
  • Implementing sitemap diffing and conditional GETs for incremental runs.
  • Building a lightweight 'needs JS?' classifier to auto-route HTTP vs Playwright.
  • Enforcing per-domain throttles and backoff (concurrency limits per domain); automatically lowering request rates on 429/503 responses.
  • Adding URL normalization/canonicalization and de-duplication.
  • Handling PDF discovery and download with HEAD-first de-duplication, size and concurrency caps, and SHA-256 keys.
  • Applying resource budgets to Playwright browser automation: blocking images/fonts/analytics and killing outliers by size/CPU/time.
  • Integrating third-party APIs as first-class sources: handle auth, pagination, and rate limits; unify API + crawl outputs.
  • Owning automation and orchestration: scheduled runs, idempotent retries, and alerting.
  • Creating per-domain selectors verified on hold-out pages; re-learning selectors only when extraction health drops.
  • Maintaining allow/deny paths; adhering to robots.txt and Terms of Service.
  • Containerizing workers; providing runbooks/CI; collaborating with data team on schemas/normalization.
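
To make the URL normalization, de-duplication, and SHA-256 key items above concrete, here is a minimal sketch using only the Python standard library. The function names and the list of tracking parameters are illustrative assumptions, not part of our codebase; a production version would likely need per-domain rules and would also have to decide how to treat userinfo and IPv6 hosts, which this sketch does not handle.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical tracking parameters to strip; a real list would be tuned per domain.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"}


def normalize_url(url: str) -> str:
    """Return a canonical form of `url` suitable for use as a de-duplication key."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it is not the default for the scheme.
    is_default = (scheme == "http" and port == 80) or (scheme == "https" and port == 443)
    if port and not is_default:
        host = f"{host}:{port}"
    path = parts.path or "/"
    # Drop tracking parameters and sort the rest so parameter order does not matter.
    pairs = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in TRACKING_PARAMS
    ]
    query = urlencode(sorted(pairs))
    # The fragment is discarded: it is never sent to the server.
    return urlunsplit((scheme, host, path, query, ""))


def dedupe_urls(urls):
    """Yield each URL's normalized form once, preserving first-seen order."""
    seen = set()
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            yield key


def content_key(data: bytes) -> str:
    """SHA-256 hex digest of a downloaded body (e.g. a PDF), usable as a storage key."""
    return hashlib.sha256(data).hexdigest()
```

For example, `normalize_url("HTTPS://Example.com:443/a?b=2&utm_source=x&a=1#frag")` and `normalize_url("https://example.com/a?a=1&b=2")` produce the same key, so only one of the two would be fetched.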

Requirements:

  • 4+ years of Python experience, including 2+ years building production web crawlers at scale.
  • Strong expertise in Scrapy or aiohttp/asyncio and Playwright (or Puppeteer) in production.
  • Practical proxy management, polite anti-bot tactics, and per-domain rate limiting.
  • Hands-on experience with ETag/Last-Modified, retries, backoff, and HTTP caching.
  • Confidence with CSS/XPath, schema.org/JSON-LD, and HTML parsing.
  • APIs: consuming REST/GraphQL (auth, pagination, backoff) and building small internal services (FastAPI or similar).
  • Automation/Orchestration: Airflow/Temporal/Celery (or equivalent schedulers/queues) for scheduled runs and monitoring.
  • PDF handling (requests/HEAD, hashing, size limits) and file integrity checks.
  • Queues (Redis/Kafka), Docker, Linux basics; comfort with logs/metrics.
  • Clear, pragmatic communication and strong ownership.
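
As a rough illustration of the ETag/Last-Modified, backoff, and per-domain rate-limiting skills listed above, the sketch below shows the kind of small, testable helpers we have in mind. All names, defaults, and the halving policy are assumptions made for this example; `retry_after` is assumed to be already parsed into seconds (the header can also carry an HTTP date, which this sketch does not parse).

```python
import random


def conditional_headers(etag=None, last_modified=None) -> dict:
    """Build headers for a conditional GET from validators saved on a previous run."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def backoff_delay(attempt, base=1.0, cap=60.0, retry_after=None, rng=random.random):
    """Seconds to wait before retry number `attempt` (0-based).

    Prefers the server's Retry-After value when given; otherwise uses
    capped exponential backoff with full jitter.
    """
    if retry_after is not None:
        return min(cap, max(0.0, float(retry_after)))
    return rng() * min(cap, base * (2 ** attempt))


def lowered_concurrency(current: int, status: int, floor: int = 1) -> int:
    """Halve a domain's concurrency on 429/503, never going below `floor`."""
    if status in (429, 503):
        return max(floor, current // 2)
    return current
```

A server that answers a conditional GET with 304 Not Modified lets an incremental run skip re-downloading and re-parsing the page. Passing `rng` as a parameter keeps the jitter deterministic in tests.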

Benefits:

We offer a competitive compensation package, flexible work environment, and opportunities for growth and development.

How We Work:

We prioritize shipping in small, measurable increments, tracking coverage and freshness as north-star metrics, and preferring simple designs that are easy to operate at scale.

Application:

Please submit your application with your resume, links to relevant repositories or code samples, and concise notes on a crawler you ran at 100+ sites/day: how you handled rate limits and retries, and your approach to PDF discovery and de-duplication.

