Data Ingestion Architect

2 days ago


Kottayam, Kerala, India | beBeeArchitect | Full time | ₹ 5,00,000 - ₹ 10,00,000
Job Description:

About the Role

We're building a high-throughput data ingestion pipeline that spans hundreds of domains. The Data Ingestion Architect will own the crawling/extraction layer end-to-end, designing and implementing an HTTP-first crawler with Playwright fallback, per-domain learned selectors, and reliable PDF handling.

This role spans crawling (discovering and fetching pages via sitemaps and robots.txt) and scraping (extracting structured specs, images, and PDFs into our schema). The successful candidate will drive automation around scheduling, retries, and monitoring so that runs are hands-off, and will integrate vendor/public APIs wherever available to complement crawling.

Key Responsibilities:

  • Design an HTTP-first crawler (Scrapy or aiohttp) with Playwright fallback only for JS-heavy pages.
  • Implement sitemap diffing and conditional GETs (ETag/Last-Modified) for incremental runs.
  • Build a lightweight 'needs JS?' classifier (HTML length, JSON-LD presence, data-product markers) to auto-route HTTP vs Playwright.
  • Enforce per-domain throttles/backoff (2–4 concurrent/domain; auto-lower on 429/503).
  • Add URL normalization/canonicalization and de-dup (respect canonical link tags; hash PDFs).
  • Handle PDF discovery & download (HEAD first to dedupe; size/concurrency caps; SHA-256 keys).
  • Apply resource budgets to Playwright browser automation (block images/fonts/analytics; kill outliers by size/CPU/time).
  • Integrate third-party APIs (REST/GraphQL) as first-class sources: handle auth (API keys/OAuth2), pagination, and rate limits; unify API + crawl outputs.
  • Own automation & orchestration for scheduled runs (Airflow/Temporal/Celery or cron), idempotent retries, and alerting.
  • Create per-domain selectors (YAML) with verification on hold-outs; re-learn only when health drops.
  • Ship observability: per-site field coverage, error rates, retries, avg page time, and PDF success.
  • Maintain allow/deny paths; adhere to robots.txt and Terms of Service.
  • Containerize workers; provide runbooks/CI; collaborate with the data team on schemas/normalization.
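
To give a flavor of the work, the 'needs JS?' routing idea from the list above could look roughly like the sketch below. It is only an illustration: the threshold `MIN_HTML_CHARS`, the two regular expressions, and the function name `needs_js` are assumptions made for this example, not part of our codebase, and real thresholds would be tuned per domain.

```python
import re

# Hypothetical threshold: static HTML shorter than this is treated as a JS shell.
MIN_HTML_CHARS = 5_000

# Signals named in the responsibilities: JSON-LD presence and data-product markers.
JSON_LD_RE = re.compile(r'<script[^>]+type=["\']application/ld\+json["\']', re.I)
DATA_PRODUCT_RE = re.compile(r'\bdata-product[\w-]*\s*=', re.I)


def needs_js(html: str, min_chars: int = MIN_HTML_CHARS) -> bool:
    """Return True when HTTP-fetched HTML should be re-fetched with Playwright."""
    has_json_ld = bool(JSON_LD_RE.search(html))
    has_product_markers = bool(DATA_PRODUCT_RE.search(html))
    if has_json_ld or has_product_markers:
        # Structured data is already in the static HTML, so plain HTTP suffices.
        return False
    # No structured signals: treat very short documents as empty JS shells.
    return len(html) < min_chars


shell = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
static = '<html><script type="application/ld+json">{"@type":"Product"}</script></html>'
print(needs_js(shell), needs_js(static))
```

A long page with no structured signals is routed to HTTP in this sketch; a production classifier would likely add more signals and fall back to Playwright when extraction coverage drops.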

Required Skills:

Crawling and Scraping:

  • Experience with Scrapy or aiohttp crawlers.
  • Familiarity with Playwright and its capabilities.
  • Understanding of sitemap diffing and conditional GETs.
  • Ability to implement URL normalization and canonicalization.
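
As an illustration of the URL normalization skill mentioned above, here is a minimal sketch using only the Python standard library. The set of tracking parameters and the function name `normalize_url` are assumptions chosen for this example. The sketch also drops any user-info part of the URL and does not handle IPv6 host literals.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of tracking parameters to strip; extend per domain as needed.
TRACKING_PARAMS = {
    "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
    "gclid", "fbclid",
}
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Lowercase scheme/host, drop default ports, fragments and tracking params, sort the query."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it differs from the scheme's default.
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    pairs = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in TRACKING_PARAMS
    ]
    query = urlencode(sorted(pairs))
    # The fragment is discarded because it never reaches the server.
    return urlunsplit((scheme, netloc, path, query, ""))


seen = set()
for raw in ["HTTPS://Example.com:443/a/b?utm_source=x&b=2&a=1#frag",
            "https://example.com/a/b?a=1&b=2"]:
    seen.add(normalize_url(raw))
print(seen)  # both inputs collapse to a single normalized URL
```

Normalizing before de-duplication means the two spellings above count as one page in the crawl frontier.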

Data Integration and Automation:

  • Proficiency in integrating third-party APIs.
  • Knowledge of Airflow/Temporal/Celery or cron for automation and orchestration.
  • Experience with idempotent retries and alerting.
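
For candidates wondering what "idempotent retries" means in practice, the sketch below shows one possible shape: a deterministic task key so reruns skip finished work, plus exponential backoff with jitter. All names (`task_key`, `run_idempotent`, the in-memory `done` store, the `run_id` format) are hypothetical and chosen for this example; a real system would persist completed keys in a database or the orchestrator's state.

```python
import hashlib
import random
import time
from typing import Callable, Dict


def task_key(url: str, run_id: str) -> str:
    """Deterministic key: the same URL in the same run always maps to the same key."""
    return hashlib.sha256(f"{run_id}|{url}".encode("utf-8")).hexdigest()


def run_idempotent(
    url: str,
    run_id: str,
    fetch: Callable[[str], bytes],
    done: Dict[str, bytes],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Fetch a URL at most once per run, retrying transient failures with backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    key = task_key(url, run_id)
    if key in done:
        return done[key]  # already completed, so a rerun is a no-op
    for attempt in range(1, max_attempts + 1):
        try:
            result = fetch(url)
        except Exception:
            if attempt == max_attempts:
                raise  # surface the final failure to the caller or alerting hook
            # Exponential backoff with jitter before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay))
        else:
            done[key] = result
            return result
    raise AssertionError("loop always returns or raises")


calls = {"n": 0}


def flaky(url: str) -> bytes:
    """Stand-in fetcher that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return b"ok"


done: Dict[str, bytes] = {}
first = run_idempotent("https://example.com/a", "2024-01-01", flaky, done, sleep=lambda s: None)
second = run_idempotent("https://example.com/a", "2024-01-01", flaky, done, sleep=lambda s: None)
print(first, second, calls["n"])
```

The second call returns the stored result without touching the network, which is the property that makes scheduled reruns safe.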

Collaboration and Communication:

  • Able to work collaboratively with the data team on schemas and normalization.
  • Excellent communication skills for effective collaboration.

Benefits:

  • Competitive salary.
  • Opportunities for professional growth and development.
  • A collaborative and dynamic work environment.

