MLOps Observability Engineer

4 days ago


Hyderabad, Telangana, India | INTRAEDGE TECHNOLOGIES PRIVATE LIMITED | Full time | ₹ 6,00,000 - ₹ 12,00,000 per year

The MLOps Observability Engineer will design, implement, and maintain comprehensive monitoring, logging, and tracing solutions for our entire ML platform and production models. This includes building automated systems to detect model decay, data drift, and infrastructure performance issues, ensuring that our AI/ML applications remain reliable and scalable and continue to deliver business value.

II. Key Responsibilities

A. MLOps Monitoring and Model Health

- Design and Implement Model Observability : Define and implement model performance monitoring metrics (e.g., accuracy, precision, recall, RMSE, AUC) and business-impact metrics for deployed ML models.

- Data and Concept Drift Detection : Build and automate data quality and validation checks to continuously monitor for data drift, concept drift, and data integrity issues that could degrade model performance.

- Explainability and Fairness Monitoring : Implement tools and techniques for model interpretability and explainability (XAI), track feature importance, and monitor for potential bias or fairness issues in production.

- Alerting and Triage : Establish clear, actionable alerting thresholds for model and infrastructure degradation, integrating with incident management workflows for quick triage and resolution.

B. Observability Platform and Infrastructure

- Telemetry Pipeline Development : Design, deploy, and manage robust observability pipelines to collect, aggregate, and route the three pillars of observability (metrics, logs, and traces) from the ML platform and inference services.

- Dashboarding and Visualization : Create insightful, real-time dashboards that track SLIs/SLOs and provide a clear, unified view of the ML system's health, from infrastructure load to model prediction quality.

- Infrastructure-as-Code (IaC) for Observability : Use IaC tools to provision and manage the monitoring and logging infrastructure across cloud environments.

- Cost Optimization : Monitor telemetry data costs, implementing smart sampling and retention policies to ensure efficient use of observability tools.

C. Automation and CI/CD

- Automated Retraining Triggers : Use observability signals (such as performance drops or data drift alerts) to automatically trigger the ML pipeline (CI/CD) for model retraining, testing, and redeployment.

- Reproducibility and Auditing : Ensure that model monitoring and all MLOps processes are fully reproducible, traceable, and adhere to governance and regulatory standards.

- Collaboration and Consultation : Work closely with Data Scientists and ML Engineers to instrument new models for observability from the ground up, educating them on best practices for monitoring and logging.

III. Technical Skills and Qualifications

A. Programming and Scripting

- Expert proficiency in Python is required, including libraries for data manipulation and ML (e.g., NumPy, Pandas, Scikit-learn).

- Strong shell scripting (Bash/Zsh) skills; experience with other languages such as Go or Java is a plus.

B. MLOps Tools and Frameworks

- ML Frameworks : Familiarity with TensorFlow, PyTorch, or Scikit-learn to understand how models are built and served.

- MLOps Platforms/Tools : Hands-on experience with MLflow, Kubeflow, Data Version Control (DVC), or comparable solutions for experiment tracking and model registry.

- Orchestration : Experience with pipeline orchestration tools like Airflow, Kubeflow Pipelines, or Argo Workflows.

C. Cloud and Containerization

- Cloud Platforms : Deep working knowledge of one or more major cloud providers (AWS, GCP, or Azure) and their ML services (e.g., AWS SageMaker, Google AI Platform, Azure ML).

- Containerization & Orchestration : Expertise with Docker and Kubernetes for deploying and managing production ML services.

- Infrastructure as Code (IaC) : Proficiency with Terraform or Ansible for infrastructure automation.

D. Monitoring and Observability Stack

- Metrics & Time-Series : Expertise with Prometheus and Grafana for collecting, querying, and visualizing time-series data.

- Logging & Tracing : Experience with centralized logging solutions (ELK Stack/Elasticsearch, Loki, Splunk) and distributed tracing tools (Jaeger, Zipkin, OpenTelemetry).

- Model Monitoring Tools : Experience with specialized model performance monitoring tools like Evidently AI, Seldon Core, or similar internal/commercial tools.

IV. Education and Experience

Education : Bachelor's or Master's degree in Computer Science, Software Engineering, Data Science, or a related technical field.

Experience : 3 years of experience in an MLOps, DevOps, SRE, or Observability-focused engineering role, with at least 1-2 years dedicated to production ML systems.

Soft Skills : Excellent problem-solving and analytical skills, plus strong communication skills for collaborating effectively with cross-functional teams (Data Science, Software Engineering, Product).

