Observability Engineer

1 week ago


Hyderabad, India Kiash Solutions LLP Full time

Observability Engineer (Dashboarding & Analytics Developer)

The JedAI team is at the forefront of developing cutting-edge generative AI platforms that connect to Large Language Models (LLMs), agents, knowledge bases, and Model Context Protocol (MCP) servers. Our mission is to harness the power of generative AI to deliver innovative solutions that drive efficiency, safety, and intelligence across various applications.

Job Description:

- We are seeking a highly skilled Dashboarding and Analytics Developer to join our JedAI team. In this role, you will be responsible for the visualization and development of Key Performance Indicators (KPIs) that are critical to monitoring and enhancing the performance of our generative AI systems.

- You will develop and maintain comprehensive dashboards that provide real-time insights into the performance of LLMs, Retrieval Augmented Generation (RAG) systems, safety mechanisms, and other generative AI features, as well as billing and token consumption.

- Dashboard Development: Design, develop, and maintain interactive and user-friendly dashboards for monitoring AI system performance.

- KPI Identification: Collaborate with cross-functional teams to define and implement KPIs related to LLMs, RAG systems, safety protocols, and other AI features.

- Data Visualization: Create clear and insightful visualizations that communicate complex data trends and patterns effectively to stakeholders.

- Performance Monitoring: Continuously monitor AI system metrics to identify anomalies, performance issues, and areas for improvement.

- Data Analysis: Analyze large and complex datasets to extract meaningful insights that support decision-making processes.

- Collaboration: Work closely with AI engineers, data scientists, and product managers to align dashboard functionalities with project goals.

- Innovation: Stay updated with the latest trends and technologies in data visualization and analytics to introduce innovative solutions.

- Documentation: Maintain thorough documentation of dashboard configurations, data sources, and visualization methodologies.

Details of work:

1. Performance Metrics:

- Latency and Throughput: Monitor the response times and the number of requests processed per unit time to ensure the system meets performance expectations.

- Resource Utilization: Track CPU, memory, disk I/O, and network bandwidth usage to identify bottlenecks or inefficiencies.
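
As an illustration of the latency and throughput item above, the following is a minimal sketch of how a dashboard backend might summarize raw request records. The `RequestRecord` type, its field names, and the choice of the nearest-rank percentile method are assumptions made for this example; they are not taken from the JedAI platform.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """Hypothetical record of one completed request."""
    timestamp_s: float  # completion time, in seconds
    latency_ms: float   # end-to-end response time, in milliseconds


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(records):
    """Return median and tail latency plus requests per second."""
    latencies = [r.latency_ms for r in records]
    stamps = [r.timestamp_s for r in records]
    window_s = max(stamps) - min(stamps)
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        # Throughput is undefined when all records share one timestamp.
        "throughput_rps": len(records) / window_s if window_s > 0 else float("nan"),
    }
```

In practice a monitoring tool such as Prometheus or Datadog would usually compute these aggregates; the sketch only shows what the numbers on the panel mean.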

2. Model Performance and Drift Monitoring:

- Accuracy Metrics: Keep track of model accuracy, precision, recall, F1 score, etc., to ensure the models are performing as expected.

- Data and Concept Drift Detection: Monitor for changes in the input data distribution (data drift) and in the relationship between inputs and outputs (concept drift) that could degrade model performance over time.

- Feature Importance Tracking: Observe changes in feature importance to understand and explain model predictions.
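
One common way to put a number on data drift is the Population Stability Index (PSI), which compares the binned distribution of a feature in a baseline sample with its distribution in recent data. The sketch below is one possible implementation under stated assumptions: the function name, the equal-width binning over the baseline range, and the small floor used to avoid taking the logarithm of zero are all choices made for this example.

```python
import math


def population_stability_index(baseline, current, bins=10):
    """PSI between two non-empty numeric samples, binned on the baseline range."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp values outside the baseline range
            counts[idx] += 1
        # Floor each fraction so empty bins do not produce log(0).
        return [max(c / len(values), 1e-6) for c in counts]

    base = fractions(baseline)
    cur = fractions(current)
    return sum((c - b) * math.log(c / b) for b, c in zip(base, cur))
```

A PSI of zero means the two binned distributions match. A frequently quoted rule of thumb treats values above roughly 0.25 as a significant shift, but the right alerting threshold depends on the feature and should be tuned per use case.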

3. Anomaly Detection:

- Implement systems to detect unusual patterns or outliers in data inputs, user behavior, or system performance, which could indicate errors or security issues.

4. Security Monitoring:

- Access Logs: Maintain detailed logs of user access and actions for security auditing.

- Threat Detection: Use intrusion detection systems (IDS) to identify potential security threats.

- Compliance Monitoring: Ensure adherence to regulations like GDPR, HIPAA, or other industry-specific compliance requirements.

5. User Engagement and Feedback:

- Usage Analytics: Analyze how users interact with the system to improve user experience.

- Feedback Collection: Provide mechanisms for users to report issues or suggest improvements.

- Session Tracking: Monitor user sessions to understand behavior patterns and enhance personalization.

6. Error Handling and Logging:

- Detailed Error Logs: Capture and categorize errors to facilitate quicker debugging and resolution.

- Automated Alerting: Set up alerts for critical failures and for error rates that exceed defined thresholds.
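
To make the error-rate alerting item concrete, here is a minimal sketch of a sliding-window check. The class name `ErrorRateAlert` and its parameters (`window_size`, `threshold`, `min_samples`) are hypothetical; a production setup would more likely express the same rule in the alerting configuration of a tool such as Prometheus, Datadog, or Splunk.

```python
from collections import deque


class ErrorRateAlert:
    """Tracks the most recent request outcomes and flags a high error rate."""

    def __init__(self, window_size=100, threshold=0.05, min_samples=20):
        self.outcomes = deque(maxlen=window_size)  # True means the request failed
        self.threshold = threshold
        self.min_samples = min_samples

    def error_rate(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 0.0

    def should_alert(self):
        # Require a minimum sample count so a single early failure does not page anyone.
        return (
            len(self.outcomes) >= self.min_samples
            and self.error_rate() > self.threshold
        )

    def record(self, is_error):
        """Record one outcome and report whether the alert condition now holds."""
        self.outcomes.append(bool(is_error))
        return self.should_alert()
```

The `min_samples` guard is a design choice: without it, a window holding only a few requests would produce noisy alerts right after a restart.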

7. Audit Trails and Traceability:

- Transaction Logging: Keep records of all transactions and changes in the system for accountability.

- Version Control Tracking: Monitor changes in models, code, or configurations to track the evolution of the system.

8. Data Quality Monitoring:

- Validation Checks: Ensure incoming data meets quality standards before processing.

- Missing or Corrupted Data Detection: Identify and handle incomplete or corrupted data inputs.

9. Scalability Metrics:

- Load Testing Metrics: Assess how the system performs under various load conditions to plan for scaling.

- Auto-Scaling Monitoring: Monitor the effectiveness of auto-scaling policies in cloud environments.

10. Cost Management:

- Resource Cost Analysis: Monitor the costs associated with compute, storage, and network resources to optimize spending.

- Budget Alerts: Set up alerts when spending exceeds predefined budgets.

11. Deployment and CI/CD Pipeline Monitoring:

- Deployment Success Rates: Track the success or failure of deployments.

- Pipeline Performance: Monitor the CI/CD pipeline for bottlenecks or failures.

12. Compliance and Governance:

- Policy Enforcement: Ensure data usage and model deployment adhere to organizational policies.

- Role-Based Access Control (RBAC): Implement and monitor access controls for different system components.

13. Disaster Recovery and Backup Monitoring:

- Backup Integrity Checks: Regularly verify backups to ensure data can be recovered when needed.

- Recovery Time Objectives (RTO) Monitoring: Ensure systems can be restored within acceptable time frames after outages.

14. Customer Support Integration:

- Ticketing System Integration: Monitor support tickets related to the system to identify common issues.

- Service Level Agreement (SLA) Compliance: Track metrics to ensure SLAs are being met.

15. Visualization and Reporting:

- Custom Dashboards: Create dashboards tailored to different stakeholders, such as executives, developers, and support teams.

- Scheduled Reports: Automate reporting on key metrics for regular review.

Preferred tools and skills (candidates do not need to check every box):

- Technical Domain: Experience with LLMs, Retrieval Augmented Generation (RAG) systems, safety mechanisms, other generative AI features, billing, and token consumption.

- Data Visualization Tools: Tableau, Power BI, Grafana, Splunk

- Programming Languages: Python, JQL, SPL

- Data Query Languages: SQL

- Cloud Platforms: AWS, Azure, GCP (relevant to the auto-scaling monitoring responsibilities)

- Monitoring Tools: Prometheus, Datadog, New Relic, CloudWatch (AWS), Azure Monitor

- Version Control Systems: Git

- Ticketing Systems: Jira, Zendesk, ServiceNow

- Logging Tools: ELK Stack (Elasticsearch, Logstash, Kibana), Splunk

- CI/CD Tools: Jenkins, GitLab CI, CircleCI, GitHub Actions

(ref:hirist.tech)
