Principal Site Reliability Engineer

6 days ago


Bangalore, India Zscaler Full time
Job Description:
We are seeking an experienced SRE - AI/ML Engineer with a strong background in machine learning, AI-driven solutions, and cloud/SRE operations. In this role, you will design, implement, and optimize machine learning models to enhance system observability, proactive monitoring, and root cause analysis. You will collaborate with various teams to integrate AI/ML capabilities into our tooling, monitoring platforms, and decision-making processes to build a smarter, more resilient infrastructure.
Key Responsibilities
AI/ML Model Design and Implementation: Design, implement, and manage machine learning models for predicting system failures, performance degradation, and anomalies, enabling proactive and intelligent monitoring.
AI-Driven Observability: Integrate AI/ML capabilities into existing observability tools (e.g., Prometheus, Grafana, ELK Stack) to provide actionable insights, enhance anomaly detection, and enable automated responses.
Root Cause Analysis Tools: Develop AI-driven tools for intelligent debugging and root cause analysis by analyzing large volumes of logs, metrics, and traces to quickly identify issues.
Generative AI Integration: Implement generative AI models (e.g., GPT, BERT) for content generation, anomaly detection, and scenario simulations to enhance system reliability.
Chatbots and Conversational AI: Design and build conversational AI interfaces and chatbots to provide automated insights, support, and interactions for internal teams and users.
Collaboration with Data Engineering: Work closely with data engineering teams to collect, preprocess, and manage telemetry data for model training, validation, and real-time monitoring.
Model Deployment and Monitoring: Deploy machine learning models into production environments, ensuring their performance, accuracy, and reliability through continuous monitoring.
Automation and Innovation: Identify opportunities to automate incident detection and resolution mechanisms using AI/ML technologies, reducing downtime and improving system resilience.
Continuous Learning and Improvement: Stay updated with the latest trends in AI/ML and apply them to enhance observability and monitoring solutions.
Documentation and Knowledge Sharing: Maintain comprehensive documentation of AI/ML models, tools, and best practices to ensure knowledge sharing and compliance.
Required Skills and Qualifications
Experience: 8+ years of experience in SRE, AI/ML engineering, or cloud SRE/DevOps, with a focus on machine learning and proactive, intelligent operations.
AI/ML Expertise: Proficiency in developing, training, and deploying machine learning models using frameworks such as TensorFlow, PyTorch, or Scikit-learn.
NLP and Generative AI: Experience with natural language processing (NLP) techniques and generative AI models (e.g., GPT, BERT) for content generation and chatbot development.
Programming: Strong programming skills in Python, Java, or Scala for AI/ML development and integration.
Observability Tools: Familiarity with observability tools like Prometheus, Grafana, and the ELK Stack, and experience enhancing them with AI/ML capabilities.
Data Processing: Knowledge of big data technologies (e.g., Apache Kafka, Apache Spark) and real-time data processing concepts.
Cloud Platforms: Experience with cloud platforms such as AWS, Azure, or Google Cloud for deploying and managing AI/ML solutions.
Analytical Skills: Strong analytical skills to interpret complex data, identify patterns, and solve technical challenges.
Collaboration: Excellent verbal and written communication skills for effective collaboration with cross-functional teams and stakeholders.
Why Zscaler?
• Work with cutting-edge technology in a dynamic and innovative environment.
• Enjoy a collaborative and inclusive company culture.
• Benefit from a competitive compensation and benefits package.
• Take advantage of our commitment to continuous learning and professional development.
• Join the world’s leading Software-as-a-Service Security Platform.
• Contribute to detecting 100 million threats daily across 185+ countries.
• Experience our exceptional workplace with a Glassdoor rating of 4.7/5.0 and 98% CEO approval.

  • Bangalore, India Cisco Full time

    Principal Site Reliability Engineer, CloudOps at Cisco ThousandEyes, Bangalore, India. Who We Are: The name ThousandEyes was born from two big ideas: the power to see things not ordinarily possible and the ability to collect insights from a multitude of vantage points. As the...


  • Bangalore, India Oracle Full time

    Job title: Principal Site Reliability Engineering. Job Description: Building off our Cloud momentum, Oracle has formed a new organization - Oracle Health Applications & Infrastructure. This team will focus on product development and product strategy for Oracle Health while building out a complete platform supporting modernized, automated...


  • Bangalore, India CAPCO Full time

    Principal Consultant - Site Reliability Engineer at Capco India - Bengaluru Sr SRE Consultant Joining Capco means joining an organisation that is committed to an inclusive working environment where you’re encouraged to #BeYourselfAtWork. We celebrate individuality and recognize that diversity and inclusion, in all forms, is critical to...


  • Bangalore, India CAPCO Full time

    Principal Consultant - Site Reliability Engineer at Capco India - Bengaluru Sr SRE Consultant Joining Capco means joining an organisation that is committed to an inclusive working environment where you’re encouraged to #BeYourselfAtWork. We celebrate individuality and recognize that diversity and inclusion, in all forms, is critical to success....


  • Bangalore, India Cricbuzz.com Full time

    Site Reliability Engineer We are looking for a highly skilled and motivated Web Server Site Reliability Engineer to join our team. As a Web Server Site Reliability Engineer, you will be responsible for ensuring the reliability, scalability, and performance of our web server infrastructure and CDN services. Experience - 3 - 5 years Responsibilities: ● Design,...


  • Bangalore, India Cricbuzz.com Full time

    Site Reliability Engineer We are looking for a highly skilled and motivated Web Server Site Reliability Engineer to join our team. As a Web Server Site Reliability Engineer, you will be responsible for ensuring the reliability, scalability, and performance of our web server infrastructure and CDN services. Experience - 4 - 5 years Responsibilities: ● Design,...


  • Bangalore, India Cricbuzz.com Full time

    Site Reliability Engineer We are looking for a highly skilled and motivated Web Server Site Reliability Engineer to join our team. As a Web Server Site Reliability Engineer, you will be responsible for ensuring the reliability, scalability, and performance of our web server infrastructure and CDN services. Experience - 4 - 5 years Responsibilities: ●...


  • Bangalore, India tsworks Full time

    Who We Are tsworks Technologies India Private Limited (subsidiary of The Software Works, Inc, USA) is a technology product and services company. Our mission is to provide domain expertise, innovative solutions and thought leadership to empower businesses to thrive in a digital world. We value our employees, take pride in providing best value in customer...


  • Bangalore, India Integra Connect Full time

    About IntegraConnect Integra Connect delivers a comprehensive, integrated suite of cloud-based technologies and services that enable specialty groups to optimize clinical and financial performance as reimbursement shifts to value-based models. Connected by the IntegraCloud platform, the company’s core applications span population health including care...


  • Bangalore, India Microsoft Full time

    Overview Looking to join an exciting industry and organization at the forefront of the next Tech industry transformation? Are you ready to join a team of the world’s best technical experts to enable the success of Microsoft solutions for our commercial & enterprise customers? We are seeking to build out the team of next generation Site Reliability...


  • Bangalore, India Zensar Technologies Full time

    About the Role: Site Reliability Engineer Experience: 5-8 Yrs Location: Bangalore Required Skills: Must have skills: - High level of experience using cloud log management and monitoring data platforms (Dynatrace, Azure Monitor) Hands on experience in Azure Bicep Experience working with Infrastructure as Code and Containerization tools (Terraform, Docker,...


  • Bangalore, India Cisco Full time

    Who We Are Cisco ThousandEyes is a Digital Experience Assurance platform that empowers organizations to deliver flawless digital experiences across every network – even the ones they don’t own. Powered by AI and an unmatched set of cloud, internet and enterprise network telemetry data, ThousandEyes enables IT teams to proactively...


  • Bangalore, India Cisco Full time

    Who We Are Cisco ThousandEyes is a Digital Experience Assurance platform that empowers organizations to deliver flawless digital experiences across every network – even the ones they don’t own. Powered by AI and an unmatched set of cloud, internet and enterprise network telemetry data, ThousandEyes enables IT teams to proactively detect,...


  • Bangalore, India Qure.ai Full time

    About the job Job Title: Site Reliability Engineer Department: Engineering Location: Bangalore Years of experience: 2-5 years Type: Full Time Employment About Qure.ai: Qure.ai is one of the fastest-growing startups in India, which develops Artificial Intelligence enabled products and platforms for healthcare diagnostics. We create...


  • Bangalore, India Integra Connect Full time

    About IntegraConnect Integra Connect delivers a comprehensive, integrated suite of cloud-based technologies and services that enable specialty groups to optimize clinical and financial performance as reimbursement shifts to value-based models. Connected by the IntegraCloud platform, the company’s core applications span population health including care...


  • Bangalore, India Zensar Technologies Full time

    About the Role: Site Reliability Engineer Experience: 5-8 Yrs Location: Bangalore Required Skills: Must have skills: - High level of experience using cloud log management and monitoring data platforms (Dynatrace, Azure Monitor) Hands on experience in Azure Bicep Experience working with Infrastructure as Code and Containerization tools (Terraform, Docker,...


  • Bangalore, India tsworks Full time

    Who We Are tsworks Technologies India Private Limited (subsidiary of The Software Works, Inc, USA) is a technology product and services company. Our mission is to provide domain expertise, innovative solutions and thought leadership to empower businesses to thrive in a digital world. We value our employees, take pride in providing best value in customer...