
Site Reliability Engineer Associate
5 days ago
Are you looking for an exciting opportunity to join a dynamic and growing team in a fast-paced and challenging area? This is a unique opportunity for you to work in our team to partner with the Business to provide a comprehensive view.
As a Senior AI Reliability Engineer at JPMorgan Chase within the Technology and Operations division, you will join our dynamic team of innovators and technologists. Your mission will be to enhance the reliability and resilience of AI systems that revolutionize how the Bank services and advises clients. You will focus on ensuring the robustness and availability of AI models, deepening client engagements, and promoting process transformation. We seek team members passionate about leveraging advanced reliability engineering practices, AI observability, and incident response strategies to solve complex business challenges through high-quality, cloud-centric software delivery.
Job Responsibilities:
Develop and refine Service Level Objectives for large language model serving and training systems, covering metrics such as accuracy, fairness, latency, drift targets, TTFT (Time To First Token), and TPOT (Time Per Output Token), while balancing availability and latency with development velocity
Design, implement and continuously improve monitoring systems including availability, latency and other salient metrics
Collaborate in the design and implementation of high-availability language model serving infrastructure capable of handling the needs of high-traffic internal workloads
Champion site reliability culture and practices, providing technical leadership and influence within your team and across teams to foster a culture of reliability and resilience
Develop and manage automated failover and recovery systems for model serving deployments across multiple regions and cloud providers
Develop AI Incident Response playbooks for AI-specific failures like sudden drift or bias spikes, including automated rollbacks and AI circuit breakers.
Lead incident response for critical AI services, ensuring rapid recovery and systematic improvements from each incident
Build and maintain cost optimization systems for large-scale AI infrastructure, ensuring efficient resource utilization without compromising performance.
Engineer for scale and security, leveraging techniques like load balancing, caching, optimized GPU scheduling, and AI Gateways for managing traffic and security.
Collaborate with ML engineers to ensure seamless integration and operation of AI infrastructure, bridging the gap between development and operations.
Implement Continuous Evaluation, including pre-deployment, pre-release, and continuous post-deployment monitoring for drift and degradation.
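For readers less familiar with the token-latency metrics named above, the following is a minimal sketch of how TTFT, TPOT, and SLO compliance might be computed from per-request timestamps. The `StreamTiming` structure, the function names, and the threshold values are illustrative assumptions, not a description of any system used by the team.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamTiming:
    """Timestamps in seconds for one streamed response (hypothetical structure)."""
    request_sent: float
    token_times: List[float]  # arrival time of each output token, in order


def ttft(t: StreamTiming) -> float:
    """Time To First Token: delay between sending the request and the first token.

    Assumes at least one token was produced.
    """
    return t.token_times[0] - t.request_sent


def tpot(t: StreamTiming) -> float:
    """Time Per Output Token: mean gap between tokens after the first one."""
    n = len(t.token_times)
    if n < 2:
        return 0.0  # a single token has no inter-token gap
    return (t.token_times[-1] - t.token_times[0]) / (n - 1)


def slo_compliance(values: List[float], threshold: float) -> float:
    """Fraction of requests whose metric is at or under the threshold."""
    return sum(v <= threshold for v in values) / len(values)


if __name__ == "__main__":
    sample = StreamTiming(request_sent=0.0, token_times=[0.5, 0.75, 1.0, 1.25])
    print("TTFT:", ttft(sample))
    print("TPOT:", tpot(sample))
    # Illustrative objective: share of requests with TTFT at or under 1 second.
    print("Compliance:", slo_compliance([0.5, 1.0, 2.0, 0.25], threshold=1.0))
```

In practice an objective would be stated as a target on the compliance fraction over a time window, for example a given percentage of requests meeting the TTFT threshold.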
Required qualifications, capabilities, and skills:
Demonstrated proficiency in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices
Proficient knowledge and experience in observability, including white-box and black-box monitoring, service level objective alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, and Splunk
Proficient with continuous integration and continuous delivery tools like Jenkins, GitLab, or Terraform
Proficient with containers and container orchestration (ECS, Kubernetes, Docker)
Experience with troubleshooting common networking technologies and issues
Understanding of the unique challenges of operating AI infrastructure, including model serving, batch inference, and training pipelines
Proven experience implementing and maintaining SLO/SLA frameworks for business-critical services
Comfort working with both traditional metrics (latency, availability) and AI-specific metrics (model performance, training convergence)
Ability to bridge the gap between ML engineers and infrastructure teams effectively
Excellent communication skills
Preferred qualifications, capabilities, and skills:
Experience with AI-specific observability tools and platforms, such as OpenTelemetry and OpenInference.
Familiarity with AI incident response strategies, including automated rollbacks and AI circuit breakers.
Knowledge of AI-centric SLOs/SLAs, including metrics like accuracy, fairness, drift targets, TTFT (Time To First Token), and TPOT (Time Per Output Token).
Expertise in engineering for scale and security, including load balancing, caching, optimized GPU scheduling, and AI Gateways.
Experience with continuous evaluation processes, including pre-deployment, pre-release, and post-deployment monitoring for drift and degradation.
Understanding of ML model deployment strategies and their reliability implications
Contributions to open-source infrastructure or ML tooling
Experience with chaos engineering and systematic resilience testing