
Site Reliability Engineer AIML - Associate
6 days ago
Job Description
Are you looking for an exciting opportunity to join a dynamic and growing team in a fast-paced and challenging area? This is a unique opportunity for you to work in our team to partner with the Business to provide a comprehensive view.
As a Senior AI Reliability Engineer at JPMorgan Chase within the Technology and Operations division, you will join our dynamic team of innovators and technologists. Your mission will be to enhance the reliability and resilience of AI systems that revolutionize how the Bank services and advises clients. You will focus on ensuring the robustness and availability of AI models, deepening client engagements, and promoting process transformation. We seek team members passionate about leveraging advanced reliability engineering practices, AI observability, and incident response strategies to solve complex business challenges through high-quality, cloud-centric software delivery.
Job Responsibilities:
- Develop and refine Service Level Objectives (including metrics such as accuracy, fairness, latency, drift targets, TTFT (Time To First Token), and TPOT (Time Per Output Token)) for large language model serving and training systems, balancing availability and latency with development velocity
- Design, implement, and continuously improve monitoring systems covering availability, latency, and other salient metrics
- Collaborate in the design and implementation of high-availability language model serving infrastructure capable of handling the needs of high-traffic internal workloads
- Champion site reliability culture and practices, providing technical leadership and influence within your team and across teams to foster a culture of reliability and resilience
- Develop and manage automated failover and recovery systems for model serving deployments across multiple regions and cloud providers
- Develop AI Incident Response playbooks for AI-specific failures like sudden drift or bias spikes, including automated rollbacks and AI circuit breakers.
- Lead incident response for critical AI services, ensuring rapid recovery and systematic improvements from each incident
- Build and maintain cost optimization systems for large-scale AI infrastructure, ensuring efficient resource utilization without compromising performance.
- Engineer for Scale and Security, leveraging techniques like load balancing, caching, optimized GPU scheduling, and AI Gateways for managing traffic and security.
- Collaborate with ML engineers to ensure seamless integration and operation of AI infrastructure, bridging the gap between development and operations.
- Implement Continuous Evaluation, including pre-deployment, pre-release, and continuous post-deployment monitoring for drift and degradation.
Required qualifications, capabilities, and skills:
- Demonstrated proficiency in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices
- Proficient knowledge of and experience with observability, such as white-box and black-box monitoring, service level objective alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, and others
- Proficient with continuous integration and continuous delivery tools like Jenkins, GitLab, or Terraform
- Proficient with containers and container orchestration (ECS, Kubernetes, Docker)
- Experience with troubleshooting common networking technologies and issues
- Understand the unique challenges of operating AI infrastructure, including model serving, batch inference, and training pipelines
- Have proven experience implementing and maintaining SLO/SLA frameworks for business-critical services
- Comfortable working with both traditional metrics (latency, availability) and AI-specific metrics (model performance, training convergence)
- Can effectively bridge the gap between ML engineers and infrastructure teams
- Have excellent communication skills
Preferred qualifications, capabilities, and skills:
- Experience with AI-specific observability tools and platforms, such as OpenTelemetry and OpenInference.
- Familiarity with AI incident response strategies, including automated rollbacks and AI circuit breakers.
- Knowledge of AI-centric SLOs/SLAs, including metrics like accuracy, fairness, drift targets, TTFT (Time To First Token), and TPOT (Time Per Output Token).
- Expertise in engineering for scale and security, including load balancing, caching, optimized GPU scheduling, and AI Gateways.
- Experience with continuous evaluation processes, including pre-deployment, pre-release, and post-deployment monitoring for drift and degradation.
- Understand ML model deployment strategies and their reliability implications
- Have contributed to open-source infrastructure or ML tooling
- Have experience with chaos engineering and systematic resilience testing