Site Reliability Engineer Associate

5 days ago

Bengaluru, Karnataka, India hackajob Full time ₹ 12,00,000 - ₹ 36,00,000 per year

hackajob is collaborating with J.P. Morgan to connect them with exceptional tech professionals for this role.
Are you looking for an exciting opportunity to join a dynamic and growing team in a fast-paced and challenging area? This is a unique opportunity for you to work in our team to partner with the Business to provide a comprehensive view.

As a Senior AI Reliability Engineer at JPMorgan Chase within the Technology and Operations division, you will join our dynamic team of innovators and technologists. Your mission will be to enhance the reliability and resilience of AI systems that revolutionize how the Bank services and advises clients. You will focus on ensuring the robustness and availability of AI models, deepening client engagements, and promoting process transformation. We seek team members passionate about leveraging advanced reliability engineering practices, AI observability, and incident response strategies to solve complex business challenges through high-quality, cloud-centric software delivery.

Job Responsibilities

Develop and refine Service Level Objectives (SLOs), including metrics such as accuracy, fairness, latency, drift targets, TTFT (Time To First Token), and TPOT (Time Per Output Token), for large language model serving and training systems, balancing availability and latency with development velocity
Design, implement, and continuously improve monitoring systems covering availability, latency, and other salient metrics
Collaborate in the design and implementation of high-availability language model serving infrastructure capable of handling the needs of high-traffic internal workloads
Champion site reliability culture and practices, providing technical leadership and influence across teams to foster a culture of reliability and resilience
Champion site reliability culture and practices and exert technical influence throughout your team
Develop and manage automated failover and recovery systems for model serving deployments across multiple regions and cloud providers
Develop AI Incident Response playbooks for AI-specific failures like sudden drift or bias spikes, including automated rollbacks and AI circuit breakers.
Lead incident response for critical AI services, ensuring rapid recovery and systematic improvements from each incident.
Build and maintain cost optimization systems for large-scale AI infrastructure, ensuring efficient resource utilization without compromising performance.
Engineer for Scale and Security, leveraging techniques like load balancing, caching, optimized GPU scheduling, and AI Gateways for managing traffic and security.
Collaborate with ML engineers to ensure seamless integration and operation of AI infrastructure, bridging the gap between development and operations.
Implement Continuous Evaluation, including pre-deployment, pre-release, and continuous post-deployment monitoring for drift and degradation.

Required Qualifications, Capabilities, And Skills

Demonstrated proficiency in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices
Proficient knowledge and experience in observability such as white and black box monitoring, service level objective alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, and others
Proficient with continuous integration and continuous delivery tools like Jenkins, GitLab, or Terraform
Proficient with containers and container orchestration (ECS, Kubernetes, Docker)
Experience with troubleshooting common networking technologies and issues
Understand the unique challenges of operating AI infrastructure, including model serving, batch inference, and training pipelines
Have proven experience implementing and maintaining SLO/SLA frameworks for business-critical services
Comfortable working with both traditional metrics (latency, availability) and AI-specific metrics (model performance, training convergence)
Can effectively bridge the gap between ML engineers and infrastructure teams
Have excellent communication skills

Preferred Qualifications, Capabilities, And Skills

Experience with AI-specific observability tools and platforms, such as OpenTelemetry and OpenInference.
Familiarity with AI incident response strategies, including automated rollbacks and AI circuit breakers.
Knowledge of AI-centric SLOs/SLAs, including metrics like accuracy, fairness, drift targets, TTFT (Time To First Token), and TPOT (Time Per Output Token).
Expertise in engineering for scale and security, including load balancing, caching, optimized GPU scheduling, and AI Gateways.
Experience with continuous evaluation processes, including pre-deployment, pre-release, and post-deployment monitoring for drift and degradation.
Understand ML model deployment strategies and their reliability implications
Have contributed to open-source infrastructure or ML tooling
Have experience with chaos engineering and systematic resilience testing

Site Reliability Engineering

2 weeks ago

Bengaluru, Karnataka, India Viraaj HR Solutions Private Limited Full time ₹ 12,00,000 - ₹ 36,00,000 per year

Site Reliability Engineer (SRE). About The Opportunity: A fast-growing organization in the Enterprise Cloud Infrastructure & SaaS sector delivering highly available, mission-critical services to enterprise customers. We are hiring an on-site Site Reliability Engineer in India to own reliability, automation, and operational excellence across cloud-native...
Site Reliability Engineer

2 days ago

Bengaluru, Karnataka, India RBS Full time ₹ 8,00,000 - ₹ 16,00,000 per year

Join us as a Site Reliability Engineer. In this key role, you'll support the improvement of non-functional and operational characteristics such as availability, performance, efficiency, change management, monitoring, security, incident response, and capacity planning of our products and services. You'll enjoy significant stakeholder interaction, working in...
Associate Site Reliability Engineer

2 weeks ago

Bengaluru, Karnataka, India Alteryx Full time ₹ 6,00,000 - ₹ 18,00,000 per year

We're looking for problem solvers, innovators, and dreamers who are searching for anything but business as usual. Like us, you're a high performer who's an expert at your craft, constantly challenging the status quo. You value inclusivity and want to join a culture that empowers you to show up as your authentic self. You know that success hinges on...
Site Reliability Engineer

2 days ago

Bengaluru, Karnataka, India Walmart Full time ₹ 12,00,000 - ₹ 36,00,000 per year

About Team: Transactional System provides core transactional systems to enable segment and technology partners in creating wonderful omni experiences with speed and leverage. We are a highly motivated group of engineers, working in an agile group to solve sophisticated and high impact problems. This role is part of the Cloud Powered Checkout team and will build...
Site Reliability Engineer

1 week ago

Bengaluru, Karnataka, India Empower Full time ₹ 12,00,000 - ₹ 36,00,000 per year

Our vision for the future is based on the idea that transforming financial lives starts by giving our people the freedom to transform their own. We have a flexible work environment, and fluid career paths. We not only encourage but celebrate internal mobility. We also recognize the importance of purpose, well-being, and work-life balance. Within Empower and...
Site Reliability Engineer

2 days ago

Bengaluru, Karnataka, India Karix Full time ₹ 13,00,000 - ₹ 34,00,000 per year

Role: Site Reliability Engineer. Location: Bangalore (WFO). About the role: We are seeking an experienced professional Site Reliability Engineer who acts as a bridge between development and IT operations, taking operational tasks to ensure the efficient functioning of Service platforms. They are responsible for monitoring, automating, and improving the reliability,...
Site Reliability Engineer, AVP

2 days ago

Bengaluru, Karnataka, India NatWest Group Full time ₹ 20,00,000 - ₹ 25,00,000 per year

Join us as a Site Reliability Engineer. You'll manage the provision of stable, resilient, reliable applications with the end goal of minimising disruption to Customer & Colleague Journeys (CCJ). We'll look to you to identify and automate manual tasks and implement observability solutions, ensuring a thorough understanding of CCJ across applications. This is a...
Site Reliability Engineer, AVP

5 days ago

Bengaluru, Karnataka, India RBS Full time ₹ 20,00,000 - ₹ 25,00,000 per year

Join us as a Site Reliability Engineer. You'll manage the provision of stable, resilient, reliable applications with the end goal of minimising disruption to Customer & Colleague Journeys (CCJ). We'll look to you to identify and automate manual tasks and implement observability solutions, ensuring a thorough understanding of CCJ across applications. This is a...
Site Reliability Engineer II

2 days ago

Bengaluru, Karnataka, India JPMorganChase Full time ₹ 6,00,000 - ₹ 12,00,000 per year

Description: As a Site Reliability Engineer II at JPMorgan Chase within Corporate Technology, you will solve complex and broad business problems with simple and straightforward solutions. Through code and cloud infrastructure, you will configure, maintain, monitor, and optimize applications and their associated infrastructure to independently decompose and...
Site Reliability Engineer II

9 hours ago

Bengaluru, Karnataka, India JPMorganChase Full time ₹ 50,00,000 - ₹ 1,05,00,000 per year

JOB DESCRIPTION: As a Site Reliability Engineer II at JPMorgan Chase within Corporate Technology, you will solve complex and broad business problems with simple and straightforward solutions. Through code and cloud infrastructure, you will configure, maintain, monitor, and optimize applications and their associated infrastructure to independently decompose and...
