Site Reliability Engineer II

5 days ago

Bengaluru, Karnataka, India
American Express
Full time
₹ 20,00,000 - ₹ 25,00,000 per year

You Lead the Way. We've Got Your Back.

With the right backing, people and businesses have the power to progress in incredible ways. When you join Team Amex, you become part of a global and diverse community of colleagues with an unwavering commitment to back our customers, communities and each other. Here, you'll learn and grow as we help you create a career journey that's unique and meaningful to you with benefits, programs, and flexibility that support you personally and professionally.

At American Express, you'll be recognized for your contributions, leadership, and impact—every colleague has the opportunity to share in the company's success. Together, we'll win as a team, striving to uphold our company values and powerful backing promise to provide the world's best customer experience every day. And we'll do it with the utmost integrity, and in an environment where everyone is seen, heard and feels like they belong.

Join Team Amex and let's lead the way together.

How will you make an impact in this role?

We are seeking an experienced Site Reliability Engineer to join our Generative AI infrastructure team. This role focuses on ensuring the reliability, scalability, and performance of our RAG (Retrieval-Augmented Generation) systems and agentic AI architectures. The ideal candidate will have 5+ years of SRE experience with specialized expertise in AI/ML infrastructure, particularly in production deployment and operation of large language models, vector databases, and autonomous agent systems.

Key Responsibilities

AI Infrastructure Management & Reliability

Design, deploy, and maintain highly available RAG pipelines including vector databases, embedding services, and LLM inference infrastructure
Ensure reliable operation of agentic AI systems including multi-agent orchestration platforms, tool integration frameworks, and decision-making workflows
Implement comprehensive monitoring and observability for AI model performance, token usage, latency, and accuracy metrics
Lead incident response for AI system outages, including model degradation, vector search failures, and agent execution issues

RAG System Operations

Optimize and maintain vector database infrastructure (Pinecone, Weaviate, Chroma, or similar) for high-performance similarity search at scale
Manage embedding model deployments and ensure consistent document ingestion pipelines with proper chunking and preprocessing
Implement retrieval quality monitoring, including relevance scoring and context window optimization
Design and maintain hybrid search systems combining vector and traditional search methodologies

Agentic Architecture Reliability

Build and maintain infrastructure for autonomous agent systems including planning, reasoning, and tool execution frameworks
Implement robust error handling and fallback mechanisms for agent decision chains and multi-step workflows
Monitor and optimize agent performance metrics including success rates, execution time, and resource utilization
Ensure secure and reliable integration between agents and external APIs, databases, and services

MLOps & Platform Engineering

Develop Infrastructure as Code solutions for AI/ML workloads including GPU clusters, model serving infrastructure, and data pipelines
Build automated deployment pipelines for LLM fine-tuning, RAG system updates, and agent workflow modifications
Implement A/B testing frameworks for AI system improvements and model version management
Design capacity planning and auto-scaling solutions for variable AI workloads and inference demands

Required Skills & Experience

Generative AI & ML Infrastructure

5+ years of SRE/DevOps experience with 2+ years specifically focused on AI/ML production systems
Deep hands-on experience with RAG architecture implementation including vector databases, embedding models, and retrieval systems
Proven experience with agentic AI frameworks (LangChain, LlamaIndex, AutoGPT, CrewAI, or similar) and multi-agent orchestration
Strong understanding of LLM deployment and optimization including model serving frameworks (vLLM, TensorRT-LLM, Triton) and GPU infrastructure management

Vector & Search Technologies

Proficiency with vector database technologies (pgvector, Pinecone, Weaviate, Qdrant, Chroma, Milvus) and their operational requirements
Experience with embedding models (OpenAI, Sentence Transformers, Cohere) and semantic search optimization
Knowledge of hybrid search implementations combining vector, keyword, and graph-based retrieval methods
Understanding of chunking strategies, document preprocessing, and knowledge graph integration

AI System Monitoring & Observability

Experience implementing AI-specific monitoring including model drift detection, hallucination tracking, and response quality metrics
Proficiency with MLOps tools (MLflow, Weights & Biases, Neptune) and experiment tracking systems
Knowledge of AI system debugging including prompt tracing, agent execution visualization, and performance bottleneck identification
Understanding of AI safety monitoring including content filtering, bias detection, and usage pattern analysis

Infrastructure & Cloud Platforms

Proficiency with cloud AI services (AWS SageMaker, Google Vertex AI, Azure ML) and their operational aspects
Advanced Kubernetes experience including GPU scheduling, resource quotas, and AI workload optimization
Experience with container technologies optimized for ML workloads and model serving

Programming & Automation

Proficient in Python with a deep understanding of AI/ML libraries (transformers, langchain, llama-index, torch, numpy)
Experience with Infrastructure as Code tools (Terraform, Helm) specifically for AI infrastructure provisioning
Strong API design and integration skills for AI service orchestration and tool integration
Knowledge of streaming and async processing for real-time AI applications

Specialized Experience

RAG Systems

Production experience with document ingestion pipelines, chunking strategies, and metadata management
Understanding of retrieval quality optimization including re-ranking, query expansion, and context selection
Experience with multi-modal RAG systems incorporating text, images, and structured data
Knowledge of RAG evaluation frameworks and automated quality assessment

Agentic Architecture

Hands-on experience with agent planning algorithms, tool selection mechanisms, and execution engines
Understanding of multi-agent coordination, communication protocols, and distributed agent systems
Experience with agent memory systems, state management, and long-running workflow orchestration
Knowledge of agent safety mechanisms including execution sandboxing and output validation

Preferred Qualifications

Experience with fine-tuning and RLHF (Reinforcement Learning from Human Feedback) infrastructure
Knowledge of edge AI deployment and model optimization techniques
Familiarity with AI governance, compliance frameworks, and ethical AI implementation
Experience with conversational AI platforms and dialogue management systems
Understanding of knowledge graphs and symbolic reasoning integration with neural systems

We back you with benefits that support your holistic well-being so you can be and deliver your best. This means caring for you and your loved ones' physical, financial, and mental health, as well as providing the flexibility you need to thrive personally and professionally:

Competitive base salaries
Bonus incentives
Support for financial well-being and retirement
Comprehensive medical, dental, vision, life insurance, and disability benefits (depending on location)
Flexible working model with hybrid, onsite or virtual arrangements depending on role and business need
Generous paid parental leave policies (depending on your location)
Free access to global on-site wellness centers staffed with nurses and doctors (depending on location)
Free and confidential counseling support through our Healthy Minds program
Career development and training opportunities

American Express is an equal opportunity employer and makes employment decisions without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, veteran status, disability status, age, or any other status protected by law.

Offer of employment with American Express is conditioned upon the successful completion of a background verification check, subject to applicable laws and regulations.

