
Associate Manager SRE
4 weeks ago
Overview
We are seeking a self-driven, inquisitive Site Reliability Engineer (SRE) to drive reliability, availability, performance, and security across our global digital product ecosystem. This role is central to ensuring a seamless and resilient experience for our users by blending deep engineering expertise with operational excellence and automation.
You will be part of a global SRE practice supporting a portfolio of 260+ modern cloud-native applications across consumer, commercial, supply chain, and enablement functions. Your mission: prevent incidents before they occur, ensure rapid recovery when they do, and build scalable systems that evolve with our growing business.
Responsibilities
- Champion reliability, observability, and operational excellence across mission-critical applications.
- Develop and maintain service-level indicators (SLIs), objectives (SLOs), and error budgets to measure and improve system performance.
- Implement automated monitoring, alerting, and recovery mechanisms to reduce manual intervention and improve response times.
- Collaborate closely with software engineering, platform, and operations teams to embed SRE practices across the development lifecycle.
- Lead and participate in incident response, root cause analysis, and postmortem reviews to drive long-term improvements.
- Identify and eliminate sources of toil through automation, tooling, and process refinement.
- Continuously improve resiliency design, capacity planning, and release management in production systems.
- Influence engineering teams with best practices on cloud-native architecture, observability, and deployment strategies.
Qualifications
Required Skills:
- 5+ years of experience in production engineering, DevOps, or SRE roles.
- Strong foundation in Linux systems, networking, and cloud platforms (Azure, AWS, or GCP).
- Hands-on experience with observability tools (e.g., AppDynamics, Prometheus, Grafana, ELK, FullStory).
- Proficiency in scripting or programming (e.g., Python, Bash, Go) and automation frameworks (e.g., Ansible, Terraform).
- Deep understanding of CI/CD pipelines, release strategies, and deployment automation.
- Experience in managing high-scale, distributed systems in cloud-native environments.
- Strong analytical skills and a passion for continuous improvement.
Preferred Skills:
- Familiarity with microservices, Kubernetes, containers, and service mesh architecture.
- Exposure to incident and problem management frameworks (e.g., ITIL, RCA practices).
- Experience working in global teams supporting mission-critical applications.
-
Senior Systems Engineer II
3 weeks ago
Hyderabad, Telangana, India Marriott Tech Accelerator Full time
Position Summary: The Senior Site Reliability Engineer (SRE) is responsible for the reliability, scalability, and performance of mission-critical cloud and on-prem services that support millions of Marriott customers globally. This role involves overseeing incident management, driving automation efforts, and working closely with cross-functional teams to...
-
Hyderabad, India SID Global Solutions Full time
Job Role: Site Reliability Engineer (SRE) – GCP. Experience: 3+ years. Location: Hyderabad. About SIDGS: SIDGS is a premium global systems integrator and global implementation partner of Google, providing digital solutions and services to Fortune 500 companies. Our digital solutions span the following domains: User Experience, CMS, API Management,...
-
Software Engineer
1 day ago
Hyderabad, Telangana, India Jobted IN C2 Full time
Company: Atlas Consolidated Pte Ltd. Role: Software Engineer - DevOps. Experience: Min 2 years. Job Type: Full-Time, Permanent. Location: Hyderabad, India. Work From Office. Hello and welcome! Atlas Consolidated Pte Ltd owns and operates two brands: Hugosave, a B2C consumer finance app, and HugoHub, a B2B Banking as a Service platform. Atlas is headquartered in Singapore 100K...
-
Associate Manager SRE
2 weeks ago
Hyderabad, Telangana, India PepsiCo Full time ₹ 20,00,000 - ₹ 25,00,000 per year
Overview
We are seeking a self-driven, inquisitive, and curious Site Reliability Engineer (SRE) to drive reliability, availability, performance, and security across our global digital product ecosystem. This role is central to ensuring a seamless and resilient experience for our users by blending deep engineering expertise with operational excellence and...
-
Associate Manager SRE
4 weeks ago
Hyderabad, India PepsiCo Full time
Job Description
Overview
We are seeking a self-driven, inquisitive, and curious Site Reliability Engineer (SRE) to drive reliability, availability, performance, and security across our global digital product ecosystem. This role is central to ensuring a seamless and resilient experience for our users by blending deep engineering expertise with operational...
-
SRE Design
4 weeks ago
Hyderabad, Telangana, India PepsiCo Full time
Overview
We are looking for a self-driven SRE with a software engineering mindset to:
- Drive new shift-left activities critical to applying Site Reliability Engineering (SRE) and quality assurance principles within the application design / project roadmap that enables resilient outcomes
- Apply a pre-emptive approach into production, minimizing business impact,...
-
SRE
2 weeks ago
Hyderabad, India Virtusa Full time
SRE - CREQ. Description: BI Tools, API & Batch monitoring support. Responsibilities: 1. Troubleshoot recurring failures and participate in incident triages. 2. Troubleshoot issues from both a production and a performance standpoint. 3. Be on call to respond during app failures. 4. Monitor critical applications and services to minimize downtime and...
-
CloudOps Engineer | Hyderabad
4 weeks ago
Hyderabad, India Unison Group Full time
Cloud Operations Engineer
Key Responsibilities
Operational Excellence & SRE
- Drive Site Reliability Engineering (SRE) practices, including SLIs, SLOs, SLAs, error budgets, and automation of operational tasks.
- Manage incident response, root cause analysis, and post-incident reviews to strengthen platform resilience.
- Build and optimize observability...
-
Hyderabad, India NuStudio.ai Full time
We’re Hiring: Cloud & Infrastructure Engineers (3–8 Years | Hyderabad, IN). At NuStudio.ai, we’re building next-gen, AI-powered data platforms and immersive products that power real-time intelligence across industries. We’re now expanding our Cloud & Infra team, the builders behind the scenes who make it all run fast, secure, and scalable. What...