
Sr. Site Reliability Engineer
2 weeks ago
*What you will do*
In this role you will help build, scale, and secure the platforms that underpin Amgen's global digital initiatives. The role focuses on ensuring the reliability, performance, and efficiency of cloud-native platforms while enabling development velocity and operational excellence.
You will be responsible for designing and operating infrastructure and shared platforms used across the enterprise, including CI/CD, observability, incident management, and collaboration systems.
You will work extensively with containerized environments, manage multi-tenant Kubernetes platforms, and automate processes to improve resilience and reduce operational burden. This role requires technical depth, leadership skills, and the ability to drive initiatives across cross-functional teams and global stakeholders.
*Roles & Responsibilities:*
Platform Reliability Engineering
- Design, operate, and scale secure, highly available cloud-based infrastructure using Infrastructure as Code (IaC).
- Handle multi-tenant container orchestration environments with advanced access controls, workload isolation, and governance policies.
- Ensure enterprise CI/CD platforms are performant, secure, and optimized for high-throughput engineering teams.
Monitoring, Observability & Incident Management
- Build and manage observability platforms for full-stack visibility, leveraging metrics, logs, and traces.
- Define, implement, and continuously refine SLIs, SLOs, and error budgets for platform health and service performance.
- Automate incident response workflows, integrate with incident management platforms, and lead post-incident reviews and root cause analysis.
Enterprise Platform Administration
- Operate and improve core engineering platforms (e.g., CI/CD, collaboration, knowledge sharing) to ensure availability, security, and ease of use.
- Automate platform provisioning, upgrades, access controls, and integration pipelines to reduce manual effort and improve consistency.
- Implement compliance, audit logging, and policy enforcement through code-driven governance models.
AI Adoption & Enablement
- Drive the adoption of AI/ML-based tools to enhance observability, incident prediction, remediation, and intelligent alerting.
- Evaluate and integrate AI-assisted automation platforms to reduce toil and improve operational efficiency.
- Partner with platform, security, and development teams to embed predictive analytics into dashboards, workflows, and root cause tooling.
- Champion a data-driven SRE practice by enabling thoughtful insights and anomaly detection across systems and platforms.
Leadership & Collaboration
- Serve as a technical thought leader and mentor within the SRE organization.
- Promote SRE principles and reliability culture across engineering teams.
- Collaborate with cross-functional stakeholders to influence architecture, roadmaps, and platform investment.
- Lead operational reviews and service health retrospectives, with a focus on continuous improvement.
- Participate in Agile and SAFe delivery processes, including sprint planning, stand-ups, retrospectives, and PI planning, to ensure security and platform reliability are embedded across development cycles.
Basic Qualifications:
- Doctorate degree / Master's degree / Bachelor's degree and 8 to 13 years of experience in Computer Science, Information Technology, or a related technical field
- Demonstrated success operating cloud-native infrastructure in production environments
- Practical experience handling Kubernetes clusters and CI/CD environments at enterprise scale
- Exposure to global on-call or incident support rotations
- Excellent collaboration and communication skills across technical and non-technical teams
Preferred Qualifications:
Must-Have Skills:
- Deep experience with cloud platforms (AWS, Azure, or GCP), including services such as compute, networking, IAM, and VPC design
- Proven proficiency in Infrastructure as Code (IaC) using tools such as Terraform or CloudFormation
- Advanced skills in managing container orchestration platforms (e.g., Kubernetes), including workload isolation, resource quotas, and role-based access control
- Strong understanding of Linux system administration, process management, and system performance tuning
- Hands-on experience with CI/CD platforms and pipelines (build automation, artifact storage, environment provisioning, rollback strategies)
- Strong background in observability tooling, including Prometheus, Grafana, Dynatrace, and distributed tracing frameworks like OpenTelemetry or Jaeger
- Strong practical experience with incident management platforms and practices (e.g., alert routing, runbooks, escalation paths)
- Automation and scripting proficiency in languages such as Python, Go, or Bash
- Experience with configuration management tools like Ansible, Chef, or SaltStack
- Strong grasp of networking fundamentals, such as routing, DNS, OSI layers, load balancing, firewalls, TLS, and security groups
- Version control and collaboration workflows using Git and GitOps principles
- Experience with enterprise collaboration platforms, including provisioning, integration, and permission control
Good-to-Have Skills:
- Exposure to service mesh technologies (e.g., Istio, Linkerd) and zero-trust network concepts
- Familiarity with secrets management platforms (e.g., HashiCorp Vault, AWS Secrets Manager)
- Experience using incident response and chaos engineering tools (e.g., Gremlin, Chaos Mesh)
- Background in cost optimization, budgeting, and resource tracking (FinOps)
- Awareness of policy-as-code frameworks (e.g., OPA, Kyverno)
- Familiarity with feature flagging and progressive delivery tools (e.g., LaunchDarkly, Argo Rollouts)
- Integration experience with ticketing and change management platforms (e.g., ServiceNow, Jira)
- Understanding of compliance standards (e.g., HIPAA, GDPR, SOC 2) and how they apply to infrastructure operations
- Understanding of security and encryption technologies and authentication protocols such as OpenID, OIDC, OAuth, SAML, and LDAP
Professional Certifications (Preferred)
- Cloud DevOps Certification (AWS/Azure/GCP)
- Certified Kubernetes Administrator (CKA) or Security Specialist (CKS)
- CI/CD Platform Certification
- ITIL Foundation or equivalent service management certification
Soft Skills:
- High level of ownership and accountability for platform reliability
- Strong diagnostic and analytical capabilities with a bias for action
- Clear and confident communicator with an ability to influence without authority
- Passion for automation, operational excellence, and team mentorship