Sr. Site Reliability Engineer

5 hours ago


Hyderabad, Telangana, India Amgen Inc Full time ₹ 12,00,000 - ₹ 36,00,000 per year

*What you will do*

In this vital role, you will help build, scale, and secure the platforms that underpin Amgen's global digital initiatives. This role focuses on ensuring the reliability, performance, and efficiency of cloud-native platforms while enabling development velocity and operational excellence.

You will be responsible for designing and operating infrastructure and shared platforms used across the enterprise, including CI/CD, observability, incident management, and collaboration systems.

You will work extensively with containerized environments, manage multi-tenant Kubernetes platforms, and automate processes to improve resilience and reduce operational burden. This role requires technical depth, leadership skills, and the ability to drive initiatives across cross-functional teams and global stakeholders.

*Roles & Responsibilities:*

Platform Reliability Engineering

  • Design, operate, and scale secure, highly available cloud-based infrastructure using Infrastructure as Code (IaC).
  • Manage multi-tenant container orchestration environments with advanced access controls, workload isolation, and governance policies.
  • Ensure enterprise CI/CD platforms are performant, secure, and optimized for high-throughput engineering teams.

Monitoring, Observability & Incident Management

  • Build and manage observability platforms for full-stack visibility, leveraging metrics, logs, and traces.
  • Define, implement, and continuously refine SLIs, SLOs, and error budgets for platform health and service performance.
  • Automate incident response workflows, integrate with incident management platforms, and lead post-incident reviews and root cause analysis.

Enterprise Platform Administration

  • Operate and improve core engineering platforms (e.g., CI/CD, collaboration, knowledge sharing) to ensure availability, security, and ease of use.
  • Automate platform provisioning, upgrades, access controls, and integration pipelines to reduce manual effort and improve consistency.
  • Implement compliance, audit logging, and policy enforcement through code-driven governance models.

AI Adoption & Enablement

  • Drive the adoption of AI/ML-based tools to enhance observability, incident prediction, remediation, and intelligent alerting.
  • Evaluate and integrate AI-assisted automation platforms to reduce toil and improve operational efficiency.
  • Partner with platform, security, and development teams to embed predictive analytics into dashboards, workflows, and root cause tooling.
  • Champion a data-driven SRE practice by enabling thoughtful insights and anomaly detection across systems and platforms.

Leadership & Collaboration

  • Serve as a technical thought leader and mentor within the SRE organization.
  • Promote SRE principles and reliability culture across engineering teams.
  • Collaborate with cross-functional stakeholders to influence architecture, roadmaps, and platform investment.
  • Lead operational reviews and service health retrospectives, with a focus on continuous improvement.
  • Participate in Agile and SAFe delivery processes, including sprint planning, stand-ups, retrospectives, and PI planning, to ensure security and platform reliability are embedded across development cycles.

Basic Qualifications:

  • Doctorate, Master's, or Bachelor's degree and 8 to 13 years of experience in Computer Science, Information Technology, or a related technical field
  • Demonstrated success operating cloud-native infrastructure in production environments
  • Practical experience managing Kubernetes clusters and CI/CD environments at enterprise scale
  • Exposure to global on-call or incident support rotations
  • Excellent collaboration and communication skills across technical and non-technical teams

Preferred Qualifications:

Must-Have Skills:

  • Deep experience with cloud platforms (AWS, Azure, or GCP), including services such as compute, networking, IAM, and VPC design
  • Proven proficiency in Infrastructure as Code (IaC) using tools such as Terraform or CloudFormation
  • Advanced skills in managing container orchestration platforms (e.g., Kubernetes), including workload isolation, resource quotas, and role-based access control
  • Strong understanding of Linux system administration, process management, and system performance tuning
  • Hands-on experience with CI/CD platforms and pipelines (build automation, artifact storage, environment provisioning, rollback strategies)
  • Strong background in observability tooling, including Prometheus, Grafana, Dynatrace, and distributed tracing frameworks like OpenTelemetry or Jaeger
  • Strong practical experience with incident management platforms and practices (e.g., alert routing, runbooks, escalation paths)
  • Automation and scripting proficiency in languages such as Python, Go, or Bash
  • Experience with configuration management tools like Ansible, Chef, or SaltStack
  • Strong grasp of networking fundamentals, such as routing, DNS, OSI layers, load balancing, firewalls, TLS, and security groups
  • Version control and collaboration workflows using Git and GitOps principles
  • Experience with enterprise collaboration platforms, including provisioning, integration, and permission control

Good-to-Have Skills:

  • Exposure to service mesh technologies (e.g., Istio, Linkerd) and zero-trust network concepts
  • Familiarity with secrets management platforms (e.g., HashiCorp Vault, AWS Secrets Manager)
  • Experience using incident response and chaos engineering tools (e.g., Gremlin, Chaos Mesh)
  • Background in cost optimization, budgeting, and resource tracking (FinOps)
  • Awareness of policy-as-code frameworks (e.g., OPA, Kyverno)
  • Familiarity with feature flagging and progressive delivery tools (e.g., LaunchDarkly, Argo Rollouts)
  • Integration experience with ticketing and change management platforms (e.g., ServiceNow, Jira)
  • Understanding of compliance standards (e.g., HIPAA, GDPR, SOC 2) and how they apply to infrastructure operations
  • Understanding of security and encryption technologies and authentication protocols such as OpenID, OIDC, OAuth, SAML, and LDAP

Professional Certifications (Preferred)

  • Cloud DevOps Certification (AWS/Azure/GCP)
  • Certified Kubernetes Administrator (CKA) or Certified Kubernetes Security Specialist (CKS)
  • CI/CD Platform Certification
  • ITIL Foundation or equivalent service management certification

Soft Skills:

  • High level of ownership and accountability for platform reliability
  • Strong diagnostic and analytical capabilities with a bias for action
  • Clear and confident communicator with an ability to influence without authority
  • Passion for automation, operational excellence, and team mentorship

