
Site Reliability Engineer
2 days ago
Job Description
Experience: 4-10 years
Location: Chennai
Work Mode: Hybrid (2 days in office)
We are looking for a Site Reliability Engineer (SRE) who will lead Site Reliability Engineering for each of our products. This position is responsible for making technical decisions, collaborating with development teams and platform engineers, and building and operating highly reliable and scalable products.
The SRE's mission is to design, build, and operate highly reliable systems that support stable business growth. Specifically, SREs quantitatively measure and manage system reliability, using SLIs and SLOs to strike an appropriate balance between reliability and risk. By automating operations to reduce human error, responding quickly to incidents, conducting root cause analysis, and driving continuous improvement, SREs enhance service resilience. Through these efforts, SREs cultivate an organizational culture that blends engineering and operational best practices.
Expected Role
In this role, you will act as a leader who identifies technical challenges within development teams, proactively plans solutions, and drives projects to resolution. By collaborating closely with developers and platform engineers, you will drive continuous improvement, ensuring that products remain resilient, scalable, and aligned with business objectives.
Key Responsibilities
1. Service Reliability & Scalability
- Design, build, and maintain highly available and scalable production services
- Define and implement Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to measure system reliability and performance
- Identify and resolve system bottlenecks, and conduct capacity planning
2. Incident Management
- Lead incident response efforts to mitigate and resolve production issues quickly
- Conduct postmortems and root cause analyses to prevent recurrence
- Continuously improve the incident management process and optimize on-call operations
3. Automation & Operational Efficiency
- Automate operational tasks using Infrastructure as Code (IaC) tools such as Terraform
- Implement self-healing and auto-scaling mechanisms for infrastructure components
- Optimize deployment pipelines and CI/CD workflows to improve release efficiency and rollback capabilities
4. Observability & Monitoring
- Design and implement comprehensive monitoring, logging, and tracing strategies using tools like OpenTelemetry, Grafana, Prometheus, and Datadog
- Tune alerting to reduce noise and ensure alerts are actionable
- Continuously enhance system visibility and root cause analysis capabilities
5. SRE Leadership
- Collaborate with development teams to identify and resolve operational and reliability-related technical challenges
- Define and execute reliability strategies as an SRE
- Act as a technical advisor on SRE methodologies within the organization
6. Collaboration & Knowledge Sharing
- Work closely with other SREs, platform engineers, and developers to optimize infrastructure and improve reliability
- Build developers' capability in SRE practices
- Develop internal tools and best practices to enhance operational efficiency
Requirements
We are looking for individuals who meet several of the following qualifications:
- Several years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering
- Some coding experience is required (it need not be web applications; experience with only batch processing or small automation scripts is acceptable)
- Shell-only experience (e.g., Bash) is not acceptable. Experience with at least one statically typed language (e.g., C, C++, Java, Rust, Go, Scala) or dynamically typed language (e.g., Perl, Ruby, Python, PHP, JavaScript) is required.
- Experience collaborating with development teams to enhance system reliability
- Technical leadership experience (mentoring and supporting team members in technical areas)
- Strong problem-solving skills and ability to take ownership of reliability-related challenges
- Proven experience in project management (identifying issues, planning solutions, driving execution, and coordinating stakeholders)
- Experience in several of the following technical areas:
  - Experience operating Kubernetes in a production environment
  - Proficiency in Infrastructure as Code (IaC) tools (e.g., Terraform, Crossplane)
  - Experience with CI/CD automation tools (e.g., ArgoCD, CircleCI, GitHub Actions)
  - Hands-on experience with observability tools (e.g., Prometheus, OpenTelemetry, Grafana, Datadog)
  - Familiarity with cloud platforms (AWS or others) and cloud-native architectures
  - Experience in incident management, disaster recovery, and high availability strategies
Preferred Qualifications
- Experience fostering SRE best practices within an organization
- Deep understanding of microservices architecture and its operational challenges
- Proficiency in programming languages such as Go, Python, or Bash for automation and tooling development
- Contributions to CNCF projects or open-source communities
Work Environment
- Opportunity to lead and define reliability strategies in a rapidly growing organization
- Collaboration with global teams in an agile and technically driven environment
- Hands-on experience with large-scale distributed systems and cutting-edge cloud-native technologies
- A culture that values automation, reliability, and continuous improvement
-
Site Reliability Engineer
4 days ago
Chennai, Tamil Nadu, India Zyoin Group Full time
Position: Site Reliability Engineer (SRE) Experience: 4 – 10 Years Location: Chennai (Hybrid – 2 days in office) Role Overview: We are seeking a Site Reliability Engineer (SRE) responsible for leading reliability practices, ensuring scalable systems, and collaborating with development teams to maintain highly available services. Key Responsibilities - Design,...
-
Site Reliability Engineer
2 days ago
Chennai, Tamil Nadu, India Zyoin Group Full time
Position: Site Reliability Engineer (SRE) Experience: 4 – 10 Years Location: Chennai (Hybrid – 2 days in office) Role Overview: We are seeking a Site Reliability Engineer (SRE) responsible for leading reliability practices, ensuring scalable systems, and collaborating with development teams to maintain highly available services. Key Responsibilities ...
-
Site Reliability Engineering Lead
2 days ago
Chennai, Tamil Nadu, India beBeeReliability Full time ₹ 2,00,00,000 - ₹ 2,50,00,000
Job Overview: We are seeking an experienced Site Reliability Engineering Lead to oversee the reliability, scalability, and performance of our systems. As a Site Reliability Engineering Lead, you will establish and implement SRE practices, lead a team of engineers, and drive automation, monitoring, and incident response strategies. This position combines...
-
Senior Site Reliability Engineer
1 day ago
Chennai, Tamil Nadu, India beBeeDevops Full time ₹ 12,00,000 - ₹ 24,00,000
Job Title: DevOps Engineer with Site Reliability Engineering
-
Senior Site Reliability Engineer
4 weeks ago
Chennai, Tamil Nadu, India Keuro Life Full time
Site Reliability Engineer / DevOps
We are seeking an experienced Site Reliability Engineer / DevOps professional with a minimum of 6 years in the industry. The ideal candidate will be adept at managing large-scale, high-traffic production environments and ensuring their reliability. Key Responsibilities: - Manage and optimize production environments to...
-
Site Reliability Engineer
1 day ago
Chennai, Tamil Nadu, India ViaSat Full time
About us: One team. Global challenges. Infinite opportunities. At Viasat, we're on a mission to deliver connections with the capacity to change the world. For more than 35 years, Viasat has helped shape how consumers, businesses, governments and militaries around the globe communicate. We're looking for people who think big, act fearlessly and create an...
-
Site Reliability Engineer
3 days ago
Chennai, Tamil Nadu, India Trimble Inc. Full time
Job Description
Job Summary: We are seeking a motivated Site Reliability Engineer (SRE) Level 1 to enhance the infrastructure and operational reliability of our ERP product, specifically within Azure and Windows environments. The ideal candidate will utilize SRE principles to ensure high system availability, stability, and performance while collaborating...
-
Site Reliability Engineer III
2 days ago
Chennai, Tamil Nadu, India ACV Full time US$ 1,04,000 - US$ 1,30,878 per year
If you are looking for a career at a dynamic company with a people-first mindset and a deep culture of growth and autonomy, ACV is the right place for you. Competitive compensation packages and learning and development opportunities, ACV has what you need to advance to the next level in your career. We will continue to raise the bar every day by investing in...
-
Senior Site Reliability Engineer
4 weeks ago
Chennai, Tamil Nadu, India Athenahealth Full time
Join us as we work to create a thriving ecosystem that delivers accessible, high-quality, and sustainable healthcare for all. We are looking for a Senior Site Reliability Engineer to join our Service Operations, Site Reliability Engineering team within the Cloud Infrastructure Engineering division. This team is newly formed and is responsible for managing the...
-
Associate, Site Reliability Engineer
1 week ago
Chennai, Tamil Nadu, India Pfizer Full time US$ 1,25,000 - US$ 1,75,000 per year
ROLE SUMMARY
At Pfizer we make medicines and vaccines that change patients' lives with a global reach of over 780 million patients. Pfizer Digital is the organization charged with winning the digital race in the pharmaceutical industry. We apply our expertise in technology, innovation, and our business to support Pfizer in this mission. Our team, the Global...