 
						Site Reliability Engineer
7 days ago
Position Summary:
We are seeking a proactive and innovative Site Reliability Engineer to join our growing team. In this role, you will be a key player in ensuring the reliability, scalability, and performance of our critical systems. You will move beyond traditional monitoring to implement advanced observability, leverage AIOps for predictive insights, and use Chaos Engineering to proactively uncover system weaknesses. This is an opportunity to help shape a modern SRE culture, automate away toil, and empower our development teams to build more resilient applications from the ground up.
Key Responsibilities
Observability & Proactive System Health
- Design, build, and maintain a comprehensive observability platform using tools like Splunk and OpenTelemetry to provide deep insights into system health and performance. 
- Leverage AIOps principles and platforms to enhance anomaly detection, automate event correlation, and enable predictive alerting, reducing mean time to detection (MTTD).
- Develop and manage robust alerting strategies and SLO-based dashboards to ensure critical issues are addressed before they impact customers. 
- Drive a data-driven culture by providing engineering teams with the visibility they need to understand the impact of their code in production. 
Reliability & Resilience Engineering
- Design, implement, and conduct Chaos Engineering experiments to proactively identify and remediate system weaknesses, architectural flaws, and potential cascading failures.
- Partner with software engineering teams throughout the application lifecycle to architect for high availability, disaster recovery, and fault tolerance. 
- Define, measure, and evangelize Service Level Indicators (SLIs) and Service Level Objectives (SLOs), and manage the associated error budgets to balance reliability with feature velocity. 
- Analyze and lead blameless post-mortems for incidents, ensuring that root causes are addressed and preventative measures are implemented to avoid recurrence. 
Performance & Efficiency Optimization
- Analyze performance metrics and distributed traces to identify and resolve latency bottlenecks across our infrastructure and applications. 
- Implement cost optimization (FinOps) strategies by identifying and eliminating resource waste, optimizing cloud service usage, and promoting efficient architecture patterns. 
- Work with development teams to conduct performance testing and ensure new features do not introduce performance regressions. 
Automation & Platform Engineering
- Identify and aggressively automate manual operational tasks (toil) by developing scripts, tools, and self-healing systems. 
- Enhance and maintain our Infrastructure as Code (IaC) modules, promoting reusable patterns and best practices with Terraform. 
- Improve and secure CI/CD pipelines (e.g., GitHub Actions, Azure DevOps) to enable safe, automated, and rapid deployment and rollback procedures. 
Requirements and Qualifications
Core Technical Skills
- Experience: 4+ years in a Site Reliability, DevOps, or Cloud Engineering role, with demonstrable experience in a large-scale production environment.
- Cloud Proficiency: Deep experience with AWS services (EKS, ECS, EC2, S3, RDS, Lambda) and managing production workloads in the cloud.
- Observability: Proficient in application observability, monitoring, and logging. Hands-on experience with tools like Splunk, OpenTelemetry, Prometheus, Grafana, or Datadog is essential.
- Infrastructure as Code (IaC): Strong experience with Terraform for provisioning and managing cloud infrastructure.
- Containerization: Solid understanding of containerization technology, particularly with managed services like EKS or ECS.
- CI/CD: Experience building and maintaining CI/CD pipelines using tools like GitHub Actions, Azure DevOps, or Jenkins.
- Scripting & Automation: Strong scripting skills in languages like Python, Bash, or PowerShell for automation and tooling. Familiarity with a higher-level language such as C# (.NET) is a plus.
- Modern Practices: Experience with or a demonstrated understanding of AIOps concepts and Chaos Engineering principles and tools (e.g., Gremlin, AWS Fault Injection Simulator).
Professional Attributes
- SRE Mindset: A true understanding of Site Reliability Engineering principles, including SLOs, error budgets, and the value of eliminating toil.
- Problem-Solving: Excellent troubleshooting and problem-solving skills, with a methodical approach to resolving complex technical issues under pressure.
- Collaboration: Ability to work effectively with development teams, product managers, and other stakeholders, communicating complex technical ideas clearly.
- Ownership & Drive: A strong sense of ownership, urgency, and a passion for building and maintaining highly available, performant, and reliable systems.
- Agile Experience: Comfortable working in an agile environment and contributing to team sprints and planning.
- On-Call: Willingness to participate in a scheduled on-call rotation.
Education & Certifications
- Bachelor's degree in Computer Science, Information Technology, or a related field, or equivalent practical experience. 
- AWS certification (e.g., AWS Certified Solutions Architect, DevOps Engineer) is highly preferred. 
- 
Site Reliability Engineer
4 weeks ago
Noida, India CorroHealth Full time
We are seeking a highly skilled Site Reliability Engineer (SRE) to join our team. The ideal candidate will have a deep understanding of both software engineering and systems administration, with a focus on creating scalable and reliable systems. You will work closely with development and operations teams to ensure the reliability, availability, and...
- 
Site Reliability Engineer
3 weeks ago
Noida, India CorroHealth Full time
We are seeking a highly skilled Site Reliability Engineer (SRE) to join our team. The ideal candidate will have a deep understanding of both software engineering and systems administration, with a focus on creating scalable and reliable systems. You will work closely with development and operations teams to ensure the reliability, availability, and...
- 
Site Reliability Engineer
13 hours ago
Noida, Uttar Pradesh, India CorroHealth Full time ₹ 15,00,000 - ₹ 25,00,000 per year
We are seeking a highly skilled Site Reliability Engineer (SRE) to join our team. The ideal candidate will have a deep understanding of both software engineering and systems administration, with a focus on creating scalable and reliable systems. You will work closely with development and operations teams to ensure the reliability, availability, and...
- 
Site Reliability Engineer
3 days ago
Noida, Uttar Pradesh, India Times Internet Full time ₹ 1,04,000 - ₹ 1,30,878 per year
Role: Site Reliability Engineer
Experience: 8-14 years
Location: Sector 16, Noida
Notice Period: Immediate / Serving only
About Times Internet
At Times Internet, we create premium digital products that simplify and enhance the lives of millions. As India's largest digital products company, we have a significant presence across a wide range of categories, including...
- 
Site Reliability Engineer
2 weeks ago
Noida, Uttar Pradesh, India Cloud Angles Digital Transformation Full time ₹ 15,00,000 - ₹ 25,00,000 per year
About the Role: We are seeking a skilled and proactive Site Reliability Engineer I & II (SRE II) to join our growing infrastructure team. As an SRE II, you will play a critical role in ensuring the reliability, scalability, and performance of our systems. You'll work independently and collaboratively to design, implement, and maintain robust infrastructure...
- 
Site Reliability Engineer
3 days ago
Noida, Uttar Pradesh, India ALIQAN Technologies Full time ₹ 15,00,000 - ₹ 25,00,000 per year
Greetings from ALIQAN Technologies
We are hiring a Site Reliability & DevOps Engineer for one of our client MNCs.
Job Title: DevOps Engineer
Exp: 4-6 Yrs
Location: Remote
Key Responsibilities
Infrastructure & Platform Engineering
- Design, implement, and maintain scalable cloud infrastructure using Infrastructure as Code (IaC) principles
- Architect and manage...
- 
Urgent! Site Reliability Engineer
4 weeks ago
Gurugram, Pune, India Prerna Malhotra (Proprietor Of Praxis Hr Solutions) Full time
Job Description
We are seeking a skilled Site Reliability Engineer (SRE) to join our dynamic team in India. The SRE will be responsible for ensuring the reliability, availability, and performance of our applications and services. This role requires a combination of software engineering and systems engineering to build and maintain scalable and...
- 
Site Reliability Engineer
3 weeks ago
Noida, India ALIQAN Technologies Full time
Greetings from ALIQAN Technologies!
We are hiring a Site Reliability & DevOps Engineer for one of our client MNCs.
Job Title: DevOps Engineer
Exp: 4-6 Yrs
Location: Remote
Key Responsibilities
Infrastructure & Platform Engineering
- Design, implement, and maintain scalable cloud infrastructure using Infrastructure as Code (IaC) principles
- Architect and...