Site Reliability Engineer
3 weeks ago
Role Overview :We are seeking a highly experienced and technically proficient Site Reliability Engineer (SRE) to join our team in support of our client, Qincline. The ideal candidate will have 7 or more years of dedicated experience in Site Reliability Engineering or a closely related discipline. This pivotal role requires a strong focus on ensuring the reliability, scalability, performance, and operational efficiency of large-scale, complex production systems. You'll be instrumental in bridging the gap between development and operations by applying engineering principles to operational challenges.Key Responsibilities :Reliability & Performance Engineering :- System Reliability : Design, build, and maintain robust, fault-tolerant production systems and infrastructure to meet stringent Service Level Objectives (SLOs).- Performance Tuning : Proactively identify and resolve performance bottlenecks across the entire application stack, from infrastructure to application code.- Automation : Develop and implement automation for operational tasks, infrastructure provisioning, deployment, and monitoring to eliminate manual toil.- Capacity Planning : Collaborate with development teams on capacity planning, forecasting demand, and ensuring the infrastructure can scale efficiently to meet future business needs.Operations & Incident Management :- Monitoring & Alerting : Establish and maintain comprehensive monitoring, logging, and alerting systems to gain deep visibility into system health and performance (e.g., using Prometheus, Grafana, ELK Stack, etc.).- Incident Response : Serve as a key responder during critical incidents, performing rapid triage, mitigation, and recovery.- Post-Mortems & RCA : Lead detailed Post-Mortem and Root Cause Analysis (RCA) processes for all significant incidents, ensuring that permanent fixes and preventative measures are implemented to prevent recurrence.- On-Call : Participate in a periodic on-call rotation to provide 24/7 support for critical production systems.Tooling & Infrastructure :- CI/CD & DevOps : Enhance and manage CI/CD pipelines to facilitate fast, reliable, and automated software releases.- Containerization & Orchestration : Manage and optimize containerized environments using Docker and Kubernetes.- Infrastructure as Code (IaC) : Utilize IaC tools (e.g., Terraform, Ansible) to provision and manage infrastructure in a repeatable and documented manner.Required Skills & Experience :Core Experience (7+ Years) :- Minimum 7 years of hands-on experience in a Site Reliability Engineer, DevOps Engineer, or Production Engineer role supporting high-availability, mission-critical production environments.- Deep expertise in establishing and improving system monitoring, logging, alerting, and telemetry practices.- Demonstrated experience with formal Incident Management processes and leading thorough Root Cause Analysis (RCA).Technical Expertise :- Cloud Platforms : Extensive, hands-on experience with at least one major cloud provider (e.g., AWS, Azure, or GCP). This includes managing compute, networking, storage, and managed services.- Scripting & Programming : Strong proficiency in scripting and programming languages, with mandatory expertise in Python and Shell scripting for automation and tooling.- DevOps Tooling : Proven experience with CI/CD pipeline tools (e.g., Jenkins, GitLab CI, Azure DevOps), Git, and artifact repositories.- Containerization : Expert-level knowledge of Docker and robust experience with orchestrating large-scale deployments using Kubernetes.- Operating Systems : Strong command of Linux/Unix operating systems and networking fundamentals (TCP/IP, DNS, Load Balancing).Desired Qualifications (Good to Have) :- Experience with configuration management tools (e.g., Ansible, Chef, Puppet).- Familiarity with service mesh technologies (e.g., Istio, Linkerd).- Knowledge of database administration and performance tuning (SQL/NoSQL).- Certifications related to SRE, Cloud (e.g., AWS Certified DevOps Engineer), or Kubernetes (CKA, CKAD). (ref:hirist.tech)
-
Site Reliability Engineer
1 week ago
Chennai, India Datum Technologies Group Full timeJob Title: Site Reliability Engineer (SRE) – Azure & AIExperience: 7+ yearsWork Mode: HybridWork Location: Chennai/Mumbai/GurgaonJob Summary:We are looking for an experienced Site Reliability Engineer (SRE) with strong expertise in Microsoft Azure, AI infrastructure, and automation. The ideal candidate will have a solid background in managing cloud...
-
Site Reliability Engineer
1 week ago
Chennai, India Datum Technologies Group Full timeJob Title: Site Reliability Engineer (SRE) – Azure & AIExperience: 7+ yearsWork Mode: HybridWork Location: Chennai/Mumbai/GurgaonJob Summary:We are looking for an experienced Site Reliability Engineer (SRE) with strong expertise in Microsoft Azure, AI infrastructure, and automation. The ideal candidate will have a solid background in managing cloud...
-
Site Reliability Engineer
1 week ago
Chennai, India Datum Technologies Group Full timeJob Title: Site Reliability Engineer (SRE) – Azure & AIExperience: 7+ yearsWork Mode: HybridWork Location: Chennai/Mumbai/GurgaonJob Summary:We are looking for an experienced Site Reliability Engineer (SRE) with strong expertise in Microsoft Azure, AI infrastructure, and automation. The ideal candidate will have a solid background in managing cloud...
-
Site Reliability Engineer
2 weeks ago
Chennai, India Zyoin Group Full timeDescription : MoneyForward is seeking a Site Reliability Engineer (SRE) to lead the reliability, scalability, and performance of our products. This role involves making critical technical decisions, collaborating with development and platform engineering teams, and ensuring that our systems remain resilient and scalable to support stable business...
-
Site Reliability Engineer
3 weeks ago
chennai, India Tata Consultancy Services Full timeRole: Site Reliability Engineer Experience: 4 to 7 Years Locations: Chennai/Pune/Kolkata
-
Site Reliability Engineer
3 weeks ago
Chennai, India Tata Consultancy Services Full timeRole: Site Reliability Engineer Experience: 4 to 7 Years Locations: Chennai/Pune/Kolkata
-
Site Reliability Engineer
2 weeks ago
Chennai, India Tata Consultancy Services Full timeRole: Site Reliability Engineer Experience: 4 to 7 Years Locations: Chennai/Pune/Kolkata
-
Site Reliability Engineer
3 days ago
chennai, India Tata Consultancy Services Full timeRole: Site Reliability Engineer Experience: 4 to 7 Years Locations: Chennai/Pune/Kolkata
-
Site Reliability Engineer
4 days ago
chennai, India Tata Consultancy Services Full timeRole: Site Reliability Engineer Experience: 4 to 7 Years Locations: Chennai/Pune/Kolkata
-
Site Reliability Engineer
1 week ago
Chennai, India Datum Technologies Group Full timeJob Title: Site Reliability Engineer (SRE) – Azure & AI Experience: 7+ years Work Mode: Hybrid Work Location: Chennai/Mumbai/Gurgaon Job Summary: We are looking for an experienced Site Reliability Engineer (SRE) with strong expertise in Microsoft Azure, AI infrastructure, and automation. The ideal candidate will have a solid background in managing cloud...