Site reliability engineer

4 days ago


India Xebia Full time

We are looking for a highly skilled AWS Engineer with strong Python development and Chaos Engineering expertise to design, build, and validate resilient, scalable, and automated cloud-native environments. The ideal candidate will combine cloud engineering, DevOps, and chaos experimentation to improve reliability, fault tolerance, and operational efficiency of critical systems.

Key Responsibilities

Cloud Engineering (AWS):
  • Architect, implement, and manage secure, scalable, and cost-efficient AWS infrastructure (EC2, Lambda, EKS, S3, RDS, IAM, CloudFront, etc.).
  • Automate infrastructure provisioning and configuration using Terraform / CloudFormation and AWS SDKs.
  • Manage containerized workloads (Docker, Kubernetes, EKS).

Python Development:
  • Build automation scripts, deployment utilities, and infrastructure tooling using Python (Boto3, Flask, FastAPI, etc.).
  • Develop custom monitoring/alerting integrations with APIs, SDKs, and third-party observability platforms.
  • Implement self-healing and resilience-focused automation scripts.

Chaos Engineering & Resiliency:
  • Design and execute chaos experiments (fault injection, latency, outages, resource failures) to validate system resilience.
  • Use tools like Gremlin, Litmus, Chaos Mesh, or AWS Fault Injection Simulator.
  • Partner with SRE and development teams to define SLIs, SLOs, and error budgets.
  • Document learnings from chaos tests and improve incident response & recovery playbooks.

DevOps & Observability:
  • Build and maintain CI/CD pipelines for automated deployments (Jenkins, GitHub Actions, GitLab CI, AWS CodePipeline).
  • Integrate observability frameworks (Prometheus, Grafana, ELK/EFK, CloudWatch, Datadog) for monitoring and tracing.
  • Ensure proactive alerting and real-time visibility into system health.

Security & Compliance:
  • Apply AWS security best practices for IAM, networking, and data protection.
  • Ensure compliance with internal and external regulatory frameworks (SOC2, ISO, GDPR, etc.).

Required Skills & Qualifications
  • 6–10 years of experience in Cloud, DevOps, or SRE roles.
  • Strong hands-on expertise in AWS Cloud (certifications preferred: AWS DevOps Engineer / Solutions Architect).
  • Advanced Python development skills for automation and tooling (Boto3 a must).
  • Experience designing and running chaos experiments (Gremlin, AWS FIS, Litmus, Chaos Mesh, or custom Python-based fault injection).
  • Solid knowledge of IaC (Terraform / CloudFormation).
  • Proficiency in containers & orchestration (Docker, Kubernetes, EKS).
  • Strong background in monitoring, observability, and incident management.
  • Familiarity with DevOps toolchain (CI/CD, Git, Jenkins, GitLab, CodePipeline).
  • Good understanding of resilient architectures, reliability principles, and disaster recovery.

Preferred Skills
  • Knowledge of Go / Shell scripting in addition to Python.
  • Experience with chaos testing in production-like environments.
  • Exposure to multi-cloud or hybrid-cloud environments.
  • Strong problem-solving and debugging skills.

What We Offer
  • Opportunity to lead cloud reliability & chaos engineering initiatives.
  • Culture focused on automation, resilience, and continuous improvement.
  • Growth opportunities through certifications, R&D projects, and leadership roles.



  • India Elgebra Full time

    Hiring: Site Reliability Engineer – 7+ Years Location: Bangalore / Chennai Payroll: Elgebra


  • India Employ Full time

    Role - Site Reliability Engineer (SRE)/ Platform Engineering/ or DevOps Engineering roles Location – Fully Remote Type - 6 months Contract Work Ex - 5+ Yrs We’re working with an AI product company that’s building the next generation of GenAI powered developer platforms. We’re looking for an experienced Site Reliability Engineer to join...


  • India Concord Full time

    SRE Sr. Engineers (Individual Contributors) Key Attributes: - Strong SRE (Site Reliability Engineering) experience - DevOps skills – CI/CD, monitoring, automation, infrastructure as code, etc. - Excellent troubleshooting and debugging skills (infrastructure + application level) - Perseverance – must push through complex/challenging issues without...


  • India Concord Full time

    SRE Sr. Engineers (Individual Contributors) Key Attributes: - Strong SRE (Site Reliability Engineering) experience - DevOps skills – CI/CD, monitoring, automation, infrastructure as code, etc. - Excellent troubleshooting and debugging skills (infrastructure + application level) - Perseverance – must push through complex/challenging issues without...


  • India Employ Full time

    Role - Site Reliability Engineer (SRE)/ Platform Engineering/ or DevOps Engineering roles Location – Bangalore/ Remote Type - Contract Work Ex - 4-6 yrs We're working with an AI product company that's building the next generation of GenAI powered developer platforms. We're looking for an experienced Site Reliability Engineer to join their Platform Engineering...


  • India Employ Full time

    Role - Site Reliability Engineer (SRE)/ Platform Engineering/ or DevOps Engineering roles Location – Bangalore/ Remote Type - Contract Work Ex - 4-6 yrs We're working with an AI product company that's building the next generation of GenAI powered developer platforms. We're looking for an experienced Site Reliability Engineer to join their Platform...


  • India Synechron Full time

    We have an immediate opportunity for an SRE (Senior Site Reliability Engineer) with 5 to 9 years of experience. Synechron – Bangalore Job Role: SRE (Senior Site Reliability Engineer) Job Location: Bangalore Notice Period: Within 30 days About Synechron We began life in 2001 as a small, self-funded team of technology specialists. Since then, we’ve grown our organization to 14,500+...


  • India Employ Full time

    Role - Site Reliability Engineer (SRE)/ Platform Engineering/ or DevOps Engineering roles Location – Fully Remote Type - 6 months Contract Work Ex - 5+ Yrs We’re working with an AI product company that’s building the next generation of GenAI powered developer platforms. We’re looking for an experienced Site Reliability...


  • India Concord Full time

    SRE Sr. Engineers (Individual Contributors) Key Attributes: - Strong SRE (Site Reliability Engineering) experience - DevOps skills – CI/CD, monitoring, automation, infrastructure as code, etc. - Excellent troubleshooting and debugging skills (infrastructure + application level) - Perseverance – must push through complex/challenging issues without giving up...


  • India Akamai Full time

    Do you want to grow your career in Linux and Site Reliability Engineering? Would you like to contribute to the foundation of a new public cloud platform? Join our IaaS Site Reliability Engineering (SRE) team. We design, develop, and operate infrastructure and services that power the backbone of our cloud platform. This is a rare opportunity to help build a...