
Site Reliability Engineer
4 weeks ago
We are seeking a highly skilled and self-driven Site Reliability Engineer to join our dynamic team. This role is ideal for someone with a strong foundation in Kubernetes, DevOps, and observability who can also support machine learning infrastructure, GPU optimization, and Big Data ecosystems. You will play a pivotal role in ensuring the reliability, scalability, and performance of our production systems, while also enabling innovation across ML and data teams.
Key Responsibilities:
Infrastructure Automation & Reliability
- Design, build, and maintain Kubernetes clusters across hybrid or cloud environments (e.g., EKS, GKE, AKS).
- Implement and optimize CI/CD pipelines using tools like Jenkins, ArgoCD, and GitHub Actions.
- Develop and maintain Infrastructure as Code (IaC) using Ansible, Terraform, or equivalent.
Monitoring & Observability
- Deploy and maintain monitoring, logging, and tracing tools (e.g., Thanos, Prometheus, Grafana, Loki, Jaeger).
- Establish proactive alerting and observability practices to identify and address issues before they impact users.
ML Ops & GPU Optimization
- Support and scale ML workflows using tools like Kubeflow, MLflow, and TensorFlow Serving.
- Work with data scientists to ensure efficient use of GPU resources, optimizing training and inference pipelines.
Performance & Incident Management
- Lead root cause analysis for infrastructure and application-level incidents.
- Participate in the on-call rotation and improve incident response processes.
Scripting & Automation
- Automate operational tasks and service deployment using Python, Shell, Groovy, or Ansible.
- Write reusable scripts and tools to improve team productivity and reduce manual toil.
Continuous Learning & Collaboration
- Stay up-to-date with emerging technologies in SRE, ML Ops, and observability.
- Collaborate with cross-functional teams including engineering, data science, and security to ensure system integrity and scalability.
Must-Have:
- 3+ years of experience as an SRE, DevOps Engineer, or equivalent role.
- Strong experience with Kubernetes ecosystem and container orchestration.
- Proficiency in DevOps tooling including Jenkins, ArgoCD, and GitOps workflows.
- Deep understanding of observability tools, with hands-on experience using Thanos and Prometheus stack.
- Experience with ML platforms (MLflow, Kubeflow) and supporting GPU workloads.
- Strong scripting skills in Python, Shell, Ansible, or Groovy.
Preferred:
- CKS (Certified Kubernetes Security Specialist) certification.
- Exposure to Big Data platforms (e.g., Spark, Kafka, Hadoop).
- Experience with cloud-native environments (AWS, GCP, or Azure).
- Background in infrastructure security and compliance.
-
Site Reliability Engineer
4 days ago
Hyderabad, Telangana, India Talent Worx Full time ₹ 9,00,000 - ₹ 12,00,000 per year
Site Reliability Engineer (SRE)
At Talent Worx, we are looking for a dedicated Site Reliability Engineer (SRE) to join our team. This role involves maintaining high availability and reliability of our services through the application of software engineering practices and systems administration skills. The ideal candidate will bridge the gap between...
-
Site Reliability Engineer
3 weeks ago
Hyderabad, Telangana, India Talent Worx Full time
Talent Worx is seeking a talented SRE (Site Reliability Engineer) to enhance our technology team. In this role, you will be pivotal in ensuring the reliability, performance, and availability of our applications and services. Your work will involve both software engineering and systems operations as you strive to improve customer experiences and operational...
-
Site Reliability Engineer
5 days ago
Hyderabad, Telangana, India IntraEdge Full time
Site Reliability Engineer
Experience: 7+ Years
Location: Hyderabad
Hybrid 4-day office and 1 Day remote
Skills for Principal:
- Strong leadership and people management skills.
- Exceptional technical proficiency in Pearson's technology stack.
- Advanced project management capabilities.
- Excellent communication and collaboration skills.
- Adept at risk assessment and crisis...
-
Site Reliability Engineer
4 weeks ago
Hyderabad, Telangana, India IntraEdge Full time
Position - SRE (Site Reliability Engineer)
Experience - 5+ Years
Location - Hyderabad
Skills for Principal:
- Strong leadership and people management skills.
- Exceptional technical proficiency in Pearson's technology stack.
- Advanced project management capabilities.
- Excellent communication and collaboration skills.
- Adept at risk assessment and crisis...
-
Site Reliability Engineer
3 days ago
Hyderabad, Telangana, India IntraEdge Full time
Site Reliability Engineer
Experience: 7+ Years
Location: Hyderabad
Hybrid 4-day office and 1 Day remote
Skills for Principal:
- Strong leadership and people management skills.
- Exceptional technical proficiency in Pearson's technology stack.
- Advanced project management capabilities.
- Excellent communication and collaboration skills.
- Adept at risk assessment and...
-
Site Reliability Engineer
2 hours ago
Hyderabad, Telangana, India IntraEdge Full time
Site Reliability Engineer
Experience: 7+ Years
Location: Hyderabad
Skills for Principal:
- Strong leadership and people management skills.
- Exceptional technical proficiency in Pearson's technology stack.
- Advanced project management capabilities.
- Excellent communication and collaboration skills.
- Adept at risk assessment and crisis management.
- ...
-
Site Reliability Engineer
24 hours ago
Hyderabad, Telangana, India IntraEdge Full time
Site Reliability Engineer
Experience: 7+ Years
Location: Hyderabad
Hybrid 4-day office and 1 Day remote
Skills for Principal:
- Strong leadership and people management skills.
- Exceptional technical proficiency in Pearson's technology stack.
- Advanced project management capabilities.
- Excellent communication and collaboration skills.
- Adept at risk assessment...
-
SRE (Site Reliability Engineer)
4 days ago
Hyderabad, Telangana, India Talent Worx Full time ₹ 15,00,000 - ₹ 20,00,000 per year
SRE (Site Reliability Engineer)
Talent Worx is seeking a talented SRE (Site Reliability Engineer) to enhance our technology team. In this role, you will be pivotal in ensuring the reliability, performance, and availability of our applications and services. Your work will involve both software engineering and systems operations as you strive to improve...
-
Site Reliability Engineer
4 weeks ago
Hyderabad, Telangana, India VXI Global Solutions Full time
We are seeking a skilled Site Reliability Engineer with 4 to 8 years of experience to design, implement, and manage robust observability solutions across our cloud infrastructure and applications. The ideal candidate will have hands-on experience with Prometheus, Grafana, Google Cloud Monitoring, and OpenTelemetry, along with exposure to SolarWinds. You...
-
Lead Site Reliability Engineer
4 days ago
Hyderabad, Telangana, India JP Morgan Chase & Co. Full time
Job Description
Assume a critical role in defining the future of a globally recognized firm and have a direct and significant effect in a realm tailored for top achievers in site reliability. As a Lead Site Reliability Engineer at JPMorgan Chase within the Consumer & Community Banking Team, you will take the lead in conducting resiliency design reviews, break...