
Cloud/Site Reliability Engineer
7 days ago
AI Platform – Cloud/Site Reliability Engineer
Career Category: Information Systems
Job Description
Role Description:
We are looking for a Cloud/Site Reliability Engineer (SRE) to join our AI Platform team, focused on building and maintaining highly available, scalable, and secure infrastructure for AI/ML workloads. This role is critical to ensuring the reliability and performance of our AI services and platform components across cloud environments.
You will work closely with software engineers, ML engineers, and platform architects to design and implement robust monitoring, alerting, and incident response systems. You'll also contribute to automation of infrastructure provisioning, deployment pipelines, and performance tuning of AI workloads in production.
Roles & Responsibilities:
- Design and implement scalable, resilient cloud infrastructure to support AI/ML workloads.
- Develop and maintain observability tools including monitoring, logging, and alerting systems for AI platform services.
- Automate infrastructure provisioning and deployment using Infrastructure-as-Code (IaC) tools.
- Collaborate with engineering teams to ensure high availability and performance of AI services.
- Lead incident response and root cause analysis for platform outages or performance degradation.
- Implement security best practices and compliance controls across cloud environments.
- Optimize resource usage and cost efficiency of AI workloads in cloud environments.
- Participate in sprint planning and contribute to platform architecture and reliability strategy.
Must-Have Skills:
- Strong experience with cloud platforms (AWS, GCP, Azure) and cloud-native services.
- Proficiency in scripting and automation (Python, Bash, Terraform, etc.).
- Experience with containerization and orchestration (Docker, Kubernetes).
- Familiarity with monitoring and logging tools (Prometheus, Grafana, ELK, Datadog).
- Understanding of CI/CD pipelines and DevOps practices.
- Experience with incident management, root cause analysis, and reliability engineering.
- Knowledge of security principles and cloud compliance frameworks.
- Ability to learn quickly and to stay organized and detail-oriented.
Good-to-Have Skills:
- Exposure to AI/ML workloads and performance tuning for model inference and training.
- Experience with MLOps tools (MLflow, Kubeflow, Airflow).
- Familiarity with service mesh technologies (Istio, Linkerd).
- Experience with cost optimization strategies in cloud environments.
- Knowledge of distributed systems and fault-tolerant architecture.
Education and Professional Certifications:
- Bachelor's degree in Computer Science, Engineering, or a related field.
- 5–9 years of experience in cloud infrastructure, DevOps, or SRE roles.
- Certifications in cloud platforms (AWS Solutions Architect, Azure Administrator, Google Cloud SRE) are a plus.
Soft Skills:
- Excellent analytical and troubleshooting skills.
- Strong verbal and written communication skills.
- Ability to work effectively with global, virtual teams.
- High degree of initiative and self-motivation.
- Ability to manage multiple priorities successfully.
- Team-oriented, with a focus on achieving team goals.
- Strong presentation and public speaking skills.
-
Site Reliability Engineer
2 weeks ago
Hyderabad, Telangana, India | Jigya Software Services | Full time | ₹ 1,50,000 - ₹ 28,00,000 per year
Job Title: Senior Site Reliability Engineer (SRE) - AWS/Kubernetes
Location: Hyderabad - Onsite
Job Type: Full-Time
About the Role: We are looking for a highly skilled and motivated Site Reliability Engineer to design, build, and maintain our high-performance, scalable cloud infrastructure. You will play a critical role in ensuring the reliability, performance, and...
-
Site Reliability Engineer
2 weeks ago
Hyderabad, Telangana, India | Evalify-IQ | Full time | ₹ 6,00,000 - ₹ 18,00,000 per year
Skills Required: AWS, Azure, Terraform, CloudFormation, Pulumi, CICD, GitHub Actions, GitLab CI, Jenkins, ArgoCD, Prometheus, Splunk, Grafana, Cloudwatch, Datadog, SRE, Site Reliability, Python, Powershell, Shell, Go, Kubernetes, Docker, Performance Tuning, Performance Enhancements, Performance Enhancement, Performance
Experience Range: 2 - 5...
-
Site Reliability Engineer
7 days ago
Hyderabad, Telangana, India | TurboHire | Full time | ₹ 15,00,000 - ₹ 28,00,000 per year
Site Reliability Engineer (SRE)
Location: Hyderabad (Hybrid)
Experience: 3–5 years
About the Role
We are looking for an SRE Engineer to own reliability, deployment, and monitoring of TurboHire's cloud infrastructure. You will ensure our platform is scalable, secure, and highly available. The role balances hands-on coding, automation, and infra operations, freeing...
-
Lead Site Reliability Engineer
2 days ago
Hyderabad, Telangana, India | EPAM Systems | Full time | ₹ 15,00,000 - ₹ 25,00,000 per year
We are seeking a skilled Lead Site Reliability Engineer to drive the stability, scalability, and reliability of our systems while improving efficiency through automation and best practices. This role calls for deep expertise in DevOps methodologies, Infrastructure as Code (IaC), and collaboration across teams to ensure optimal system...
-
Site Reliability Engineer
2 weeks ago
Hyderabad, Telangana, India | Oracle Financial Services Software Ltd | Full time | ₹ 12,00,000 - ₹ 36,00,000 per year
Senior Principal Site Reliability Engineer, Fusion SRE
About Oracle Cloud: Oracle Cloud is a comprehensive suite of cloud services (including infrastructure, platform, and applications) designed to help organizations build, deploy, and manage workloads securely at scale. At Oracle, we are building the most intelligent future of cloud computing. Our...
-
Site Reliability Engineer
2 weeks ago
Hyderabad, Telangana, India | SS&C TECHNOLOGIES | Full time | ₹ 5,00,000 - ₹ 12,00,000 per year
Site Reliability Engineer (PA2025Q3JB087)
As a leading financial services and healthcare technology company based on revenue, SS&C is headquartered in Windsor, Connecticut, and has 27,000 employees in 35 countries. Some 20,000 financial services and healthcare organizations, from the world's largest companies to small and mid-market firms, rely on SS&C for...
-
Site Reliability Engineer
2 weeks ago
Hyderabad, Telangana, India | SID Global Solutions | Full time | ₹ 9,00,000 - ₹ 12,00,000 per year
Job Role: Site Reliability Engineer (SRE) – GCP
Experience: 3+ years
Location: Hyderabad
About SIDGS: SIDGS is a premium global systems integrator and global implementation partner of Google corporation, providing Digital Solutions & Services to Fortune 500 companies. Our Digital solutions go across following domains: User Experience, CMS, API Management,...
-
Principal Site Reliability Engineer
7 days ago
Hyderabad, Telangana, India | Amgen Inc | Full time | ₹ 8,00,000 - ₹ 12,00,000 per year
We are looking for a Site Reliability Engineer/Cloud Engineer (SRE) to work on the performance optimization, standardization, and automation of Amgen's critical infrastructure and systems. This role is crucial to ensuring the reliability, scalability, and cost-effectiveness of our production systems. The ideal candidate will work on operational excellence...
-
Site Reliability Engineer
7 days ago
Hyderabad, Telangana, India | Amgen Inc | Full time | ₹ 8,00,000 - ₹ 12,00,000 per year
What you will do
In this vital role you will be responsible for the reliability, stability, performance, scalability, and security of platforms that support Amgen's digital products and engineering teams. This hands-on role focuses on supporting cloud-based infrastructure, automating operations, maintaining observability, and improving platform reliability...
-
Site Reliability Engineer
2 weeks ago
Hyderabad, Telangana, India | Apexsync Technologies | Full time | ₹ 12,00,000 - ₹ 36,00,000 per year
Hello Everyone,
We're looking for an experienced Site Reliability Engineer who excels in automation, cloud infrastructure, and observability solutions. The right candidate will combine technical depth with a proactive mindset to drive system reliability and performance.
Location: Hyderabad (Hybrid role, 2-3 days in office)
Experience level: Senior (7 years and...