Cloud/Site Reliability Engineer
1 week ago
AI Platform – Cloud/Site Reliability Engineer
Career Category: Information Systems
Job Description
Role Description:
We are looking for a Cloud/Site Reliability Engineer (SRE) to join our AI Platform team, focused on building and maintaining highly available, scalable, and secure infrastructure for AI/ML workloads. This role is critical to ensuring the reliability and performance of our AI services and platform components across cloud environments.
You will work closely with software engineers, ML engineers, and platform architects to design and implement robust monitoring, alerting, and incident response systems. You'll also contribute to automation of infrastructure provisioning, deployment pipelines, and performance tuning of AI workloads in production.
Roles & Responsibilities:
- Design and implement scalable, resilient cloud infrastructure to support AI/ML workloads.
- Develop and maintain observability tools including monitoring, logging, and alerting systems for AI platform services.
- Automate infrastructure provisioning and deployment using Infrastructure-as-Code (IaC) tools.
- Collaborate with engineering teams to ensure high availability and performance of AI services.
- Lead incident response and root cause analysis for platform outages or performance degradation.
- Implement security best practices and compliance controls across cloud environments.
- Optimize resource usage and cost efficiency of AI workloads in cloud environments.
- Participate in sprint planning and contribute to platform architecture and reliability strategy.
Must-Have Skills:
- Strong experience with cloud platforms (AWS, GCP, Azure) and cloud-native services.
- Proficiency in scripting and automation (Python, Bash, Terraform, etc.).
- Experience with containerization and orchestration (Docker, Kubernetes).
- Familiarity with monitoring and logging tools (Prometheus, Grafana, ELK, Datadog).
- Understanding of CI/CD pipelines and DevOps practices.
- Experience with incident management, root cause analysis, and reliability engineering.
- Knowledge of security principles and cloud compliance frameworks.
- Ability to learn quickly and to be organized and detail-oriented.
Good-to-Have Skills:
- Exposure to AI/ML workloads and performance tuning for model inference and training.
- Experience with MLOps tools (MLflow, Kubeflow, Airflow).
- Familiarity with service mesh technologies (Istio, Linkerd).
- Experience with cost optimization strategies in cloud environments.
- Knowledge of distributed systems and fault-tolerant architecture.
Education and Professional Certifications:
- Bachelor's degree in Computer Science, Engineering, or a related field.
- 5–9 years of experience in cloud infrastructure, DevOps, or SRE roles.
- Certifications in cloud platforms (AWS Solutions Architect, Azure Administrator, Google Cloud SRE) are a plus.
Soft Skills:
- Excellent analytical and troubleshooting skills.
- Strong verbal and written communication skills.
- Ability to work effectively with global, virtual teams.
- High degree of initiative and self-motivation.
- Ability to manage multiple priorities successfully.
- Team-oriented, with a focus on achieving team goals.
- Strong presentation and public speaking skills.
-
Site Reliability Engineer
1 week ago
Hyderabad, Telangana, India Oracle Financial Services Software Ltd Full time ₹ 12,00,000 - ₹ 36,00,000 per year
Principal Site Reliability Engineer
Oracle is seeking a motivated Principal Site Reliability Engineer who thrives in a fast-paced, rapidly evolving technology environment. This position requires broad knowledge of Linux administration, AI technologies, software development, cloud computing, networking, cloud security, performance analysis and...
-
Site Reliability Engineer
24 hours ago
Hyderabad, Telangana, India Talent Worx Full time ₹ 12,00,000 - ₹ 36,00,000 per year
Site Reliability Engineer (SRE)
At Talent Worx, we are looking for a dedicated Site Reliability Engineer (SRE) to join our team. This role involves maintaining high availability and reliability of our services through the application of software engineering practices and systems administration skills. The ideal candidate will bridge the gap between...
-
Site Reliability Engineer
3 days ago
Hyderabad, Telangana, India Technology Next Full time ₹ 20,00,000 - ₹ 30,00,000 per year
Urgently hiring for Site Reliability Engineer (SRE) / Chaos Engineer
Location: Hyderabad
Job Type: Full-time, Permanent
Job Description:
We are looking for an experienced Site Reliability Engineer (SRE) with strong Python automation skills (Boto3 required) and hands-on experience in chaos engineering to improve system reliability and resilience. The ideal...
-
Site Reliability Engineer
1 week ago
Hyderabad, Telangana, India SMARTWORK IT SERVICES Full time ₹ 12,00,000 - ₹ 24,00,000 per year
Role: Site Reliability Engineer (SRE)
Location: Hyderabad
Experience: 10 to 15 years
Job Summary: The Site Reliability Engineer (SRE) will play a critical role in ensuring the reliability, scalability, and performance of Citizens Bank's enterprise systems and cloud environments. The ideal candidate brings deep technical...
-
Site Reliability Engineer
3 days ago
Hyderabad, Telangana, India Apple Full time ₹ 15,00,000 - ₹ 25,00,000 per year
Imagine what you could do here. Apple is a place where extraordinary people gather to do their best work. Together we craft products and experiences people once couldn't have imagined, and now can't imagine living without. If you're motivated by the idea of making a real impact, and joining a team where we pride ourselves in being one of the most diverse...
-
Principal Site Reliability Engineer
4 days ago
Hyderabad, Telangana, India Oracle Full time ₹ 12,00,000 - ₹ 36,00,000 per year
Oracle is seeking a motivated Principal Site Reliability Engineer who thrives in a fast-paced, rapidly evolving technology environment. This position requires broad knowledge of Linux administration, AI technologies, software development, cloud computing, networking, cloud security, performance analysis and monitoring to provide the stability,...
-
SRE(Site Reliability Engineer)
24 hours ago
Hyderabad, Telangana, India Talent Worx Full time ₹ 12,00,000 - ₹ 36,00,000 per year
SRE (Site Reliability Engineer)
Talent Worx is seeking a talented SRE (Site Reliability Engineer) to enhance our technology team. In this role, you will be pivotal in ensuring the reliability, performance, and availability of our applications and services. Your work will involve both software engineering and systems operations as you strive to improve...
-
Site Reliability Engineer
1 week ago
Hyderabad, Telangana, India TurboHire Full time ₹ 15,00,000 - ₹ 28,00,000 per year
Site Reliability Engineer (SRE)
Location: Hyderabad (Hybrid)
Experience: 3–5 years
About the Role
We are looking for an SRE Engineer to own reliability, deployment, and monitoring of TurboHire's cloud infrastructure. You will ensure our platform is scalable, secure, and highly available. The role balances hands-on coding, automation, and infra operations, freeing...
-
Lead Site Reliability Engineer
5 days ago
Hyderabad, Telangana, India EPAM Systems Full time ₹ 15,00,000 - ₹ 25,00,000 per year
We are seeking a skilled Lead Site Reliability Engineer to drive the stability, scalability, and reliability of our systems while improving efficiency through automation and best practices. This role calls for deep expertise in DevOps methodologies, Infrastructure as Code (IaC), and collaboration across teams to ensure optimal system...
-
Site Reliability Engineer
2 weeks ago
Hyderabad, Telangana, India SID Global Solutions Full time ₹ 9,00,000 - ₹ 12,00,000 per year
Job Role: Site Reliability Engineer (SRE) – GCP
Experience: 3+ years
Location: Hyderabad
About SIDGS:
SIDGS is a premium global systems integrator and global implementation partner of Google corporation, providing Digital Solutions & Services to Fortune 500 companies. Our Digital solutions span the following domains: User Experience, CMS, API Management,...