AI Platform – Cloud/Site Reliability Engineer

15 hours ago


Hyderabad, Telangana, India · Amgen · Full time · ₹ 8,00,000 - ₹ 20,00,000 per year

Role Description:

We are looking for a Cloud/Site Reliability Engineer (SRE) to join our AI Platform team, focused on building and maintaining highly available, scalable, and secure infrastructure for AI/ML workloads. This role is critical to ensuring the reliability and performance of our AI services and platform components across cloud environments.

You will work closely with software engineers, ML engineers, and platform architects to design and implement robust monitoring, alerting, and incident response systems. You will also help automate infrastructure provisioning and deployment pipelines, and tune the performance of AI workloads in production.

Roles & Responsibilities:

  • Design and implement scalable, resilient cloud infrastructure to support AI/ML workloads.
  • Develop and maintain observability tools, including monitoring, logging, and alerting systems, for AI platform services.
  • Automate infrastructure provisioning and deployment using Infrastructure-as-Code (IaC) tools.
  • Collaborate with engineering teams to ensure high availability and performance of AI services.
  • Lead incident response and root cause analysis for platform outages or performance degradation.
  • Implement security best practices and compliance controls across cloud environments.
  • Optimize resource usage and cost efficiency of AI workloads in cloud environments.
  • Participate in sprint planning and contribute to platform architecture and reliability strategy.

Must-Have Skills:

  • Strong experience with cloud platforms (AWS, GCP, Azure) and cloud-native services.
  • Proficiency in scripting and automation (Python, Bash, Terraform, etc.).
  • Experience with containerization and orchestration (Docker, Kubernetes).
  • Familiarity with monitoring and logging tools (Prometheus, Grafana, ELK, Datadog).
  • Understanding of CI/CD pipelines and DevOps practices.
  • Experience with incident management, root cause analysis, and reliability engineering.
  • Knowledge of security principles and cloud compliance frameworks.
  • Ability to learn quickly and to stay organized and detail-oriented.

Good-to-Have Skills:

  • Exposure to AI/ML workloads and performance tuning for model inference and training.
  • Experience with MLOps tools (MLflow, Kubeflow, Airflow).
  • Familiarity with service mesh technologies (Istio, Linkerd).
  • Experience with cost optimization strategies in cloud environments.
  • Knowledge of distributed systems and fault-tolerant architecture.

Education and Professional Certifications:

  • Bachelor's degree in Computer Science, Engineering, or a related field.
  • 5–9 years of experience in cloud infrastructure, DevOps, or SRE roles.
  • Certifications in cloud platforms (AWS Solutions Architect, Azure Administrator, Google Cloud SRE) are a plus.

Soft Skills:

  • Excellent analytical and troubleshooting skills.
  • Strong verbal and written communication skills.
  • Ability to work effectively with global, virtual teams.
  • High degree of initiative and self-motivation.
  • Ability to manage multiple priorities successfully.
  • Team-oriented, with a focus on achieving team goals.
  • Strong presentation and public speaking skills.

