AI Platform – Cloud/Site Reliability Engineer

6 days ago


Hyderabad, Telangana, India Amgen Full time ₹ 8,00,000 - ₹ 20,00,000 per year

Role Description:

We are looking for a Cloud/Site Reliability Engineer (SRE) to join our AI Platform team, focused on building and maintaining highly available, scalable, and secure infrastructure for AI/ML workloads. This role is critical to ensuring the reliability and performance of our AI services and platform components across cloud environments.

You will work closely with software engineers, ML engineers, and platform architects to design and implement robust monitoring, alerting, and incident response systems. You will also contribute to automating infrastructure provisioning and deployment pipelines, and to tuning the performance of AI workloads in production.

Roles & Responsibilities:

  • Design and implement scalable, resilient cloud infrastructure to support AI/ML workloads.
  • Develop and maintain observability tools including monitoring, logging, and alerting systems for AI platform services.
  • Automate infrastructure provisioning and deployment using Infrastructure-as-Code (IaC) tools.
  • Collaborate with engineering teams to ensure high availability and performance of AI services.
  • Lead incident response and root cause analysis for platform outages or performance degradation.
  • Implement security best practices and compliance controls across cloud environments.
  • Optimize resource usage and cost efficiency of AI workloads in cloud environments.
  • Participate in sprint planning and contribute to platform architecture and reliability strategy.

Must-Have Skills:

  • Strong experience with cloud platforms (AWS, GCP, Azure) and cloud-native services.
  • Proficiency in scripting and automation (Python, Bash, Terraform, etc.).
  • Experience with containerization and orchestration (Docker, Kubernetes).
  • Familiarity with monitoring and logging tools (Prometheus, Grafana, ELK, Datadog).
  • Understanding of CI/CD pipelines and DevOps practices.
  • Experience with incident management, root cause analysis, and reliability engineering.
  • Knowledge of security principles and cloud compliance frameworks.
  • Ability to learn quickly, stay organized, and be detail-oriented.

Good-to-Have Skills:

  • Exposure to AI/ML workloads and performance tuning for model inference and training.
  • Experience with MLOps tools (MLflow, Kubeflow, Airflow).
  • Familiarity with service mesh technologies (Istio, Linkerd).
  • Experience with cost optimization strategies in cloud environments.
  • Knowledge of distributed systems and fault-tolerant architecture.

Education and Professional Certifications:

  • Bachelor's degree in Computer Science, Engineering, or a related field.
  • 5–9 years of experience in cloud infrastructure, DevOps, or SRE roles.
  • Certifications in cloud platforms (AWS Solutions Architect, Azure Administrator, Google Cloud SRE) are a plus.

Soft Skills:

  • Excellent analytical and troubleshooting skills.
  • Strong verbal and written communication skills.
  • Ability to work effectively with global, virtual teams.
  • High degree of initiative and self-motivation.
  • Ability to manage multiple priorities successfully.
  • Team-oriented, with a focus on achieving team goals.
  • Strong presentation and public speaking skills.


  • Hyderabad, Telangana, India Amgen Full time ₹ 12,00,000 - ₹ 24,00,000 per year

    Career Category: Information Systems. Job Description: Role Description: We are looking for a Cloud/Site Reliability Engineer (SRE) to join our AI Platform team, focused on building and maintaining highly available, scalable, and secure infrastructure for AI/ML workloads. This role is critical to ensure the reliability and performance of our AI services and...


  • Hyderabad, Telangana, India Talent Worx Full time ₹ 12,00,000 - ₹ 36,00,000 per year

    Site Reliability Engineer (SRE). At Talent Worx, we are looking for a dedicated Site Reliability Engineer (SRE) to join our team. This role involves maintaining high availability and reliability of our services through the application of software engineering practices and systems administration skills. The ideal candidate will bridge the gap between...

  • Cloud Engineer

    2 weeks ago


    Hyderabad, Telangana, India Deep AI OCR Full time ₹ 5,00,000 - ₹ 15,00,000 per year

    Company Description: Deep AI OCR is an "AI First" company that aims to reimagine document processing in organizations by leveraging advanced AI LLMs. We provide efficient and accurate AI-powered optical character recognition solutions that enhance productivity for businesses of all sizes. Our solutions generate 70% of the code using AI, and our trained Agentic...


  • Hyderabad, Telangana, India Amgen Full time ₹ 12,00,000 - ₹ 36,00,000 per year

    Career Category: Information Systems. Job Description: Join Amgen's Mission of Serving Patients. At Amgen, if you feel like you're part of something bigger, it's because you are. Our shared mission, to serve patients living with serious illnesses, drives all that we do. Since 1980, we've helped pioneer the world of biotech in our fight against the world's toughest...


  • Hyderabad, Telangana, India Oracle Full time ₹ 12,00,000 - ₹ 36,00,000 per year

    Oracle is seeking a motivated Principal Site Reliability Engineer who thrives in a fast-paced, rapidly evolving technology environment. This position requires wide and overall knowledge in Linux administration, AI technologies, software development, cloud computing, networking, cloud security, performance analysis and monitoring to provide the stability,...


  • Hyderabad, Telangana, India Microsoft Full time ₹ 12,00,000 - ₹ 24,00,000 per year

    Join the Azure Specialized AI Infrastructure team in India to drive advancements in Artificial Intelligence (AI) and support high-performance infrastructure for generative AI workloads. As a Senior SRE, you will automate and maintain large-scale distributed systems powering latest AI applications and machine learning models. Your primary focus will be on the...


  • Hyderabad, Telangana, India JPMorganChase Full time ₹ 12,00,000 - ₹ 36,00,000 per year

    Description: There's nothing more exciting than being at the center of a rapidly growing field in technology and applying your skillsets to drive innovation and modernize the world's most complex and mission-critical systems. As a Site Reliability Engineer III at JPMorgan Chase within the Corporate Oversight & Governance Team - Regulatory Controls Ops Risk...


  • Hyderabad, Telangana, India Oracle Full time ₹ 12,00,000 - ₹ 36,00,000 per year

    Description: Oracle is seeking a motivated Principal Site Reliability Engineer who thrives in a fast-paced, rapidly evolving technology environment. This position requires wide and overall knowledge in Linux administration, AI technologies, software development, cloud computing, networking, cloud security, performance analysis and monitoring to provide the...


  • Hyderabad, Telangana, India Microsoft Full time ₹ 12,00,000 - ₹ 36,00,000 per year

    Join the Azure Specialized AI Infrastructure team in India to drive advancements in Artificial Intelligence (AI) and support high-performance infrastructure for generative AI workloads. As a Senior SRE, you will automate and maintain large-scale distributed systems powering latest AI applications and machine learning models. Your primary focus will be on the...


  • Hyderabad, Telangana, India Jade Global Full time ₹ 12,00,000 - ₹ 24,00,000 per year

    Job Title: Senior Site Reliability Engineer (SRE) – Datadog Observability. Experience Required: 8+ years overall in SRE and Infrastructure Operations, with a minimum of 3+ years of hands-on experience in Datadog. Location: Hyderabad preferable, but open to Pune and remote. Job Summary: We are seeking an...