Site Reliability Engineer II
1 week ago
Join the Azure Specialized AI Infrastructure team in India to drive advancements in Artificial Intelligence (AI) and support high-performance infrastructure for generative AI workloads. As a Senior SRE, you will automate and maintain large-scale distributed systems powering the latest AI applications and machine learning models. Your primary focus will be on the reliability, scalability, and performance of AI infrastructure, ensuring seamless operations for mission-critical AI services. The role emphasizes a start-up mindset, collaboration, and customer advocacy.
Microsoft's mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
Responsibilities:
- Reliability: Ensure the reliability, scalability, and security of AI infrastructure supporting HPC & AI workloads.
- Incident Management: Lead incident response, root cause analysis, and continuous improvement to minimize downtime and optimize service availability.
- Performance Optimization: Identify and resolve bottlenecks in compute, storage, networking, and specialized hardware (GPUs, InfiniBand) to enhance AI system performance.
- Infrastructure Automation: Develop and maintain automation tools for deployment, monitoring, predictive analysis and management of AI infrastructure, including containerized environments (Kubernetes, Docker).
- Technical Leadership: Provide technical guidance in cloud and AI infrastructure technologies, collaborating with cross-functional teams to drive innovation and best practices.
- Customer Advocacy: Act as a customer advocate, focusing on service excellence and live site reliability for AI workloads.
- Research & Innovation: Stay informed on emerging AI infrastructure technologies and industry trends, recommending adoption where beneficial.
Required/Minimum Qualifications:
- 4+ years technical experience in software engineering, network engineering, or systems administration
- OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 1+ year(s) technical experience in software engineering, network engineering, or systems administration
- OR Master's Degree in Computer Science, Information Technology, or related field.
- Proven ability to modify componentized, well-architected infrastructure software and collaborate across teams.
- Proficient technical design, analytical, and debugging abilities.
- 1+ years experience with incident management and reliability engineering in cloud or AI environments.
- Excellent interpersonal, communication, and collaboration skills.
Other Requirements:
- Ability to meet Microsoft, customer and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings:
- Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
Additional or Preferred Qualifications:
- 5+ years technical experience in software engineering, network engineering, or systems administration
- OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
- OR Master's Degree in Computer Science, Information Technology, or related field AND 1+ year(s) technical experience in software engineering, network engineering, or systems administration
- 1+ year(s) people management experience.
- Experience in distributed systems and/or cloud platforms (Azure, Kubernetes, Docker, containers ecosystem).
- Experience with GPUs, InfiniBand, or similar high-performance technologies.
- Proficiency in RDMA (Remote Direct Memory Access), MPI (Message Passing Interface), and high-performance computing architecture.
- Proficiency in scripting (PowerShell, Shell script, etc.) and deep expertise in Linux.
Microsoft is an equal opportunity employer. Consistent with applicable law, all qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations.
-
Site Reliability Engineer II
11 hours ago
Hyderabad, Telangana, India NCR Atleos Full time ₹ 9,00,000 - ₹ 12,00,000 per year
About NCR Atleos: NCR Atleos, headquartered in Atlanta, is a leader in expanding financial access. Our dedicated 20,000 employees optimize the branch, improve operational efficiency and maximize self-service availability for financial institutions and retailers across the globe. Job Title: Site Reliability Engineer II. Location: Hyderabad. Job Type: Full-Time, 24*7...
-
Site Reliability Engineer
4 days ago
Hyderabad, Telangana, India Oracle Financial Services Software Ltd Full time ₹ 12,00,000 - ₹ 36,00,000 per year
Principal Site Reliability Engineer. Oracle is seeking a motivated Principal Site Reliability Engineer who thrives in a fast-paced, rapidly evolving technology environment. This position requires broad knowledge of Linux administration, AI technologies, software development, cloud computing, networking, cloud security, performance analysis and...
-
Site Reliability Engineer
2 weeks ago
Hyderabad, Telangana, India Jigya Software Services Full time ₹ 1,50,000 - ₹ 28,00,000 per year
Job Title: Senior Site Reliability Engineer (SRE) - AWS/Kubernetes. Location: Hyderabad - Onsite. Job Type: Full-Time. About the Role: We are looking for a highly skilled and motivated Site Reliability Engineer to design, build, and maintain our high-performance, scalable cloud infrastructure. You will play a critical role in ensuring the reliability, performance, and...
-
Site Reliability Engineer
4 days ago
Hyderabad, Telangana, India SMARTWORK IT SERVICES Full time ₹ 12,00,000 - ₹ 24,00,000 per year
Role: Site Reliability Engineer (SRE). Location: Hyderabad. Experience: 10 to 15 Years. Job Summary: The Site Reliability Engineer (SRE) will play a critical role in ensuring the reliability, scalability, and performance of Citizens Bank's enterprise systems and cloud environments. The ideal candidate brings deep technical...
-
Site Reliability Engineer
2 weeks ago
Hyderabad, Telangana, India Evalify-IQ Full time ₹ 6,00,000 - ₹ 18,00,000 per year
Skills Required: AWS, Azure, Terraform, CloudFormation, Pulumi, CI/CD, GitHub Actions, GitLab CI, Jenkins, ArgoCD, Prometheus, Splunk, Grafana, Cloudwatch, Datadog, SRE, Site Reliability, Python, Powershell, Shell, Go, Kubernetes, Docker, Performance Tuning, Performance Enhancements, Performance. Experience Range: 2 - 5...
-
Principal Site Reliability Engineer
8 hours ago
Hyderabad, Telangana, India Oracle Full time ₹ 12,00,000 - ₹ 36,00,000 per year
Oracle is seeking a motivated Principal Site Reliability Engineer who thrives in a fast-paced, rapidly evolving technology environment. This position requires broad knowledge of Mainframe zLinux, DB2, zVM, and AIX. The Site Reliability Engineer is expected to work with multiple service and product development teams, identifying cross-team issues that...
-
Site Reliability Engineer
2 weeks ago
Hyderabad, Telangana, India SS&C TECHNOLOGIES Full time ₹ 5,00,000 - ₹ 12,00,000 per year
Site Reliability Engineer (PA2025Q3JB087). As a leading financial services and healthcare technology company based on revenue, SS&C is headquartered in Windsor, Connecticut, and has 27,000 employees in 35 countries. Some 20,000 financial services and healthcare organizations, from the world's largest companies to small and mid-market firms, rely on SS&C for...
-
Principal Site Reliability Engineer
3 hours ago
Hyderabad, Telangana, India Oracle Full time ₹ 12,00,000 - ₹ 36,00,000 per year
Oracle is seeking a motivated Principal Site Reliability Engineer who thrives in a fast-paced, rapidly evolving technology environment. This position requires broad knowledge of Linux administration, AI technologies, software development, cloud computing, networking, cloud security, performance analysis and monitoring to provide the stability,...
-
Lead Site Reliability Engineer
4 days ago
Hyderabad, Telangana, India Opentext Full time US$ 90,000 - US$ 1,20,000 per year
The Opportunity: The role of Site Reliability Engineer is to build solutions to enhance the availability, performance and stability of OpenText services as well as automating away repetitive work. You are great at: Provide attention to incidents according to Service Level Agreements. Take ownership and accountability for the incident resolution process. Exhibit...
-
Site Reliability Engineer
1 week ago
Hyderabad, Telangana, India SID Global Solutions Full time ₹ 9,00,000 - ₹ 12,00,000 per year
Job Role: Site Reliability Engineer (SRE) – GCP. Experience: 3+ years. Location: Hyderabad. About SIDGS: SIDGS is a premium global systems integrator and global implementation partner of Google, providing Digital Solutions & Services to Fortune 500 companies. Our Digital solutions span the following domains: User Experience, CMS, API Management,...