Senior Site Reliability Engineer

1 day ago

Hyderabad, Telangana, India Goldman Sachs Services Pvt Ltd Full time ₹ 12,00,000 - ₹ 36,00,000 per year

Engineering-L2-Hyderabad-Vice President-Software Engineering

Senior Site Reliability Engineer (SRE) Job Description (12 Years Experience)

Short Description for Internal Candidates

The Senior Site Reliability Engineer (SRE) will serve as a technical leader and subject matter expert, responsible for defining, implementing, and optimizing the reliability, performance, and scalability of our most critical, large-scale distributed systems. This role requires a blend of deep technical expertise, strategic thinking, and the ability to mentor and guide other engineers, fostering a culture of operational excellence and continuous improvement across the engineering organization.

ABOUT GOLDMAN SACHS
At Goldman Sachs, we commit our people, capital and ideas to help our clients, shareholders and the communities we serve to grow. Founded in 1869, we are a leading global investment banking, securities and investment management firm. Headquartered in New York, we maintain offices around the world. We believe who you are makes you better at what you do. We're committed to fostering and advancing diversity and inclusion in our own workplace and beyond by ensuring every individual within our firm has a number of opportunities to grow professionally and personally, from our training and development opportunities and firmwide networks to benefits, wellness and personal finance offerings and mindfulness programs. Learn more about our culture, benefits, and people at

We are seeking highly skilled Senior C Developers with 8 to 10 years of experience to take ownership of critical aspects of the Software Development Life Cycle (SDLC). The ideal candidates will have a strong background in C programming, experience mentoring junior developers, and a proactive approach to software upgrades and product enhancements. This role requires technical leadership, collaboration with cross-functional teams, and a deep understanding of system architecture and performance optimization.

Key Responsibilities:

Strategic Reliability Leadership: Lead the development and execution of SRE strategies, best practices, and roadmaps to enhance system reliability, availability, scalability, and efficiency across multiple domains or the entire platform.
Architectural Guidance & Design: Provide expert guidance and hands-on contributions in designing, building, and maintaining robust, fault-tolerant, and highly available architectures for distributed systems, including microservices and orchestrators. This includes influencing product and service roadmaps to ensure reliability is a first-class feature.
Advanced Monitoring & Observability: Architect and implement sophisticated monitoring, alerting, and logging systems to provide deep insights into system health, performance, and user experience. Define and track Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) to drive continuous improvement and manage expectations.
Complex Incident Management & Resolution: Act as a primary point of contact and lead the response for major incidents, performing deep root cause analyses, and implementing strategic improvements to prevent recurrence. Drive a culture of blameless post-mortems and learning.
Advanced Automation & Toil Reduction: Champion and lead the development of advanced automation tools and frameworks to eliminate toil, streamline operational tasks, and improve overall system efficiency, including deployment, configuration management, and incident response.
Performance Engineering & Capacity Planning: Proactively identify and mitigate potential system risks, leading efforts in performance optimization, capacity planning, and efficiency improvements for large-scale production environments.
CI/CD & Release Excellence: Drive the evolution of Continuous Integration/Continuous Deployment (CI/CD) pipelines and release management processes, ensuring safe, efficient, and reliable software delivery at scale.
Mentorship & Technical Leadership: Mentor and guide junior and mid-level SREs and other engineering teams, fostering a culture of knowledge sharing, technical growth, and operational maturity. Provide expert advice on technical and business-related issues.
Cross-functional Collaboration: Collaborate extensively with development, product, security, and infrastructure teams to embed reliability practices throughout the software development lifecycle and ensure alignment with organizational goals.
Technology Evaluation & Adoption: Continuously evaluate emerging tools, technologies, and industry best practices, making recommendations and leading their adoption to enhance operational efficiency and reliability.

Qualifications:

Experience: 12 years of progressive experience in Site Reliability Engineering, Production Engineering, Software Development, or related roles with a strong focus on large-scale distributed production systems.
Technical Leadership: Demonstrated ability to lead technical initiatives, influence architectural decisions, and drive significant improvements in system reliability and performance.
Programming Mastery: Expert-level proficiency in multiple programming languages, such as Python, Go, Java, Ruby, or Bash, with a strong emphasis on writing high-quality, maintainable code for automation and tooling.
Operating Systems: Deep expertise in Linux/Unix operating systems and systems engineering.
Cloud Platforms: Extensive hands-on experience with major cloud providers (e.g., AWS, GCP, Azure) and designing cloud-native solutions.
Containerization & Orchestration: Mastery of container technologies (e.g., Docker) and advanced orchestration tools (e.g., Kubernetes), including designing and managing large-scale Kubernetes deployments.
CI/CD & IaC: Proven experience with advanced CI/CD tools (e.g., Jenkins, GitLab CI, GitHub Actions) and Infrastructure as Code (IaC) principles and tools (e.g., Terraform, Ansible).
Monitoring & Observability Stack: Expertise in designing and implementing comprehensive monitoring, logging, and alerting solutions using tools like Prometheus, Grafana, Datadog, ELK Stack, Splunk, or similar.
Distributed Systems: In-depth understanding and experience with the design, development, and operation of complex distributed systems.
Networking: Advanced knowledge of networking concepts (TCP/IP, DNS, load balancing) and network observability.
Databases: Strong understanding of various database technologies (SQL and NoSQL) and data platforms, especially in a high-performance, high-availability context.
Problem-Solving: Exceptional analytical, problem-solving, and debugging skills for complex, multi-layered systems.
Communication: Excellent written and verbal communication skills, with the ability to articulate complex technical concepts to diverse audiences, including executive leadership.
Interpersonal Skills: Strong ability to collaborate effectively across teams, influence stakeholders, and lead technical discussions.

Preferred Qualifications:

Experience with chaos engineering principles and practices.
Familiarity with compliance and security best practices in large-scale environments.
Contributions to open-source SRE tooling or publications.
Experience in a regulated industry (e.g., financial services, healthcare).

The Goldman Sachs Group, Inc., 2023. All rights reserved. Goldman Sachs is an equal opportunity employer and does not discriminate on the basis of race, color, religion, sex, national origin, age, veteran status, disability, or any other characteristic protected by applicable law.

Experience Level: Executive Level

Senior Site Reliability Engineer

5 days ago

Hyderabad, Telangana, India Instaresz Business Services Pvt Ltd Full time ₹ 20,00,000 - ₹ 25,00,000 per year

Job Title: Senior Site Reliability Engineer (SRE). Experience Required: 10+ Years. Location: Hyderabad (On-site). Employment Type: Full-Time. About Instaresz: Instaresz Business Services Pvt. Ltd. focuses on building and scaling high-performance SaaS products, with expertise in SaaS Product Development, Infrastructure & DevOps, Data & Analytics, and AI & Automation. Our...

Senior Site Reliability Engineer

5 days ago

Hyderabad, Telangana, India Microsoft Full time ₹ 12,00,000 - ₹ 36,00,000 per year

The Windows Cloud division is looking for a Senior Site Reliability Engineer who will help us take the Windows Cloud platform, as well as the Windows 365 Cloud PC and Azure Virtual Desktop business, to the next level. Windows 365 Cloud PC (W365) and Azure Virtual Desktop (AVD) have recently been recognized as leaders in the Gartner Magic Quadrant for Desktop...

Senior Site Reliability Engineer

6 days ago

Hyderabad, Telangana, India Microsoft Full time ₹ 12,00,000 - ₹ 36,00,000 per year

The Windows Cloud division is looking for a Senior Site Reliability Engineer who will help us take the Windows Cloud platform, as well as the Windows 365 Cloud PC and Azure Virtual Desktop business, to the next level. Windows 365 Cloud PC (W365) and Azure Virtual Desktop (AVD) have recently been recognized as leaders in the Gartner Magic Quadrant for Desktop...

Senior Site Reliability Engineer

1 day ago

Hyderabad, Telangana, India CyberArk Full time

Company Description. About CyberArk: CyberArk (NASDAQ: CYBR) is the global leader in Identity Security. Centered on privileged access management, CyberArk provides the most comprehensive security offering for any identity – human or machine – across business applications, distributed workforces, hybrid cloud workloads and throughout the DevOps lifecycle....

Site Reliability Engineer

2 weeks ago

Hyderabad, Telangana, India SID Global Solutions Full time

Job Role: Site Reliability Engineer (SRE) – GCP. Experience: 3+ years. Location: Hyderabad. About SIDGS: SIDGS is a premium global systems integrator and global implementation partner of Google, providing Digital Solutions & Services to Fortune 500 companies. Our Digital solutions span the following domains: User Experience, CMS, API Management,...

Senior Site Reliability Engineer

6 days ago

Hyderabad, Telangana, India Cubic Corporation Full time ₹ 1,04,000 - ₹ 1,30,878 per year

Business Unit: Cubic Transportation Systems. Company Details: When you join Cubic, you become part of a company that creates and delivers technology solutions in transportation to make people's lives easier by simplifying their daily journeys, and defense capabilities to help promote mission success and safety for those who serve their nation. Led by our...

Senior Site Reliability Engineer

2 weeks ago

Hyderabad, Telangana, India Cubic Corporation Full time ₹ 12,00,000 - ₹ 36,00,000 per year

Business Unit: Cubic Transportation Systems. Company Details: When you join Cubic, you become part of a company that creates and delivers technology solutions in transportation to make people's lives easier by simplifying their daily journeys, and defense capabilities to help promote mission success and safety for those who serve their nation. Led by our...

Senior Site Reliability Engineer

5 days ago

Hyderabad, Telangana, India Cubic Defense Full time ₹ 1,04,000 - ₹ 1,30,878 per year

Business Unit: Cubic Transportation Systems. Company Details: When you join Cubic, you become part of a company that creates and delivers technology solutions in transportation to make people's lives easier by simplifying their daily journeys, and defense capabilities to help promote mission success and safety for those who serve their nation. Led by our talented...

Site Reliability Engineer

2 weeks ago

Hyderabad, Telangana, India Amgen Inc Full time ₹ 8,00,000 - ₹ 12,00,000 per year

*What you will do* In this vital role you will be responsible for the reliability, stability, performance, scalability, and security of platforms that support Amgen's digital products and engineering teams. This hands-on role focuses on supporting cloud-based infrastructure, automating operations, maintaining observability, and improving platform reliability...
