Senior Site Reliability Engineer

14 hours ago


Hyderabad, Telangana, India Goldman Sachs Services Pvt Ltd Full time ₹ 12,00,000 - ₹ 36,00,000 per year

Engineering-L2-Hyderabad-Vice President-Software Engineering

Senior Site Reliability Engineer (SRE) Job Description (12 Years' Experience)

Short Description for Internal Candidates

The Senior Site Reliability Engineer (SRE) will serve as a technical leader and subject matter expert, responsible for defining, implementing, and optimizing the reliability, performance, and scalability of our most critical, large-scale distributed systems. This role requires a blend of deep technical expertise, strategic thinking, and the ability to mentor and guide other engineers, fostering a culture of operational excellence and continuous improvement across the engineering organization.

ABOUT GOLDMAN SACHS
At Goldman Sachs, we commit our people, capital and ideas to help our clients, shareholders and the communities we serve to grow. Founded in 1869, we are a leading global investment banking, securities and investment management firm. Headquartered in New York, we maintain offices around the world. We believe who you are makes you better at what you do. We're committed to fostering and advancing diversity and inclusion in our own workplace and beyond by ensuring every individual within our firm has a number of opportunities to grow professionally and personally, from our training and development opportunities and firmwide networks to benefits, wellness and personal finance offerings and mindfulness programs. Learn more about our culture, benefits, and people at

We are seeking highly skilled Senior C Developers with 8 to 10 years of experience to take ownership of critical aspects of the Software Development Life Cycle (SDLC). The ideal candidates will have a strong background in C programming, experience mentoring junior developers, and a proactive approach to software upgrades and product enhancements. This role requires technical leadership, collaboration with cross-functional teams, and a deep understanding of system architecture and performance optimization.

Key Responsibilities:

  • Strategic Reliability Leadership: Lead the development and execution of SRE strategies, best practices, and roadmaps to enhance system reliability, availability, scalability, and efficiency across multiple domains or the entire platform.
  • Architectural Guidance & Design: Provide expert guidance and hands-on contributions in designing, building, and maintaining robust, fault-tolerant, and highly available architectures for distributed systems, including microservices and orchestrators. This includes influencing product and service roadmaps to ensure reliability is a first-class feature.
  • Advanced Monitoring & Observability: Architect and implement sophisticated monitoring, alerting, and logging systems to provide deep insights into system health, performance, and user experience. Define and track Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) to drive continuous improvement and manage expectations.
  • Complex Incident Management & Resolution: Act as a primary point of contact and lead the response for major incidents, performing deep root cause analyses, and implementing strategic improvements to prevent recurrence. Drive a culture of blameless post-mortems and learning.
  • Advanced Automation & Toil Reduction: Champion and lead the development of advanced automation tools and frameworks to eliminate toil, streamline operational tasks, and improve overall system efficiency, including deployment, configuration management, and incident response.
  • Performance Engineering & Capacity Planning: Proactively identify and mitigate potential system risks, leading efforts in performance optimization, capacity planning, and efficiency improvements for large-scale production environments.
  • CI/CD & Release Excellence: Drive the evolution of Continuous Integration/Continuous Deployment (CI/CD) pipelines and release management processes, ensuring safe, efficient, and reliable software delivery at scale.
  • Mentorship & Technical Leadership: Mentor and guide junior and mid-level SREs and other engineering teams, fostering a culture of knowledge sharing, technical growth, and operational maturity. Provide expert advice on technical and business-related issues.
  • Cross-functional Collaboration: Collaborate extensively with development, product, security, and infrastructure teams to embed reliability practices throughout the software development lifecycle and ensure alignment with organizational goals.
  • Technology Evaluation & Adoption: Continuously evaluate emerging tools, technologies, and industry best practices, making recommendations and leading their adoption to enhance operational efficiency and reliability.

Qualifications:

  • Experience: 12 years of progressive experience in Site Reliability Engineering, Production Engineering, Software Development, or related roles with a strong focus on large-scale distributed production systems.
  • Technical Leadership: Demonstrated ability to lead technical initiatives, influence architectural decisions, and drive significant improvements in system reliability and performance.
  • Programming Mastery: Expert-level proficiency in multiple programming languages, such as Python, Go, Java, Ruby, or Bash, with a strong emphasis on writing high-quality, maintainable code for automation and tooling.
  • Operating Systems: Deep expertise in Linux/Unix operating systems and systems engineering.
  • Cloud Platforms: Extensive hands-on experience with major cloud providers (e.g., AWS, GCP, Azure) and designing cloud-native solutions.
  • Containerization & Orchestration: Mastery of container technologies (e.g., Docker) and advanced orchestration tools (e.g., Kubernetes), including designing and managing large-scale Kubernetes deployments.
  • CI/CD & IaC: Proven experience with advanced CI/CD tools (e.g., Jenkins, GitLab CI, GitHub Actions) and Infrastructure as Code (IaC) principles and tools (e.g., Terraform, Ansible).
  • Monitoring & Observability Stack: Expertise in designing and implementing comprehensive monitoring, logging, and alerting solutions using tools like Prometheus, Grafana, Datadog, ELK Stack, Splunk, or similar.
  • Distributed Systems: In-depth understanding and experience with the design, development, and operation of complex distributed systems.
  • Networking: Advanced knowledge of networking concepts (TCP/IP, DNS, load balancing) and network observability.
  • Databases: Strong understanding of various database technologies (SQL and NoSQL) and data platforms, especially in a high-performance, high-availability context.
  • Problem-Solving: Exceptional analytical, problem-solving, and debugging skills for complex, multi-layered systems.
  • Communication: Excellent written and verbal communication skills, with the ability to articulate complex technical concepts to diverse audiences, including executive leadership.
  • Interpersonal Skills: Strong ability to collaborate effectively across teams, influence stakeholders, and lead technical discussions.

Preferred Qualifications:

  • Experience with chaos engineering principles and practices.
  • Familiarity with compliance and security best practices in large-scale environments.
  • Contributions to open-source SRE tooling or publications.
  • Experience in a regulated industry (e.g., financial services, healthcare).

The Goldman Sachs Group, Inc., 2023. All rights reserved. Goldman Sachs is an equal opportunity employer and does not discriminate on the basis of race, color, religion, sex, national origin, age, veteran status, disability, or any other characteristic protected by applicable law.

Experience Level: Executive Level

  • Hyderabad, Telangana, India JA Consulting Full time

    About the job: Role: Senior Site Reliability Engineer, SaaS Real Estate Platform. About the Client: We are hiring on behalf of our reputed SaaS product-based client based in Hyderabad. They are a global leader in real estate software development. The Role: We're seeking a Senior Site Reliability Engineer (SRE) with a strong Software Engineering background...


  • Hyderabad, Telangana, India Microsoft Full time

    The Windows Cloud division is looking for a Senior Site Reliability Engineer who will help us take the Windows Cloud platform, as well as the Windows 365 Cloud PC and Azure Virtual Desktop business, to the next level. Windows 365 Cloud PC (W365) and Azure Virtual Desktop (AVD) have recently been recognized as leaders in the Gartner Magic Quadrant™ for...


  • Hyderabad, Telangana, India Microsoft Full time ₹ 12,00,000 - ₹ 36,00,000 per year

    The Windows Cloud division is looking for a Senior Site Reliability Engineer who will help us take the Windows Cloud platform, as well as the Windows 365 Cloud PC and Azure Virtual Desktop business, to the next level. Windows 365 Cloud PC (W365) and Azure Virtual Desktop (AVD) have recently been recognized as leaders in the Gartner Magic Quadrant for Desktop...


  • Hyderabad, Telangana, India INDIGLOBE IT SOLUTIONS PRIVATE LIMITED Full time

    Job Summary: We are looking for a Senior Site Reliability Engineer (SRE) to join our growing Engineering team. As an SRE, you will play a key role in ensuring the reliability, scalability, and performance of our production systems across a multi-cloud environment (GCP & AWS). You'll be responsible for owning application support, maintaining our microservices...


  • Hyderabad, Telangana, India Talent Worx Full time ₹ 15,00,000 - ₹ 25,00,000 per year

    Site Reliability Engineer (SRE): At Talent Worx, we are looking for a dedicated Site Reliability Engineer (SRE) to join our team. This role involves maintaining high availability and reliability of our services through the application of software engineering practices and systems administration skills. The ideal candidate will bridge the gap between...


  • Hyderabad, Telangana, India Chase Bank Full time

    Job Description: Guide and shape the future of technology at a globally recognized firm, driven by pride in ownership. As a Senior Manager of Site Reliability Engineering at JPMorgan Chase within Consumer & Community Banking, you are the non-functional requirement owner and champion for the applications in your remit. You are a key influencer in your team's...


  • Hyderabad, Telangana, India Cubic Corporation Full time

    Job Description: Business Unit: Cubic Transportation Systems. Company Details: When you join Cubic, you become part of a company that creates and delivers technology solutions in transportation to make people's lives easier by simplifying their daily journeys, and defense capabilities to help promote mission success and safety for those who serve their nation. Led...


  • Hyderabad, Telangana, India Cubic Corporation Full time ₹ 1,20,000 - ₹ 2,60,000 per year

    Business Unit: Cubic Transportation Systems. Company Details: When you join Cubic, you become part of a company that creates and delivers technology solutions in transportation to make people's lives easier by simplifying their daily journeys, and defense capabilities to help promote mission success and safety for those who serve their nation. Led by our...


  • Hyderabad, Telangana, India Cubic Corporation Full time ₹ 1,04,000 - ₹ 1,30,878 per year

    Business Unit: Cubic Transportation Systems. Company Details: When you join Cubic, you become part of a company that creates and delivers technology solutions in transportation to make people's lives easier by simplifying their daily journeys, and defense capabilities to help promote mission success and safety for those who serve their nation. Led by our...