Senior Site Reliability Engineer

3 weeks ago


Bangalore, India SambaNova Systems Full time
The era of pervasive AI has arrived. In this era, organizations will use generative AI to unlock hidden value in their data, accelerate processes, reduce costs, drive efficiency and innovation to fundamentally transform their businesses and operations at scale.
SambaNova Suite™ is the first full-stack, generative AI platform, from chip to model, optimized for enterprise and government organizations. Powered by the intelligent SN40L chip, the SambaNova Suite is a fully integrated platform, delivered on-premises or in the cloud, combined with state-of-the-art open-source models that can be easily and securely fine-tuned using customer data for greater accuracy. Once adapted with customer data, customers retain model ownership in perpetuity, so they can turn generative AI into one of their most valuable assets.
SambaNova’s mission is to be the number 1 platform for business AI. We are a full-stack provider of AI-specific chips, software, and models that come together to help every organization accelerate their AI journey.
This role presents a unique opportunity to shape the future of AI and the value it can unlock across every aspect of an organization’s business and operations, including building, securing, operating, and scaling the platform and infrastructure that enable us to deliver our groundbreaking capabilities to enterprise customers.
Job Description
As a site reliability engineer on the operations team, you will solve interesting challenges in a fast-paced environment by designing, deploying, and troubleshooting state-of-the-art AI platforms and services with great attention to reliability, security, scalability, operability, and performance. Working alongside engineering teams that are building cutting-edge technologies revolutionizing the AI landscape, you will leverage your experience across software, systems, infrastructure, and production operations to lead key initiatives that enable us to rapidly deliver reliable and scalable service for customers in a hybrid deployment pattern.
The ideal candidate for this highly visible and critical role will have the knowledge of a software engineer, the experience of a systems and infrastructure engineer, and a strong passion for troubleshooting and automation across bare metal datacenter infrastructure and public cloud services.
This individual will be responsible for
Assume full-stack ownership for the successful delivery of our SambaNova services in a hybrid model, including, but not limited to, deployment, configuration, integrations, observability, and ongoing operations
Develop deep understanding of the end-to-end configurations, dependencies, customer requirements, and overall characteristics of the production services as the accountable owner for overall service operations
Systems and application administration for multiple customer-facing production environments (hosted and on-premises), with a continued focus on improving efficiency, availability, and supportability through automation and well-defined runbooks
Partner and collaborate with product and engineering teams to recommend and implement improvements to the security, resilience, and operational readiness of our systems, with the flexibility to integrate into unique customer environments
Augment ongoing efforts to design and develop automation for deployments, updates and upgrades of the entire SambaNova software stack
Lead efforts to triage, debug, and fix issues related to networks, storage, operating systems, containers, and applications to drive proactive and reactive incident resolution and root cause analysis
Build the systems and tools for centralized command and control of distributed environments
Participate in on-call rotation responsibilities
Basic Qualifications
Bachelor's and/or Master's degree in CS or a related field
10+ years of hands-on experience in SRE / production engineering roles with a focus on supporting, scaling, and ensuring the reliability of large-scale production services and infrastructure
Extensive experience in deploying, securing, managing, and operating Linux systems in globally distributed production environments
Good knowledge of containers with hands-on experience in deploying, managing, and troubleshooting Kubernetes clusters and components in private data centers as well as public cloud
Proficient with at least one modern programming language (Python preferred) and the willingness to learn new languages as required
A systematic problem-solving approach to troubleshooting and the desire to solve the root cause of common problems in 24x7 environments
Preferred Qualifications
Deep understanding of DNS, DHCP, LDAP, NFS, Kerberos, PAM, PXE, SNMP, SSH, HTTP/S, and NTP, and of troubleshooting network performance issues
Must have past experience deploying and managing systems and infrastructure in data centers, with the ability to debug and resolve recurring hardware issues.
Experience delivering infrastructure as code - Ansible, Terraform, Git, Jenkins, Helm, and ArgoCD
Good working knowledge of build automation and continuous integration / delivery
Knowledge of virtualization and multiple hypervisor technologies
Experience with monitoring and logging systems such as Prometheus, Grafana, Nagios, ELK, etc. and the ability to identify new technologies as appropriate
Experience deploying applications and managing infrastructure in one or more public cloud providers (AWS, Azure, GCP) is highly desirable
Configuration and maintenance of web servers, load balancers, databases, storage systems and messaging systems
A passion to design for high availability and scale, with the discipline and desire for extensive automation
Strong communication skills with the ability and willingness to work with diverse teams and customers across multiple time zones
Preferred Qualifications
Experience working in a high-growth startup
A team player who demonstrates humility
Action-oriented with a focus on speed and results
Ability to thrive in a no-boundaries culture and make an impact on innovation
Benefits Summary for US-Based Full-Time Direct Employment Positions
(The Recruiter will provide benefit details for non-US-based roles)
SambaNova offers a competitive total rewards package, including base salary, equity, and benefits. We cover 95% of the premium for employee medical insurance and 77% of the premium for dependents, and we offer a Health Savings Account (HSA) with employer contribution. We also offer Dental, Vision, Short/Long-Term Disability, Basic Life, Voluntary Life, and AD&D insurance plans, in addition to Flexible Spending Account (FSA) options such as Health Care, Limited Purpose, and Dependent Care. Our library of well-being benefits available to you and your dependents includes a full subscription to Headspace, Gympass+ membership with access to physical gyms, One Medical membership, counseling services with an Employee Assistance Program, and much more.
