Compute Cluster SRE Engineer, GPU

1 month ago


Bangalore, India NVIDIA Full time

For two decades, we have pioneered visual computing, the art and science of computer graphics. With our invention of the GPU - the engine of modern visual computing - the field has expanded to encompass video games, movie production, product design, medical diagnosis and scientific research. Today, we stand at the beginning of the next era, the AI computing era, ignited by a new computing model, GPU deep learning. This new model - where deep neural networks are trained to recognize patterns from massive amounts of data - has proven to be deeply effective at solving some of the most complex problems in everyday life.

Farm GPU compute cluster SRE maintains large-scale production systems with high efficiency and availability by combining software and systems engineering practices. This is a highly specialized discipline that demands knowledge across different systems: Slurm/LSF, Unix administration, scripting, capacity management, and open-source technologies. Farm GPU SRE is responsible for developing solutions around our large compute cluster so that it runs efficiently, and for improving the experience of customers as well as of the engineers supporting the cluster. Much of our software development focuses on eliminating manual work through automation, performance tuning, and growing the efficiency of production systems. Practices such as limiting time spent on reactive operational work, blameless postmortems, and proactive identification of potential outages feed into iterative improvement, which is key to product quality and to interesting, dynamic day-to-day work. We promote self-direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn and grow.

What you will be doing:

  • Design, implement, and support large-scale infrastructure with monitoring, logging, and alerting to meet promised uptime.

  • Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.

  • Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, and capacity management.

  • Provide the best possible support for user issues.

  • Maintain infrastructure and services once they are live by measuring and monitoring availability, latency, and overall system health.

  • Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity.

  • Practice sustainable incident response and blameless postmortems.

  • Understand complex and vast infrastructure and support it during on-call weeks.

  • Work with subject-matter experts (SMEs) to provide customers with quality resolutions to production issues.

What we need to see:

  • BS degree in Computer Science or related technical field involving coding (e.g., physics or mathematics) or equivalent.

  • 3+ years of hands-on industry experience in the above-mentioned areas.

  • Must have experience with Linux system administration (Ubuntu, CentOS/Red Hat).

  • Must have experience setting up and administering an HPC cluster scheduler such as Slurm and/or LSF.

  • Experience in one or more of the following: Python, Perl, Bash.

  • Good understanding of open-source IT automation tools such as Ansible.

  • Interest in crafting, analyzing, and fixing large-scale distributed systems.

  • Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.

  • Ability to debug and optimize code and automate routine tasks.

Ways to stand out from the crowd:

  • Experience with Bright Cluster Manager (BCM).

  • Understanding of InfiniBand or Ethernet concepts.

  • Experience with high-speed storage solutions such as Lustre or GPFS.

  • Experience with MPI and PyTorch.

