Compute Cluster SRE Engineer, GPU

1 month ago


Bengaluru, India NVIDIA Full time
For two decades, we have pioneered visual computing, the art and science of computer graphics. With our invention of the GPU - the engine of modern visual computing - the field has expanded to encompass video games, movie production, product design, medical diagnosis and scientific research. Today, we stand at the beginning of the next era, the AI computing era, ignited by a new computing model, GPU deep learning. This new model - where deep neural networks are trained to recognize patterns from massive amounts of data - has proven to be deeply effective at solving some of the most complex problems in everyday life.

Farm GPU compute cluster SRE maintains large-scale production systems with high efficiency and availability using a combination of software and systems engineering practices. This is a highly specialized discipline that demands knowledge across different systems: Slurm/LSF, Unix administration, scripting, capacity management, and open-source technologies. Farm GPU SRE is responsible for developing solutions around our large compute cluster to make it work efficiently and to improve the user experience for customers as well as for the engineers supporting the cluster.

Much of our software development focuses on eliminating manual work through automation, performance tuning, and growing the efficiency of production systems. Practices such as limiting time spent on reactive operational work, blameless postmortems, and proactive identification of potential outages factor into iterative improvement that is key to product quality and interesting and dynamic day-to-day work.

We promote self-direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn and grow.

What you will be doing:

Design, implement and support large-scale infrastructure with monitoring, logging, and alerting to meet promised uptime.

Engage in and improve the whole lifecycle of services—from inception and design through deployment, operation, and refinement.

Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity management.

Support services before they go live through activities such as capacity management, and provide the best possible support for user issues.

Maintain infra and services once they are live by measuring and monitoring availability, latency, and overall system health.

Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity.

Practice sustainable incident response and blameless postmortems.

Understand complex and vast infrastructure and support it during on-call weeks.

Work with different SMEs to help provide customers with quality resolution of production issues.

What we need to see:

BS degree in Computer Science or related technical field involving coding (e.g., physics or mathematics) or equivalent.

3+ years of hands-on industry experience in the above-mentioned areas

Must have experience with Linux system administration (Ubuntu, CentOS/Red Hat).

Must have HPC cluster scheduler experience in setup and administration, such as Slurm and/or LSF.

Experience in one or more of the following: Python, Perl, Bash.

Good understanding of open-source IT automation tools like Ansible.

Interest in crafting, analyzing, and fixing large-scale distributed systems.

Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.

Ability to debug and optimize code and automate routine tasks.

Ways to stand out from the crowd:

Experience with Bright Cluster Manager (BCM).

Understanding of InfiniBand or Ethernet concepts.

Experience with high-speed storage solutions such as Lustre and GPFS.

Experience with MPI, PyTorch.


