Compute Cluster SRE Engineer

2 months ago

Bengaluru, India NVIDIA Full time

For two decades, we have pioneered visual computing, the art and science of computer graphics. With our invention of the GPU - the engine of modern visual computing - the field has expanded to encompass video games, movie production, product design, medical diagnosis and scientific research. Today, we stand at the beginning of the next era, the AI computing era, ignited by a new computing model, GPU deep learning. This new model - where deep neural networks are trained to recognize patterns from massive amounts of data - has proven to be deeply effective at solving some of the most complex problems in everyday life.

The Farm GPU compute cluster SRE team works to maintain large-scale production systems with high efficiency and availability using a combination of software and systems engineering practices. This is a highly specialized discipline that demands knowledge across different systems, including Slurm/LSF, Unix administration, scripting, capacity management, and open-source technologies. Farm GPU SRE is responsible for developing solutions around our large compute cluster to make it work efficiently and to improve the experience of customers as well as of the engineers supporting the cluster. Much of our software development focuses on eliminating manual work through automation, performance tuning, and growing the efficiency of production systems. Practices such as limiting time spent on reactive operational work, blameless postmortems, and proactive identification of potential outages factor into the iterative improvement that is key to product quality and to interesting, dynamic day-to-day work. We promote self-direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn and grow.

What you will be doing:

Design, implement, and support large-scale infrastructure with monitoring, logging, and alerting to meet promised uptime.
Engage in and improve the whole lifecycle of services—from inception and design through deployment, operation, and refinement.
Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, and capacity management.
Provide the best possible support for user issues.
Maintain infrastructure and services once they are live by measuring and monitoring availability, latency, and overall system health.
Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity.
Practice sustainable incident response and blameless postmortems.
Understand complex and vast infrastructure and support it during on-call weeks.
Work with different subject-matter experts (SMEs) to help provide quality resolutions to customers' production issues.

What we need to see:

BS degree in Computer Science or related technical field involving coding (e.g., physics or mathematics) or equivalent.
3+ years of hands-on industry experience in the above-mentioned areas.
Must have experience with Linux system administration (Ubuntu, CentOS/Red Hat).
Must have experience setting up and administering an HPC cluster scheduler such as Slurm and/or LSF.
Experience in one or more of the following: Python, Perl, Bash.
Good understanding of open-source IT automation tools such as Ansible.
Interest in crafting, analyzing, and fixing large-scale distributed systems.
Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
Ability to debug and optimize code and automate routine tasks.

Ways to stand out from the crowd:

Experience with Bright Cluster Manager (BCM).
Understanding of InfiniBand or Ethernet concepts.
Experience with high-speed storage solutions such as Lustre or GPFS.
Experience with MPI and PyTorch.

