High-Performance Infrastructure Specialist

1 week ago


Gurgaon, Haryana, India beBeeReliability Full time ₹ 15,00,000 - ₹ 20,00,000
Job Title:

Senior Site Reliability Engineer


Overview:

The successful candidate will be responsible for designing and implementing large-scale distributed systems with a focus on performance at scale, real-time monitoring, logging, and alerting. The ideal candidate will have a deep understanding of GPU computing and AI infrastructure.


Responsibilities:
  • Design and implement state-of-the-art GPU compute clusters.
  • Optimize cluster operations for maximum reliability, efficiency, and performance.
  • Drive foundational improvements and automation to enhance researcher productivity.
  • Troubleshoot, diagnose, and root-cause system failures, isolating the failing components and failure scenarios while working with internal and external partners.
  • Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
  • Practice sustainable incident response and blameless postmortems, and be part of an on-call rotation to support production systems.
  • Write and review code, develop documentation and capacity plans, and debug the hardest problems live on some of the largest and most complex systems in the world.

Requirements:
  • Bachelor's degree in Computer Science, Electrical Engineering, or a related field (or equivalent experience), with at least 5 years of experience designing and operating large-scale compute infrastructure.
  • Proven experience in site reliability engineering for high-performance computing environments, including operational experience with a cluster of at least 2,000 GPUs.
  • Deep understanding of GPU computing and AI infrastructure.
  • Passion for solving complex technical challenges and optimizing system performance.
  • Experience with advanced AI/HPC job schedulers, ideally including Slurm.
  • Working knowledge of cluster configuration management tools such as BCM or Ansible, and of infrastructure-level applications such as Kubernetes, Terraform, and MySQL.
  • In-depth understanding of container technologies such as Docker and Enroot.
  • Experience programming in Python and Bash scripting.

Benefits:
  • Opportunity to work on cutting-edge technology and contribute to groundbreaking projects.
  • Collaborative and dynamic work environment with talented professionals.
  • Professional development and growth opportunities.
  • Competitive salary and benefits package.

