Systems/Network Engineer – High-Performance Compute GPU Infrastructure

Delhi, India BitOoda Full time

Role Overview

As a Systems/Network Engineer, you will be responsible for architecting, deploying, and maintaining GPU-based compute infrastructure. You will work on bare-metal systems, high-speed networks, and hybrid cloud integrations to ensure maximum performance, reliability, and scalability. This role is primarily remote but may occasionally require on-site support for hardware installations or emergency maintenance.

Key Responsibilities

System Optimization

- Configure and optimize bare-metal servers, including Linux OS, NVIDIA/AMD GPU drivers, and system libraries.
- Fine-tune NUMA settings, CPU-GPU affinity, and storage I/O for peak performance.
- Benchmark and tune HPC systems for specific workloads, ensuring sustained high performance.

GPU Cluster Management

- Deploy and manage GPU clusters using job orchestration tools like Kubernetes, Slurm, or similar platforms.
- Monitor GPU utilization, thermals, and overall system health using tools like NVIDIA DCGM, AMD ROCm SMI, and Prometheus/Grafana.

Networking

- Design and maintain high-speed networking solutions (e.g., NVLink, InfiniBand, RDMA) for distributed GPU systems.
- Optimize data transfer between nodes and reduce latency in cluster communication.

Storage Solutions

- Manage and configure storage solutions such as NVMe, SSD arrays, Ceph, or Lustre for high-throughput workloads.

Automation

- Automate system deployment, updates, and monitoring using tools like Ansible, Terraform, or Python scripts.

Security

- Implement secure access controls, firewalls, and VPNs to protect GPU resources and user data.
- Ensure compliance with security best practices for HPC environments.

Hybrid/Cloud Integration

- Manage integrations between on-premises GPU clusters and cloud platforms (e.g., AWS, GCP, Azure).
- Build and maintain hybrid HPC setups for seamless scalability.

Data Center Infrastructure

- Work on power, cooling, and rack design for HPC setups, ensuring reliable and efficient operations.
- Deploy and maintain systems in on-premises or hybrid cloud data center environments.

Required Qualifications

Technical Skills

- Strong experience with Linux (CentOS, Ubuntu, RHEL) and system-level configuration.
- Expertise in managing NVIDIA GPU ecosystems (CUDA, NVLink, NVIDIA drivers).
- Familiarity with AMD ROCm, HIP, or OpenCL for AMD GPUs.
- Knowledge of high-speed networking protocols (InfiniBand, RDMA, Ethernet).
- Proficiency in scripting and automation (Python, Bash, Ansible, Terraform).
- Experience with job orchestration tools like Kubernetes or Slurm.
- Familiarity with containerization (Docker, NVIDIA Docker, Singularity).
- Understanding of storage technologies, including NVMe and parallel file systems.

Soft Skills

- Strong analytical and problem-solving skills.
- Ability to work independently and as part of a remote team.
- Excellent communication skills for cross-team collaboration.

Preferred Qualifications

- Experience with hybrid cloud setups, including AWS Outposts, Azure Stack, or GCP Anthos.
- Hands-on experience with hardware management tools like IPMI/BMC for remote server management.
- Familiarity with emerging accelerators (e.g., SambaNova, Cerebras, Graphcore).

What We Offer

- Competitive salary and benefits package.
- Work with a talented and collaborative team of engineers.
- Opportunities to work on cutting-edge GPU and HPC projects.
- A flexible and dynamic startup environment where you can grow and innovate.
- Opportunities for professional development and continuous learning.
