Systems/Network Engineer

8 hours ago


India BitOoda Full time

Systems/Network Engineer Role Overview



As a Systems/Network Engineer, you will be responsible for architecting, deploying, and maintaining GPU-based compute infrastructure. You will work on bare-metal systems, high-speed networks, and hybrid cloud integrations to ensure maximum performance, reliability, and scalability. This role is primarily remote but may occasionally require on-site support for hardware installations or emergency maintenance.

Key Responsibilities

System Optimization

  • Configure and optimize bare-metal servers, including Linux OS, NVIDIA/AMD GPU drivers, and system libraries.
  • Fine-tune NUMA settings, CPU-GPU affinity, and storage I/O for peak performance.
  • Benchmark and tune HPC systems for specific workloads, ensuring sustained high performance.

GPU Cluster Management

  • Deploy and manage GPU clusters using job orchestration tools like Kubernetes, Slurm, or similar platforms.
  • Monitor GPU utilization, thermals, and overall system health using tools like NVIDIA DCGM, AMD ROCm SMI, and Prometheus/Grafana.

Networking

  • Design and maintain high-speed interconnects and networks (e.g., NVLink, InfiniBand, RDMA) for distributed GPU systems.
  • Optimize data transfer between nodes and reduce latency in cluster communication.

Storage Solutions

  • Manage and configure storage solutions such as NVMe, SSD arrays, Ceph, or Lustre for high-throughput workloads.

Automation

  • Automate system deployment, updates, and monitoring using tools like Ansible, Terraform, or Python scripts.

Security

  • Implement secure access controls, firewalls, and VPNs to protect GPU resources and user data.
  • Ensure compliance with security best practices for HPC environments.

Hybrid/Cloud Integration

  • Manage integrations between on-premises GPU clusters and cloud platforms (e.g., AWS, GCP, Azure).
  • Build and maintain hybrid HPC setups for seamless scalability.

Data Center Infrastructure

  • Work on power, cooling, and rack design for HPC setups, ensuring reliable and efficient operations.
  • Deploy and maintain systems in on-premises or hybrid cloud data center environments.

Required Qualifications

Technical Skills

  • Strong experience with Linux (CentOS, Ubuntu, RHEL) and system-level configuration.
  • Expertise in managing NVIDIA GPU ecosystems (CUDA, NVLink, NVIDIA drivers).
  • Familiarity with AMD ROCm, HIP, or OpenCL for AMD GPUs.
  • Knowledge of high-speed networking protocols (InfiniBand, RDMA, Ethernet).
  • Proficiency in scripting and automation (Python, Bash, Ansible, Terraform).
  • Experience with job orchestration tools like Kubernetes or Slurm.
  • Familiarity with containerization (Docker; NVIDIA Container Toolkit, formerly NVIDIA Docker; Singularity/Apptainer).
  • Understanding of storage technologies, including NVMe and parallel file systems.

Soft Skills

  • Strong analytical and problem-solving skills.
  • Ability to work independently and as part of a remote team.
  • Excellent communication skills for cross-team collaboration.

Preferred Qualifications

  • Experience with hybrid cloud setups, including AWS Outposts, Azure Stack, or GCP Anthos.
  • Hands-on experience with hardware management tools like IPMI/BMC for remote server management.
  • Familiarity with emerging accelerators (e.g., SambaNova, Cerebras, Graphcore).

What We Offer

  • Competitive salary and benefits package.
  • Work with a talented and collaborative team of engineers.
  • Opportunities to work on cutting-edge GPU and HPC projects.
  • A flexible and dynamic startup environment where you can grow and innovate.
  • Opportunities for professional development and continuous learning.

