Distributed Systems Engineer

2 weeks ago


India Restored Cloud Full time

As a Distributed Systems Engineer at Restored Cloud, you will play a key role in designing and optimizing distributed infrastructure for large-scale AI/ML model training and inference. Your primary focus will be on minimizing checkpointing delays, enabling seamless fault recovery, and maximizing resource utilization for models exceeding 1B parameters.


Key Responsibilities:


  • Develop and scale distributed systems tailored for high-performance AI/ML workloads, focusing on eliminating delays caused by traditional checkpointing.
  • Design fault-tolerant and high-availability systems that ensure seamless operation and rapid recovery, even during infrastructure failures.
  • Implement advanced data partitioning, synchronization, and parallel computation techniques to handle terabytes of data and optimize memory usage across multi-node setups.
  • Collaborate with ML and infrastructure engineers to design innovative solutions for distributed training and inference of large-scale models.
  • Identify and resolve performance bottlenecks, particularly those arising from storage, memory, or network constraints in AI workflows.
  • Stay at the forefront of emerging distributed computing trends, such as zero-copy memory sharing, efficient in-memory data storage, and distributed model execution, to ensure your solutions remain cutting-edge.
  • Adapt to new technologies and take on new responsibilities and roles in a fast-paced, growing company.



Minimum Qualifications:


  • Bachelor's degree in Computer Science, Distributed Systems, Computer Engineering, or a related field.
  • 5+ years of experience in designing and implementing distributed systems.
  • Proficiency in programming languages such as Python, C++, or Java.
  • Strong understanding of distributed computing principles, including fault tolerance, synchronization, and parallel computation.
  • Experience with distributed training frameworks such as PyTorch Distributed, TensorFlow Distributed, or DeepSpeed.
  • Familiarity with cloud platforms (AWS, GCP, or Azure) and managing multi-node infrastructure.
  • Demonstrated ability to troubleshoot performance bottlenecks in distributed systems.


Preferred Qualifications:


  • Master’s or Ph.D. in Computer Science, Distributed Systems, Computer Engineering, or a related field.
  • 7+ years of hands-on experience with large-scale distributed systems for AI/ML workloads.
  • Expertise in advanced distributed systems concepts, such as zero-copy memory sharing, RDMA, and NVMe-based storage.
  • Experience working at NVIDIA, AMD, AWS, or a similar organization focused on distributed systems.
  • Proven track record of optimizing distributed systems for AI/ML models with 1B+ parameters.
  • Strong knowledge of network optimization techniques for high-performance computing.
  • Familiarity with cutting-edge AI/ML trends and the ability to integrate them into distributed architectures.


  • India Persistent Systems Full time

    About Position: We are on the lookout for a seasoned Cloud Database Administrator with a specialized focus on distributed database systems and a minimum of 5 years of experience. The successful candidate will be instrumental in managing, scaling, and ensuring the reliability of our distributed databases deployed on cloud infrastructure, with a particular...


  • India SambaNova Systems Full time

    About SambaNova Systems: We are a leading technology company at the forefront of AI and machine learning innovation. Our cutting-edge system software solutions empower businesses to drive transformation and growth. Estimated Salary: $250,000 - $350,000 per year. Job Overview: This role presents a unique opportunity to shape and work on high-performance system...


  • India RAPIDFORT Full time

    Job Summary: Rapidfort is seeking an experienced Cloud Native Systems Engineer to design, implement, and optimize scalable systems for large-scale applications and data processing workflows. The ideal candidate will have a strong background in Python, Linux, and distributed computing. Experience with Docker and Kubernetes (K8s) is required.


  • India AiDASH Full time

    About the Role: We are seeking a highly skilled Senior Software Engineer to join our team at AiDash. As a key member of our engineering team, you will be responsible for designing and building scalable distributed systems that support our mission to make critical infrastructure industries climate-resilient and sustainable. Key Responsibilities: Design, develop,...


  • India Ai Palette Full time

    About the Job: Ai Palette is a cutting-edge Food AI company on a mission to revolutionize the industry with its innovative SaaS platform. We are seeking an experienced Senior Distributed Systems Architect to join our team in Singapore. The ideal candidate will have a strong background in designing and developing large-scale distributed systems, as well as...


  • India Exasoft Full time

    About Exasoft: We are a leading technology company with a strong presence in Singapore. Our team is passionate about delivering innovative solutions that meet the evolving needs of our clients. Salary and Benefits: We offer an attractive salary package, estimated to be around SGD 180,000 per annum, plus additional benefits that include comprehensive health...


  • India Grid Dynamics Full time

    Job Summary: We are seeking an experienced Data Engineer to join our team at Grid Dynamics. The ideal candidate will have a strong background in designing and developing large-scale applications using open-source technologies, with a focus on Big Data technologies like Hadoop, Spark, Hive, and MapReduce. About the Role: This is a hybrid position based in...


  • Anywhere in India/Multiple Locations ca-one tech cloud inc Full time

    Job Title: Distributed Systems Engineer - Kafka and Java Expert. About Us: At Ca-One Tech Cloud Inc., we are a dynamic team of experts who work on cutting-edge technology to deliver high-performance solutions. Our company culture values innovation, collaboration, and career growth. Estimated Salary: ₹20,00,000 - ₹30,00,000 per annum. Job Description: We are...


  • India ExaTech Inc Full time

    ExaTech Inc is seeking an experienced Cloud Data Engineer to join our team in Chennai or Hyderabad, working onsite. This role involves designing and implementing ETL processes for our data warehouse using distributed databases, particularly Informatica Cloud (IICS). We are looking for someone with a strong background in data modeling, schema design, data...


  • India LinkedIn Full time

    Are you a skilled software engineer looking for a challenging role in a fast-paced environment? We have an exciting opportunity for a Senior Software Engineer to join our team at LinkedIn, based in Bangalore, India. About the Role: We are seeking a highly experienced software engineer to design and develop scalable, high-volume performing, and reliable system...


  • India SambaNova Systems Full time

    Working at SambaNova: This role presents a unique opportunity to shape and work on cutting-edge system software solutions for AI and machine learning applications in the enterprise & commercial landscape. The stack spans multiple software layers, and provides products & services including but not limited to OS, software-hardware interface, isolation through...


  • India Yield Engineering Systems Full time

    YES (Yield Engineering Systems, Inc.) is a leading manufacturer of reliable, high-tech, cost-effective capital equipment that transforms materials and surfaces at the nanoscale. From startups to the Fortune 50, our customers rely on YES to help them unleash products that change lives – from cell phones and IoT devices to AI and virtual reality, to...


  • India Neem Full time

    At Neem Consulting, we are seeking a highly skilled Backend Engineer to join our team in India, working remotely or in a hybrid setup. As a Senior Backend Engineer, you will play a key role in building and scaling our client platform, which is an award-winning innovative startup aiming to revolutionise collaboration in hybrid meetings. We have an estimated...

