Restored Cloud | Distributed Systems Engineer

2 weeks ago


India Restored Cloud Full time

As a Distributed Systems Engineer at Restored Cloud, you will play a key role in designing and optimizing distributed infrastructure for large-scale AI/ML model training and inference. Your primary focus will be on challenges such as minimizing checkpointing delays, enabling seamless fault recovery, and maximizing resource utilization for models exceeding 1B parameters.


Key Responsibilities:


  • Develop and scale distributed systems tailored for high-performance AI/ML workloads, focusing on eliminating delays caused by traditional checkpointing.
  • Design fault-tolerant and high-availability systems that ensure seamless operation and rapid recovery, even during infrastructure failures.
  • Implement advanced data partitioning, synchronization, and parallel computation techniques to handle terabytes of data and optimize memory usage across multi-node setups.
  • Collaborate with ML and infrastructure engineers to design innovative solutions for distributed training and inference of large-scale models.
  • Identify and resolve performance bottlenecks, particularly those arising from storage, memory, or network constraints in AI workflows.
  • Stay at the forefront of emerging distributed computing trends, such as zero-copy memory sharing, efficient in-memory data storage, and distributed model execution, to ensure your solutions remain cutting-edge.
  • Adapt to new technologies and take on new responsibilities and roles in a fast-paced, growing company.



Minimum Qualifications:


  • Bachelor's degree in Computer Science, Distributed Systems, Computer Engineering, or a related field.
  • 5+ years of experience in designing and implementing distributed systems.
  • Proficiency in programming languages such as Python, C++, or Java.
  • Strong understanding of distributed computing principles, including fault tolerance, synchronization, and parallel computation.
  • Experience with distributed training frameworks such as PyTorch Distributed, TensorFlow Distributed, or DeepSpeed.
  • Familiarity with cloud platforms (AWS, GCP, or Azure) and managing multi-node infrastructure.
  • Demonstrated ability to troubleshoot performance bottlenecks in distributed systems.


Preferred Qualifications:


  • Master’s or Ph.D. in Computer Science, Distributed Systems, Computer Engineering, or a related field.
  • 7+ years of hands-on experience with large-scale distributed systems for AI/ML workloads.
  • Expertise in advanced distributed systems concepts, such as zero-copy memory sharing, RDMA, and NVMe-based storage.
  • Experience working at Nvidia, AMD, AWS, or a similar distributed systems-focused organization.
  • Proven track record of optimizing distributed systems for AI/ML models with 1B+ parameters.
  • Strong knowledge of network optimization techniques for high-performance computing.
  • Familiarity with cutting-edge AI/ML trends and the ability to integrate them into distributed architectures.


  • India Restored Cloud Full time

    Machine Learning Engineer - Infrastructure Job Description: As a Machine Learning Engineer specializing in infrastructure at Restored Cloud, you will design and build the tools, frameworks, and systems that enable efficient training, deployment, and scaling of machine learning models. You will work on cutting-edge challenges in model optimization,...


  • India Restored Cloud Full time

    Job Overview: Restored Cloud is seeking a skilled Distributed Systems Engineer to design and optimize distributed infrastructure for large-scale AI/ML model training and inference. As a key member of our team, you will address challenges like minimizing checkpointing delays, enabling seamless fault recovery, and maximizing resource utilization for models...


  • India Restored Cloud Full time

    At Restored Cloud, we are seeking an experienced Cloud Infrastructure Machine Learning Architect to design and build cutting-edge tools, frameworks, and systems for efficient machine learning model training, deployment, and scaling. The ideal candidate will have a strong background in cloud infrastructure, machine learning, and software development....


  • India Persistent Systems Full time

    About Position: We are on the lookout for a seasoned Cloud Database Administrator with a specialized focus on distributed database systems and a minimum of 5 years of experience. The successful candidate will be instrumental in managing, scaling, and ensuring the reliability of our distributed databases deployed on cloud infrastructure, with a particular...


  • Anywhere in India/Multiple Locations Ca-One Tech Cloud Inc Full time

    Job Title: Distributed Systems Engineer - Kafka and Java Expert. About Us: At Ca-One Tech Cloud Inc., we are a dynamic team of experts who work on cutting-edge technology to deliver high-performance solutions. Our company culture values innovation, collaboration, and career growth. Estimated Salary: ₹20,00,000 - ₹30,00,000 per annum. Job Description: We are...


  • India ExaTech Inc Full time

    ExaTech Inc is seeking an experienced Cloud Data Engineer to join our team in Chennai or Hyderabad, working onsite. This role involves designing and implementing ETL processes for our data warehouse using distributed databases, particularly Informatica Cloud (IICS). We are looking for someone with a strong background in data modeling, schema design, data...


  • India RAPIDFORT Full time

    Job Summary: Rapidfort is seeking an experienced Cloud Native Systems Engineer to design, implement, and optimize scalable systems for large-scale applications and data processing workflows. The ideal candidate will have a strong background in Python, Linux, and distributed computing. Experience with Docker and Kubernetes (K8s) is required.


  • India Neem Full time

    At Neem Consulting, we are seeking a highly skilled Backend Engineer to join our team in India, working remotely or in a hybrid setup. As a Senior Backend Engineer, you will play a key role in building and scaling our client platform, which is an award-winning innovative startup aiming to revolutionise collaboration in hybrid meetings. We have an estimated...


  • India SambaNova Systems Full time

    About SambaNova Systems: We are a leading technology company at the forefront of AI and machine learning innovation. Our cutting-edge system software solutions empower businesses to drive transformation and growth. Estimated Salary: $250,000 - $350,000 per year. Job Overview: This role presents a unique opportunity to shape and work on high-performance system...


  • India Airties Full time

    Airties is a leading provider of Wi-Fi Mesh solutions to operators globally, empowering broadband providers to deliver seamless wireless integration and increased coverage. We are seeking an experienced Cloud Systems Engineer Leader to join our team in Bangalore, India. Job Overview: The successful candidate will lead the development and maintenance of our...


  • India Exasoft Full time

    About Exasoft: We are a leading technology company with a strong presence in Singapore. Our team is passionate about delivering innovative solutions that meet the evolving needs of our clients. Salary and Benefits: We offer an attractive salary package, estimated to be around SGD 180,000 per annum, plus additional benefits that include comprehensive health...