Distributed Systems Engineer
2 weeks ago
As a Distributed Systems Engineer at Restored Cloud, you will play a key role in designing and optimizing distributed infrastructure for large-scale AI/ML model training and inference. Your primary focus will be on challenges such as minimizing checkpointing delays, enabling seamless fault recovery, and maximizing resource utilization for models exceeding 1B parameters.
Key Responsibilities:
- Develop and scale distributed systems tailored for high-performance AI/ML workloads, focusing on eliminating delays caused by traditional checkpointing.
- Design fault-tolerant and high-availability systems that ensure seamless operation and rapid recovery, even during infrastructure failures.
- Implement advanced data partitioning, synchronization, and parallel computation techniques to handle terabytes of data and optimize memory usage across multi-node setups.
- Collaborate with ML and infrastructure engineers to design innovative solutions for distributed training and inference of large-scale models.
- Identify and resolve performance bottlenecks, particularly those arising from storage, memory, or network constraints in AI workflows.
- Stay at the forefront of emerging distributed computing trends, such as zero-copy memory sharing, efficient in-memory data storage, and distributed model execution, to ensure your solutions remain cutting-edge.
- Adapt to new technologies and take on new responsibilities and roles in a fast-paced, growing company.
Minimum Qualifications:
- Bachelor's degree in Computer Science, Distributed Systems, Computer Engineering, or a related field.
- 5+ years of experience in designing and implementing distributed systems.
- Proficiency in programming languages such as Python, C++, or Java.
- Strong understanding of distributed computing principles, including fault tolerance, synchronization, and parallel computation.
- Experience with distributed training frameworks such as PyTorch Distributed, TensorFlow Distributed, or DeepSpeed.
- Familiarity with cloud platforms (AWS, GCP, or Azure) and managing multi-node infrastructure.
- Demonstrated ability to troubleshoot performance bottlenecks in distributed systems.
Preferred Qualifications:
- Master’s or Ph.D. in Computer Science, Distributed Systems, Computer Engineering, or a related field.
- 7+ years of hands-on experience with large-scale distributed systems for AI/ML workloads.
- Expertise in advanced distributed systems concepts, such as zero-copy memory sharing, RDMA, and NVMe-based storage.
- Experience working at Nvidia, AMD, AWS, or a similar distributed systems-focused organization.
- Proven track record of optimizing distributed systems for AI/ML models with 1B+ parameters.
- Strong knowledge of network optimization techniques for high-performance computing.
- Familiarity with cutting-edge AI/ML trends and the ability to integrate them into distributed architectures.
-
Cloud Database Administrator
1 month ago
India Persistent Systems Full time
About Position: We are on the lookout for a seasoned Cloud Database Administrator with a specialized focus on distributed database systems and a minimum of 5 years of experience. The successful candidate will be instrumental in managing, scaling, and ensuring the reliability of our distributed databases deployed on cloud infrastructure, with a particular...
-
Senior Distributed Systems Engineer
2 weeks ago
India SambaNova Systems Full time
About SambaNova Systems: We are a leading technology company at the forefront of AI and machine learning innovation. Our cutting-edge system software solutions empower businesses to drive transformation and growth. Estimated Salary: $250,000 - $350,000 per year. Job Overview: This role presents a unique opportunity to shape and work on high-performance system...
-
Senior Distributed Systems Engineer
1 month ago
India RAPIDFORT Full time
Job Summary: Rapidfort is seeking an experienced Cloud Native Systems Engineer to design, implement, and optimize scalable systems for large-scale applications and data processing workflows. The ideal candidate will have a strong background in Python, Linux, and distributed computing. Experience with Docker and Kubernetes (K8s) is required.
-
Senior Software Engineer
1 week ago
India AiDASH Full time
About the Role: We are seeking a highly skilled Senior Software Engineer to join our team at AiDash. As a key member of our engineering team, you will be responsible for designing and building scalable distributed systems that support our mission to make critical infrastructure industries climate-resilient and sustainable. Key Responsibilities: Design, develop,...
-
Senior Distributed Systems Architect
4 weeks ago
India Ai Palette Full time
About the Job: Ai Palette is a cutting-edge Food AI company on a mission to revolutionize the industry with its innovative SaaS platform. We are seeking an experienced Senior Distributed Systems Architect to join our team in Singapore. The ideal candidate will have a strong background in designing and developing large-scale distributed systems, as well as...
-
Distributed Systems Engineer
2 weeks ago
India Exasoft Full time
About Exasoft: We are a leading technology company with a strong presence in Singapore. Our team is passionate about delivering innovative solutions that meet the evolving needs of our clients. Salary and Benefits: We offer an attractive salary package, estimated to be around SGD 180,000 per annum, plus additional benefits that include comprehensive health...
-
Data Engineer for Distributed Systems
2 weeks ago
India Grid Dynamics Full time
Job Summary: We are seeking an experienced Data Engineer to join our team at Grid Dynamics. The ideal candidate will have a strong background in designing and developing large-scale applications using open-source technologies, with a focus on Big Data technologies like Hadoop, Spark, Hive, and MapReduce. About the Role: This is a hybrid position based in...
-
Distributed Systems Engineer
3 weeks ago
Anywhere in India/Multiple Locations ca-one tech cloud inc Full time
Job Title: Distributed Systems Engineer - Kafka and Java Expert. About Us: At Ca-One Tech Cloud Inc., we are a dynamic team of experts who work on cutting-edge technology to deliver high-performance solutions. Our company culture values innovation, collaboration, and career growth. Estimated Salary: ₹20,00,000 - ₹30,00,000 per annum. Job Description: We are...
-
Cloud Data Engineer for Distributed Systems
4 weeks ago
India ExaTech Inc Full time
ExaTech Inc is seeking an experienced Cloud Data Engineer to join our team in Chennai or Hyderabad, working onsite. This role involves designing and implementing ETL processes for our data warehouse using distributed databases, particularly Informatica Cloud (IICS). We are looking for someone with a strong background in data modeling, schema design, data...
-
India LinkedIn Full time
Are you a skilled software engineer looking for a challenging role in a fast-paced environment? We have an exciting opportunity for a Senior Software Engineer to join our team at LinkedIn, based in Bangalore, India. About the Role: We are seeking a highly experienced software engineer to design and develop scalable, high-volume performing, and reliable system...
-
SambaNova Systems | Principal Software Engineer
2 weeks ago
India SambaNova Systems Full time
Working at SambaNova: This role presents a unique opportunity to shape and work on cutting-edge system software solutions for AI and machine learning applications in the enterprise & commercial landscape. The stack spans multiple software layers, and provides products & services including but not limited to OS, software-hardware interface, isolation through...
-
India Yield Engineering Systems Full time
YES (Yield Engineering Systems, Inc.) is a leading manufacturer of reliable, high-tech, cost-effective capital equipment that transforms materials and surfaces at the nanoscale. From startups to the Fortune 50, our customers rely on YES to help them unleash products that change lives – from cell phones and IoT devices to AI and virtual reality, to...
-
India Neem Full time
At Neem Consulting, we are seeking a highly skilled Backend Engineer to join our team in India, working remotely or in a hybrid setup. As a Senior Backend Engineer, you will play a key role in building and scaling our client platform, which is an award-winning innovative startup aiming to revolutionise collaboration in hybrid meetings. We have an estimated...