
Sr. AI System Infrastructure Engineer
3 weeks ago
WHAT YOU DO AT AMD CHANGES EVERYTHING
We care deeply about transforming lives with AMD technology to enrich our industry, our communities, and the world. Our mission is to build great products that accelerate next-generation computing experiences - the building blocks for the data center, artificial intelligence, PCs, gaming and embedded. Underpinning our mission is the AMD culture. We push the limits of innovation to solve the world's most important challenges. We strive for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives.
AMD together we advance_
MTS SOFTWARE DEVELOPMENT ENGINEER
THE ROLE:
AMD is looking for a specialized software engineer who is passionate about improving the performance of key applications and benchmarks. You will be a member of a core team of incredibly talented industry specialists and will work with the very latest hardware and software technology.
THE PERSON:
The ideal candidate should be passionate about software engineering and possess the leadership skills to drive sophisticated issues to resolution. They should be able to communicate effectively and work well with different teams across AMD.
KEY RESPONSIBILITIES:
- Design, develop, and optimize algorithms for collective communication operations (e.g., All-Reduce, All-to-All, Broadcast) within AMD's RCCL.
- Analyze and tune the performance of collective communication libraries on large-scale GPU clusters, focusing on latency, bandwidth, and scalability over high-speed network fabrics.
- Integrate and validate RCCL with various network transport layers and protocols, such as UEC, RoCE (RDMA over Converged Ethernet), and custom interconnects.
- Collaborate closely with hardware, driver, and machine learning framework teams to co-design and debug system-level performance issues.
- Develop robust benchmarking and profiling tools to identify and resolve bottlenecks in the communication software stack.
- Contribute to the upstream open-source RCCL project and stay current with the latest advancements in the field.
- Provide expert guidance on GPU cluster network topology and configuration to maximize collective communication performance.
PREFERRED EXPERIENCE:
- 10+ years of experience in software development with a strong focus on high-performance computing or distributed systems.
- Highly proficient in C/C++ programming and debugging in a Linux environment.
- Experience with performance analysis, profiling, and debugging of complex, distributed systems in a Linux environment.
- Proven track record of optimizing software for specific hardware architectures.
- Strong analytical and problem-solving skills, with a proven ability to diagnose and resolve complex performance issues.
- Hands-on experience with AMD ROCm RCCL or similar GPU collective communication libraries (NVIDIA NCCL, MSCCL, oneCCL) or MPI implementations (Open MPI, MPICH, or others). Experience with RoCEv2/RDMA is a huge plus.
- Effective communication and problem-solving skills.
ACADEMIC CREDENTIALS:
- Bachelor's or Master's degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent
AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants' needs under the respective laws throughout all stages of the recruitment and selection process.
-
Senior AI Engineer
2 days ago
India, BugRaid AI, Full time. Location: Hyderabad/Bangalore/Singapore. About BugRaid.AI: Incidents are the silent killers of modern enterprises. Every minute of downtime means lost revenue, lost trust, and engineers under fire. BugRaid.AI is building the world’s first enterprise-ready incident copilot: intelligent, agentic systems that can detect, diagnose, and resolve...
-
Senior AI Engineer
2 weeks ago
India, BugRaid AI, Full time. Location: Hyderabad/Bangalore/Singapore. About BugRaid.AI: Incidents are the silent killers of modern enterprises. Every minute of downtime means lost revenue, lost trust, and engineers under fire. BugRaid.AI is building the world's first enterprise-ready incident copilot: intelligent, agentic systems that can detect, diagnose, and resolve complex production...
-
Senior AI Engineer
2 weeks ago
India, BugRaid AI, Full time. Location: Hyderabad/Bangalore/Singapore. About BugRaid.AI: Incidents are the silent killers of modern enterprises. Every minute of downtime means lost revenue, lost trust, and engineers under fire. BugRaid.AI is building the world's first enterprise-ready incident copilot: intelligent, agentic systems that can detect, diagnose, and resolve complex...
-
Senior AI Engineer
7 days ago
India, BugRaid AI, Full time. Location: Hyderabad/Bangalore/Singapore. About BugRaid.AI: Incidents are the silent killers of modern enterprises. Every minute of downtime means lost revenue, lost trust, and engineers under fire. BugRaid.AI is building the world’s first enterprise-ready incident copilot: intelligent, agentic systems that can detect, diagnose, and resolve...
-
Senior AI Engineer
7 days ago
India, BugRaid AI, Full time. Location: Hyderabad/Bangalore/Singapore. About BugRaid.AI: Incidents are the silent killers of modern enterprises. Every minute of downtime means lost revenue, lost trust, and engineers under fire. BugRaid.AI is building the world’s first enterprise-ready incident copilot: intelligent, agentic systems that can detect, diagnose, and resolve complex production...
-
Senior AI Engineer
6 days ago
India, BugRaid AI, Full time. Location: Hyderabad/Bangalore/Singapore. About BugRaid.AI: Incidents are the silent killers of modern enterprises. Every minute of downtime means lost revenue, lost trust, and engineers under fire. BugRaid.AI is building the world’s first enterprise-ready incident copilot: intelligent, agentic systems that can detect, diagnose, and resolve...
-
Senior AI Engineer
6 days ago
India, BugRaid AI, Full time. Location: Hyderabad/Bangalore/Singapore. About BugRaid.AI: Incidents are the silent killers of modern enterprises. Every minute of downtime means lost revenue, lost trust, and engineers under fire. BugRaid.AI is building the world’s first enterprise-ready incident copilot: intelligent, agentic systems that can detect, diagnose, and resolve complex...
-
Staff Backend Engineer – Core AI Platform
2 weeks ago
India (Remote), Interface AI, Full time, US$ 1,50,000 - US$ 2,00,000 per year. Location: India (Remote). Function: Engineering – AI Platform. Level: Staff. Reports to: VP of Engineering / CTO. About the Role: We're hiring a Staff Backend Engineer – Core AI Platform to architect and lead the development of the foundational multi-agent infrastructure powering the next generation of intelligent systems for financial institutions. This role...
-
Sr. Engineer
2 weeks ago
India, Dentistry Automation, Full time. Dentistry Automation is expanding our engineering team as we move into AI-powered RCM automation (eligibility verification, claims, payment posting). We already have strong architecture leadership in place, and now we're bringing on a Sr. Engineer to partner with our architect and developers and accelerate delivery of next-generation AI workflows. This role...