
SRE - AI/ML Support Engineer
20 hours ago
SRE - AI/ML Support Engineer – JD
We are hiring an SRE (Site Reliability Engineer) AI/ML Support engineer for our enterprise-grade,
high-performance supercomputing platform. We help enterprises and service providers build their AI
inference platforms for end users, powered by our state-of-the-art RDU (Reconfigurable Dataflow Unit)
hardware architecture. Our cloud-agnostic MLOps platform abstracts infrastructure
complexity and enables seamless deployment, management, and scaling of foundation model workloads at
production scale. You'll contribute to the core of this platform, collaborating across teams to
ensure our systems are performant, secure, and built to last. This is a high-impact, high-visibility role
at the intersection of AI infrastructure, enterprise software, and developer experience.
Minimum Requirements:
- Foundational ML knowledge with hands-on experience working with machine learning models,
especially large language models (LLMs) and LLM APIs
- Strong programming skills in Python, including working with ML frameworks (PyTorch, Hugging Face,
LangChain, etc.) as well as building scripts and automation
- Solid understanding of Generative AI concepts (such as RAG) and applied use cases
- Exposure to Linux systems and familiarity with troubleshooting environment/setup issues
- Ability to investigate, triage, and resolve customer or internal issues related to ML workflows, APIs, and
AI-based applications
- Experience with issue tracking, documentation, and collaboration platforms (e.g., ticketing systems,
project tracking tools, knowledge bases)
- Proficiency with Docker for containerization and shell scripting for system automation
- Good communication and collaboration skills to work with cross-functional teams as well as external
customers or stakeholders
Nice to have:
- Familiarity with multi-modal models (e.g. Llama 4 Maverick)
- Familiarity with MLOps practices (monitoring, observability) and exposure to related tools
such as OpenSearch, Prometheus, and Grafana
- Strong hands-on exposure to Linux system administration and network administration, including
troubleshooting, system monitoring, and optimizing performance
- Experience working with Kubernetes (on-prem deployments preferred) for managing containerized ML
workloads
- Exposure to one or more public cloud platforms (AWS, GCP, Azure, etc.)
- Strong customer-facing communication skills to handle escalations, reliability concerns, and solution
discussions with stakeholders and clients in a B2B environment
Ways to stand out from the crowd:
- Prior experience working with APIs and SDKs of major LLM providers (OpenAI, Anthropic, Hugging
Face, etc.)
- Demonstrated ability to resolve complex issues in production ML systems
- Knowledge of fine-tuning, prompt engineering, and optimizing LLM usage in production
Job Type: Full-time
Pay: ₹500, ₹1,719,712.72 per year
Benefits:
- Provident Fund
Work Location: In person
-
SRE Engineer
5 days ago
Pune, Maharashtra, India Techno Facts Solutions Full time ₹ 5,00,000 - ₹ 15,00,000 per year
Role Overview: We are seeking an experienced Site Reliability Engineer (SRE) with a strong background in automation, monitoring, and performance optimization. The ideal candidate will be proficient in scripting (Python, Bash, ), observability tools, and incident response, ensuring reliability and scalability of enterprise applications. Key...
-
Sre & Devops Engineer
7 days ago
Pune, Maharashtra, India METRO Global Solutions Center Full time
Company Description: Metro Global Solution Center (MGSC) is the internal solution partner for METRO, a EUR 31 Billion international wholesaler with operations in more than 30 countries. The store network comprises a total of 623 stores in 21 countries, of which 522 offer out-of-store delivery (OOS), and 94 dedicated depots. In 12 countries, METRO runs only the...
-
SRE & DevOps Engineer
1 week ago
Pune, Maharashtra, India METROMAKRO Full time ₹ 1,04,000 - ₹ 1,30,878 per year
Company Description: Metro Global Solution Center (MGSC) is the internal solution partner for METRO, a €31 Billion international wholesaler with operations in more than 30 countries. The store network comprises a total of 623 stores in 21 countries, of which 522 offer out-of-store delivery (OOS), and 94 dedicated depots. In 12 countries, METRO runs only the...
-
SRE & DevOps Engineer
1 week ago
Pune, Maharashtra, India METRO Global Solution Center IN Full time ₹ 1,04,000 - ₹ 1,30,878 per year
Metro Global Solution Center (MGSC) is the internal solution partner for METRO, a €31 Billion international wholesaler with operations in more than 30 countries. The store network comprises a total of 623 stores in 21 countries, of which 522 offer out-of-store delivery (OOS), and 94 dedicated depots. In 12 countries, METRO runs only the delivery business by...
-
Sre
2 weeks ago
Pune, Maharashtra, India Hitachi Solutions Full time
Company Description: About Hitachi Solutions India Pvt Ltd. Hitachi Solutions, Ltd., headquartered in Tokyo, Japan, is a core member of Information Telecommunication Systems Company of Hitachi Group and a recognized leader in delivering proven business and IT strategies and solutions to companies across many industries. The company provides value-driven...
-
Level 2- Production Support/SRE
3 weeks ago
Pune, Maharashtra, India Triunity Software, Inc. Full time
Hi, this is Prashant from Triunity Software Inc. Please follow me on LinkedIn - https://www.linkedin.com/in/usaprashantrathore/
JD:
Job Title: Level 2 Product Support Engineer / SRE
Work Mode: Onsite (Pune, Bangalore, Chennai, Mumbai), local candidates preferred
Experience: 4+ years, 5+ years, 7+ years (Associate & Mid-Senior Level)
Type: Full-time / Contract
We...
-
Level 2- Production Support/SRE
3 weeks ago
Pune, Maharashtra, India Triunity Software, Inc. Full time
Hi, this is Prashant from Triunity Software Inc. Please follow me on LinkedIn -
JD:
Job Title: Level 2 Product Support Engineer / SRE
Work Mode: Onsite (Pune, Bangalore, Chennai, Mumbai), local candidates preferred
Experience: 4+ years, 5+ years, 7+ years (Associate & Mid-Senior Level)
Type: Full-time / Contract
We are seeking a Level 2 Product Support...
-
SRE Engineer
2 weeks ago
Pune, Maharashtra, India InfoVision Inc. Full time
Job Description
Critical Skills To Possess:
- 5+ years of Site Reliability Engineering, DevOps, or Infrastructure Engineering experience
- SRE Principles: Deep understanding of SLOs, SLIs, error budgets, and reliability engineering practices
- Incident Management: Proven experience with incident response, on-call rotations, and post-mortem processes
- Automation:...
-
SRE Engineer
2 weeks ago
Pune, Maharashtra, India InfoVision Inc. Full time ₹ 1,50,000 - ₹ 28,00,000 per year
Critical Skills To Possess:
- 5+ years of Site Reliability Engineering, DevOps, or Infrastructure Engineering experience
- SRE Principles: Deep understanding of SLOs, SLIs, error budgets, and reliability engineering practices
- Incident Management: Proven experience with incident response, on-call rotations, and post-mortem processes
- Automation: Strong scripting...
-
SRE Team Lead and Engineer
1 week ago
Pune, Maharashtra, India Apex One Full time ₹ 1,04,000 - ₹ 1,30,878 per year
Lead and mentor a team of SRE engineers, fostering a culture of reliability, efficiency, and continuous improvement. Develop and execute SRE strategies to enhance the reliability, availability, and performance of our systems and services. Design and implement observability and monitoring solutions using tools like New Relic, Azure Application Insights, AWS X-Ray,...