AI SRE
3 days ago
TCS has been a great pioneer in feeding the fire of young techies like you. We are a global leader in the technology arena, and there’s nothing that can stop us from growing together.

What we are looking for

Role: AI SRE (Docker, Kubernetes, Ansible)
Experience Range: 6 – 8 Years
Location: Bangalore

Must Have:
• Production experience in SRE / infrastructure / ops for large-scale systems
• Strong programming/scripting skills (Python, Go, Java, or equivalent)
• Deep experience with containerization (Docker) and orchestration (Kubernetes, etc.)
• Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
• Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
• Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
• Networking and systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
• Solid experience in capacity planning, performance tuning, scaling, and incident response
• Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
• Experience in regulated environments (financial services, compliance, audit, security) is a strong plus
• Excellent communication, documentation, and cross-team collaboration skills
• Proven track record of reducing operational toil via automation

Good to Have:
• Understanding of SRE techniques
• Proficiency with OpenTelemetry tools including Grafana, Loki, Prometheus, and Cortex
• Good knowledge of microservice-based architecture and industry standards for both public and private cloud
• Knowledge of data pipeline technologies (Kafka, Spark, Flink, etc.)
• Good knowledge of various DB engines (SQL, Redis, Kafka, Snowflake, etc.) for cloud app storage
• Experience working with Generative AI development, embeddings, and fine-tuning of Generative AI models
• Experience in high-performance computing (HPC) and distributed GPU cluster scheduling (e.g. Slurm, Kubernetes GPU scheduling)
• Understanding of ModelOps / MLOps / LLMOps
• Experience with chaos engineering, canary deployments, and blue/green rollouts

Essential:
• Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)
• Design and build automation for core platform capabilities, reducing manual toil
• Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.
• Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards
• Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation
• Perform capacity planning, scaling strategies, workload scheduling, and resource forecasting
• Optimize cost vs. performance tradeoffs in large-scale compute environments
• Harden systems for security, compliance, auditability, and data governance
• Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems
• Define disaster recovery (DR) strategies, backup/restore practices, and fault tolerance mechanisms
• Maintain runbooks, operational playbooks, documentation, and training materials
• Participate in on-call rotations and respond to production incidents 24/7 as needed
• Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability

Minimum Qualification:
• 15 years of full-time education
• Minimum percentile of 50% in 10th, 12th, UG & PG (if applicable)
-
SRE & DevOps Engineer (ML/AI Platform)
1 week ago
Bangalore, India Prospance Inc Full time
SRE & DevOps Engineer (ML/AI Platform) Contract Position | Global E-Commerce Leader | Hybrid About the Opportunity We're partnering with a leading global e-commerce company to find an exceptional SRE & DevOps Engineer to join their AI Platform Team. This is your chance to shape the future of machine learning infrastructure that powers innovation for millions of...
-
SRE & DevOps Engineer (ML/AI Platform)
1 week ago
Bangalore, India Prospance Inc Full time
SRE & DevOps Engineer (ML/AI Platform) Contract Position | Global E-Commerce Leader | Hybrid About the Opportunity We're partnering with a leading global e-commerce company to find an exceptional SRE & DevOps Engineer to join their AI Platform Team. This is your chance to shape the future of machine learning infrastructure that powers innovation for millions...
-
SRE & DevOps Engineer (ML/AI Platform)
1 week ago
Bangalore district, India Prospance Inc Full time
SRE & DevOps Engineer (ML/AI Platform) Contract Position | Global E-Commerce Leader | Hybrid About the Opportunity We're partnering with a leading global e-commerce company to find an exceptional SRE & DevOps Engineer to join their AI Platform Team. This is your chance to shape the future of machine learning infrastructure that powers innovation for millions...
-
SRE / DevOps Platform Engineer
1 week ago
Bangalore, India Prospance Inc Full time
SRE & DevOps Engineer (ML/AI Platform) Contract Position | Global E-Commerce Leader | Hybrid We're partnering with a leading global e-commerce company to find an exceptional SRE & DevOps Engineer to join their AI Platform Team. This is your chance to shape the future of machine learning infrastructure that powers innovation for millions of users worldwide....
-
Senior DevOps Engineer
1 day ago
Bangalore, India MightyBot Full time
Title: Senior DevOps Engineer (SRE) Location: Remote Join our team as a Senior DevOps Engineer, where we're focused on graduating AI from interesting demos to indispensable products. You will build and maintain the robust, scalable infrastructure that makes this possible, ensuring our platform is reliable enough to be trusted with critical business...
-
SRE DevOps Manager
1 day ago
Bangalore, India Infinite Computer Solutions Full time
We are looking for a Site Reliability Engineering (SRE) DevOps Manager
Location: Bangalore / Hyderabad / Chennai / Noida / Pune / Visakhapatnam / Gurgaon
Shift timing: Regular
Can join: Immediate - 30 days
Interested candidates, please share your profiles and the details below to
Email ID: Shanmukh.Varma@infinite.com
Total experience:
Relevant Experience:
Current...
-
SRE & DevOps Engineer (Node.js)
1 week ago
Bangalore, India Prospance Inc Full time
About the Opportunity We're partnering with a leading global e-commerce company to find an exceptional SRE & DevOps Engineer with strong Node.js and UI development expertise. Join their AI Platform Team and build the developer-facing tools and infrastructure that empower researchers and data scientists worldwide. In this unique role, you'll bridge backend...
-
Senior DevOps Engineer
2 hours ago
Bangalore, India MightyBot Full time
Title: Senior DevOps Engineer (SRE) Location: Remote Join our team as a Senior DevOps Engineer, where we're focused on graduating AI from interesting demos to indispensable products. You will build and maintain the robust, scalable infrastructure that makes this possible, ensuring our platform is reliable enough to be trusted with critical business...
-
SRE DevOps Lead
37 minutes ago
Bangalore, India Infinite Computer Solutions Full time
We are looking for a Site Reliability/Cloud Engineer DevOps Lead / SSE
Experience: 6 years - 12 years
Can join: Immediate - 30 days
Shift timing: Regular
Location: Bangalore / Hyderabad / Chennai / Noida / Pune / Gurgaon / Visakhapatnam
Interested candidates, please share your profiles and the details below to
Email ID:
Total experience:
Relevant Experience:...
-
SRE DevOps Manager
34 minutes ago
Bangalore, India Infinite Computer Solutions Full time
We are looking for a Site Reliability Engineering (SRE) DevOps Manager
Location: Bangalore / Hyderabad / Chennai / Noida / Pune / Visakhapatnam / Gurgaon
Shift timing: Regular
Can join: Immediate - 30 days
Interested candidates, please share your profiles and the details below to
Email ID:
Total experience:
Relevant Experience:
Current CTC:
Expected CTC:
Notice...