AI SRE

7 days ago

Bengaluru, Karnataka, India Tata Consultancy Services Full time ₹ 15,00,000 - ₹ 30,00,000 per year

TCS has been a great pioneer in feeding the fire of young techies like you. We are a global leader in the technology arena, and there's nothing that can stop us from growing together.

What we are looking for

Role: AI SRE (Docker, Kubernetes, Ansible)

Experience Range: 6 – 8 Years

Location: Bangalore

Must Have:

Production experience in SRE / Infrastructure / ops for large-scale systems
Strong programming/scripting skills (Python, Go, Java, or equivalent)
Deep experience with containerization (Docker), orchestration (Kubernetes, etc.)
Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
Solid experience in capacity planning, performance tuning, scaling, and incident response
Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
Experience in regulated environments (financial services, compliance, audit, security) is a strong plus
Excellent communication, documentation, and cross-team collaboration skills
Proven track record of reducing operational toil via automation

Good to Have:

Understanding of SRE techniques.
Proficiency with OpenTelemetry and observability tools such as Grafana, Loki, Prometheus, and Cortex.
Good knowledge of microservice-based architecture and industry standards for both public and private cloud.
Knowledge of data pipeline technologies (Kafka, Spark, Flink, etc.)
Good knowledge of various data stores (SQL databases, Redis, Kafka, Snowflake, etc.) for cloud application storage.
Experience with Generative AI development, including embeddings and fine-tuning of Generative AI models.
Experience in high-performance computing (HPC), distributed GPU cluster scheduling (e.g. Slurm, Kubernetes GPU scheduling)
Understanding of ModelOps / MLOps / LLMOps.
Experience with chaos engineering, canary deployments, blue/green rollouts

Essential:

Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)
Design and build automation for core platform capabilities, reducing manual toil
Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.
Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards
Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation
Perform capacity planning, scaling strategies, workload scheduling, and resource forecasting
Optimize cost vs. performance tradeoffs in large-scale compute environments
Harden systems for security, compliance, auditability, and data governance
Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems
Define disaster recovery (DR) strategies, backup/restore practices, fault tolerance mechanisms
Maintain runbooks, operational playbooks, documentation, and training materials
Participate in on-call rotations and respond to production incidents 24/7 as needed
Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability

Minimum Qualification:

15 years of full-time education
Minimum of 50% marks in 10th, 12th, UG & PG (if applicable)

Senior SRE

7 days ago

Bengaluru, Karnataka, India Red Hat Full time ₹ 15,00,000 - ₹ 25,00,000 per year

The IT AI Application Platform team is seeking a Senior Site Reliability Engineer (SRE) to develop, scale, and operate our AI Application Platform based on Red Hat technologies, including OpenShift AI (RHOAI) and Red Hat Enterprise Linux AI (RHEL AI). As an SRE you will contribute to running core AI services at scale by enabling customer self-service, making...

Java SRE

7 days ago

Bengaluru, Karnataka, India Tata Consultancy Services Full time ₹ 15,00,000 - ₹ 25,00,000 per year

Key Responsibilities: Provide SRE/Production Support for critical systems, ensuring high availability and reliability. Develop and maintain automation scripts using UNIX/Shell scripting. Troubleshoot and resolve production issues to minimize downtime and improve system performance. Implement and manage observability tools such as Splunk, Grafana, and ELK for...

SRE & DevOps (ML Framework) - AI Platform

5 days ago

Bengaluru, Karnataka, India ITC Infotech Full time ₹ 20,00,000 - ₹ 25,00,000 per year

SRE & DevOps (ML Framework) - AI Platform
Location: Bangalore
Mode: Hybrid
Required Skills:
● Demonstrated ability in designing, building, refactoring and releasing software written in Python.
● Hands-on experience with ML frameworks such as PyTorch, TensorFlow, Triton.
● Ability to handle framework-related issues, version upgrades, and compatibility with...

AI Platform

2 weeks ago

Bengaluru, Karnataka, India Infogrowth Full time ₹ 12,00,000 - ₹ 36,00,000 per year

We're seeking an experienced SRE & DevOps Engineer to support eBay's AI Platform. You'll design, automate, and manage high-availability systems, CI/CD pipelines, and DevOps practices for ML infrastructure.
Required Candidate Profile: Experienced SRE/DevOps Engineer skilled in Kubernetes, Docker, CI/CD, and automation. Strong debugging, scripting,...

Staff SRE, Application SRE

2 weeks ago

Bengaluru, Karnataka, India Netskope Full time ₹ 20,00,000 - ₹ 25,00,000 per year

About Netskope
Today, there's more data and users outside the enterprise than inside, causing the network perimeter as we know it to dissolve. We realized a new perimeter was needed, one that is built in the cloud and follows and protects data wherever it goes, so we started Netskope to redefine Cloud, Network and Data Security. Since 2012, we have built the...

Principal SRE

7 days ago

Bengaluru, Karnataka, India Red Hat Full time ₹ 12,00,000 - ₹ 36,00,000 per year

About The Job
The IT AI Application Platform team is seeking a Principal Senior Site Reliability Engineer (SRE) to design, develop, scale, and operate our AI Application Platform based on Red Hat technologies, including OpenShift AI (RHOAI) and Red Hat Enterprise Linux AI (RHEL AI). As a Principal SRE you will contribute to running core AI services at scale...

Sr. / Staff SRE, Application SRE

1 day ago

Bengaluru, Karnataka, India Netskope Full time ₹ 1,50,00,000 - ₹ 2,50,00,000 per year

About Netskope
Today, there's more data and users outside the enterprise than inside, causing the network perimeter as we know it to dissolve. We realized a new perimeter was needed, one that is built in the cloud and follows and protects data wherever it goes, so we started Netskope to redefine Cloud, Network and Data Security. Since 2012, we have built...

SRE & DevOps Engineer (with Java/MLOps)

5 days ago

Bengaluru, Karnataka, India N-iX Full time ₹ 9,00,000 - ₹ 12,00,000 per year

N-iX is a global software development company founded in 2002, connecting 2,400+ tech professionals across 40+ countries. We deliver innovative technology solutions in cloud computing, data analytics, AI, embedded software, IoT, and more to global industry leaders and Fortune 500 companies. Join us to create technology that drives real change for...

DevOps/SRE

1 week ago

Bengaluru, Karnataka, India Selector Software Full time ₹ 12,00,000 - ₹ 36,00,000 per year

Selector is building an operational intelligence platform for digital infrastructure. By adopting an AI/ML based analytics approach, the platform provides actionable multi-dimensional insights to network, cloud and application operators. It enables operations teams to meet their KPIs through seamless collaboration, search-driven conversational user...

AI Data Platform Reliability

7 days ago

Bengaluru, Karnataka, India Oracle Full time ₹ 12,00,000 - ₹ 36,00,000 per year

Responsibilities:
Design, develop, and execute end-to-end (E2E) scenario validations that simulate real-world usage of complex AI data platform workflows (data ingestion, transformation, ML pipeline orchestration, etc.).
Collaborate closely with product, engineering, and field teams to identify gaps in coverage and propose test automation strategies.
Develop and...
