AI SRE
7 days ago
TCS has been a great pioneer in feeding the fire of young techies like you. We are a global leader in the technology arena, and there's nothing that can stop us from growing together.
What we are looking for
Role: AI SRE (Docker, Kubernetes, Ansible)
Experience Range: 6 – 8 Years
Location: Bangalore
Must Have:
- Production experience in SRE / Infrastructure / ops for large-scale systems
- Strong programming/scripting skills (Python, Go, Java, or equivalent)
- Deep experience with containerization (Docker), orchestration (Kubernetes, etc.)
- Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
- Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
- Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
- Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
- Solid experience in capacity planning, performance tuning, scaling, and incident response
- Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
- Experience in regulated environments (financial services, compliance, audit, security) is a strong plus
- Excellent communication, documentation, and cross-team collaboration skills
- Proven track record of reducing operational toil via automation
Good to Have:
- Understanding of SRE techniques.
- Proficiency with OpenTelemetry and related observability tools, including Grafana, Loki, Prometheus, and Cortex.
- Good knowledge of microservice-based architecture and industry standards for both public and private cloud.
- Knowledge of data pipeline technologies (Kafka, Spark, Flink, etc.)
- Good knowledge of various database engines and data stores (SQL, Redis, Kafka, Snowflake, etc.) for cloud application storage.
- Experience working with Generative AI development, embeddings, and fine-tuning of Generative AI models.
- Experience in high-performance computing (HPC), distributed GPU cluster scheduling (e.g. Slurm, Kubernetes GPU scheduling)
- Understanding of ModelOps / MLOps / LLMOps.
- Experience with chaos engineering, canary deployments, blue/green rollouts
Essential:
- Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)
- Design and build automation for core platform capabilities, reducing manual toil
- Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.
- Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards
- Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation
- Perform capacity planning, scaling strategies, workload scheduling, and resource forecasting
- Optimize cost vs. performance tradeoffs in large-scale compute environments
- Harden systems for security, compliance, auditability, and data governance
- Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems
- Define disaster recovery (DR) strategies, backup/restore practices, and fault tolerance mechanisms
- Maintain runbooks, operational playbooks, documentation, and training materials
- Participate in on-call rotations and respond to production incidents 24/7 as needed
- Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability
Minimum Qualification:
- 15 years of full-time education
- Minimum of 50% in 10th, 12th, UG & PG (if applicable)
-
Senior SRE
7 days ago
Bengaluru, Karnataka, India Red Hat Full time ₹ 15,00,000 - ₹ 25,00,000 per year
The IT AI Application Platform team is seeking a Senior Site Reliability Engineer (SRE) to develop, scale, and operate our AI Application Platform based on Red Hat technologies, including OpenShift AI (RHOAI) and Red Hat Enterprise Linux AI (RHEL AI). As an SRE you will contribute to running core AI services at scale by enabling customer self-service, making...
-
Java SRE
7 days ago
Bengaluru, Karnataka, India Tata Consultancy Services Full time ₹ 15,00,000 - ₹ 25,00,000 per year
Key Responsibilities: Provide SRE/Production Support for critical systems, ensuring high availability and reliability. Develop and maintain automation scripts using UNIX/Shell scripting. Troubleshoot and resolve production issues to minimize downtime and improve system performance. Implement and manage observability tools such as Splunk, Grafana, and ELK for...
-
SRE & DevOps (ML Framework) - AI Platform
5 days ago
Bengaluru, Karnataka, India ITC Infotech Full time ₹ 20,00,000 - ₹ 25,00,000 per year
SRE & DevOps (ML Framework) - AI Platform
Location: Bangalore
Mode: Hybrid
Required Skills:
- Demonstrated ability in designing, building, refactoring and releasing software written in Python.
- Hands-on experience with ML frameworks such as PyTorch, TensorFlow, Triton.
- Ability to handle framework-related issues, version upgrades, and compatibility with...
-
AI Platform
2 weeks ago
Bengaluru, Karnataka, India Infogrowth Full time ₹ 12,00,000 - ₹ 36,00,000 per year
We're seeking an experienced SRE & DevOps Engineer to support eBay's AI Platform. You'll design, automate, and manage high-availability systems, CI/CD pipelines, and DevOps practices for ML infrastructure.
Required Candidate Profile: Experienced SRE/DevOps Engineer skilled in Kubernetes, Docker, CI/CD, and automation. Strong debugging, scripting,...
-
Staff SRE, Application SRE
2 weeks ago
Bengaluru, Karnataka, India Netskope Full time ₹ 20,00,000 - ₹ 25,00,000 per year
About Netskope
Today, there's more data and users outside the enterprise than inside, causing the network perimeter as we know it to dissolve. We realized a new perimeter was needed, one that is built in the cloud and follows and protects data wherever it goes, so we started Netskope to redefine Cloud, Network and Data Security. Since 2012, we have built the...
-
Principal SRE
7 days ago
Bengaluru, Karnataka, India Red Hat Full time ₹ 12,00,000 - ₹ 36,00,000 per year
About The Job
The IT AI Application Platform team is seeking a Principal Senior Site Reliability Engineer (SRE) to design, develop, scale, and operate our AI Application Platform based on Red Hat technologies, including OpenShift AI (RHOAI) and Red Hat Enterprise Linux AI (RHEL AI). As a Principal SRE you will contribute to running core AI services at scale...
-
Sr. / Staff SRE, Application SRE
1 day ago
Bengaluru, Karnataka, India Netskope Full time ₹ 1,50,00,000 - ₹ 2,50,00,000 per year
About Netskope
Today, there's more data and users outside the enterprise than inside, causing the network perimeter as we know it to dissolve. We realized a new perimeter was needed, one that is built in the cloud and follows and protects data wherever it goes, so we started Netskope to redefine Cloud, Network and Data Security. Since 2012, we have built...
-
SRE & DevOps Engineer (with Java/MLOps)
5 days ago
Bengaluru, Karnataka, India N-iX Full time ₹ 9,00,000 - ₹ 12,00,000 per year
N-iX is a global software development company founded in 2002, connecting 2,400+ tech professionals across 40+ countries. We deliver innovative technology solutions in cloud computing, data analytics, AI, embedded software, IoT, and more to global industry leaders and Fortune 500 companies. Join us to create technology that drives real change for...
-
DevOps/SRE
1 week ago
Bengaluru, Karnataka, India Selector Software Full time ₹ 12,00,000 - ₹ 36,00,000 per year
Selector is building an operational intelligence platform for digital infrastructure. By adopting an AI/ML-based analytics approach, the platform provides actionable multi-dimensional insights to network, cloud and application operators. It enables operations teams to meet their KPIs through seamless collaboration, search-driven conversational user...
-
AI Data Platform Reliability
7 days ago
Bengaluru, Karnataka, India Oracle Full time ₹ 12,00,000 - ₹ 36,00,000 per year
Responsibilities
- Design, develop, and execute end-to-end (E2E) scenario validations that simulate real-world usage of complex AI data platform workflows (data ingestion, transformation, ML pipeline orchestration, etc.).
- Collaborate closely with product, engineering, and field teams to identify gaps in coverage and propose test automation strategies.
- Develop and...