SRE/AIOps Manager

2 days ago

Chennai, Tamil Nadu, India DXC Technology Full time ₹ 20,00,000 - ₹ 25,00,000 per year

Job Description:

Site Reliability Engineering Manager - AI, Analytics & Automation Services

We are seeking an experienced Site Reliability Engineering (SRE) Manager to lead our reliability engineering efforts across our AI, Analytics, and Automation services portfolio. You will spearhead the development of comprehensive observability pipelines, intelligent monitoring systems, and automated resolution frameworks while building and managing a high-performing team of SREs. This role is critical to ensuring the reliability, scalability, and performance of our AI solutions, Databricks analytics platform, and UiPath automation infrastructure.

Key Responsibilities

Observability & Monitoring Excellence

Design and implement end-to-end observability pipelines spanning AI solutions, data processing workflows, and automation execution environments
Establish comprehensive monitoring strategies for AI model performance, drift detection, data quality, and service health across Databricks and UiPath platforms
Build real-time dashboards and alerting systems that provide actionable insights into system performance, resource utilization, and service reliability
Develop custom metrics and KPIs specific to AI/ML workloads, including model accuracy, latency, throughput, and resource consumption
Implement distributed tracing and logging solutions to enable rapid troubleshooting across complex AI and automation pipelines

Automated Resolution & Self-Healing Systems

Architect and deploy automated incident response systems that can detect, diagnose, and resolve common reliability issues without human intervention
Build intelligent event-triggered runbook automation
Implement chaos engineering practices to proactively identify and strengthen system weaknesses
Develop automated remediation workflows for infrastructure issues, service degradations, and capacity constraints
Create self-healing mechanisms for AI inference services, data pipeline failures, and automation workflow interruptions

Team Leadership & Development

Build, mentor, and lead a team of Site Reliability Engineers with expertise in AI/ML operations, data platforms, and automation technologies
Establish SRE best practices, standards, and processes tailored to AI and automation workloads
Foster a culture of reliability engineering, continuous improvement, and data-driven decision-making
Conduct regular performance reviews, career development discussions, and technical skill assessments
Collaborate with engineering teams to embed reliability principles into the software development lifecycle

Platform Reliability & Performance

Ensure near-zero downtime and optimal performance of AI solutions, Databricks analytics workloads, and UiPath automation processes
Design and implement disaster recovery and business continuity plans for critical AI and automation services
Optimize resource allocation and cost management across cloud infrastructure supporting AI, analytics, and automation workloads
Establish and maintain service level objectives (SLOs) and error budgets for all managed services
Drive capacity planning initiatives to support growing AI model deployment and automation scale requirements

Cross-Functional Collaboration

Partner with AI/ML developers to integrate reliability considerations into AI solutions and deployment pipelines
Work closely with data engineering teams to ensure robust, monitored data flows within Databricks environments
Collaborate with automation developers to build resilient UiPath bot deployment and execution frameworks
Interface with security teams to implement observability solutions that maintain compliance and data protection standards

Required Qualifications

7+ years of experience in Site Reliability Engineering, DevOps, or similar infrastructure roles
2+ years of management experience leading technical teams
Hands-on experience with observability tools such as Prometheus, Grafana, ELK Stack, Datadog, or New Relic
Proficiency in Infrastructure as Code (Terraform, CloudFormation, Ansible)
Strong scripting and automation skills (Python, Go, Bash, PowerShell)
Familiarity with Databricks platform administration, cluster management, and workflow orchestration
Knowledge of UiPath platform architecture, Orchestrator management, and bot deployment strategies
Understanding of data pipeline monitoring, data quality validation, and ETL/ELT process reliability
Experience with ML model monitoring, A/B testing infrastructure, and feature store management
Proven track record of building and scaling high-performing engineering teams
Strong analytical and problem-solving skills, with the ability to troubleshoot complex distributed systems
Excellent communication skills, with the ability to present technical concepts to executive stakeholders
Experience driving cross-functional initiatives and influencing without direct authority
Demonstrated ability to balance operational excellence with strategic innovation

At DXC Technology, we believe strong connections and community are key to our success. Our work model prioritizes in-person collaboration while offering flexibility to support wellbeing, productivity, individual work styles, and life circumstances. We're committed to fostering an inclusive environment where everyone can thrive.

Recruitment fraud is a scheme in which fictitious job opportunities are offered to job seekers, typically through online services such as false websites or through unsolicited emails claiming to be from the company. These emails may ask recipients to provide personal information or to make payments as part of an illegitimate recruiting process. DXC does not make offers of employment via social media networks, never asks applicants for money or payments at any point in the recruitment process, and never asks a job seeker to purchase IT or other equipment on its behalf. More information on employment scams is available here.

SRE - Software Engineer

6 days ago

Chennai, Tamil Nadu, India Ford Global Career Site Full time ₹ 15,00,000 - ₹ 25,00,000 per year

Enterprise Technology plays a critical part in shaping the future of mobility. If you're looking for the chance to leverage advanced technology to redefine the transportation landscape, enhance the customer experience and improve people's lives, this is the opportunity for you. Join us and challenge your IT expertise and analytical skills to help create...
Cloud SRE

2 days ago

Chennai, Tamil Nadu, India Ford Motor Company Full time ₹ 12,00,000 - ₹ 36,00,000 per year

Be at the Forefront of Mobility's Future: Join Ford as a Site Reliability Engineer. Enterprise Technology is the engine driving the future of transportation, and we're looking for a talented Site Reliability Engineer (SRE) to help us redefine mobility. In this role, you'll leverage cutting-edge technology to enhance customer experiences, improve lives, and...
SRE - Software Engineer

6 days ago

Chennai, Tamil Nadu, India Ford Motor Company Full time ₹ 10,00,000 - ₹ 25,00,000 per year

Enterprise Technology plays a critical part in shaping the future of mobility. If you're looking for the chance to leverage advanced technology to redefine the transportation landscape, enhance the customer experience and improve people's lives, this is the opportunity for you. Join us and challenge your IT expertise and analytical skills to help create...
SRE & Observability Administrator

2 weeks ago

Chennai, Tamil Nadu, India SARIKA MARKETING Full time ₹ 5,00,000 - ₹ 15,00,000 per year

We are hiring for an SRE & Observability Administrator. Role Description: This is a full-time, on-site SRE & Observability Administrator position located in Chennai. The role will involve ensuring high availability and reliability of systems, implementing and managing observability solutions, and conducting thorough troubleshooting. The professional will also...
SRE Software Engineer

15 hours ago

Chennai, Tamil Nadu, India Ford Motor Company Full time ₹ 1,20,000 - ₹ 1,50,000 per year

Enterprise Technology plays a critical part in shaping the future of mobility. If you're looking for the chance to leverage advanced technology to redefine the transportation landscape, enhance the customer experience and improve people's lives, this is the opportunity for you. Join us and challenge your IT expertise and analytical skills to help create...
Senior Service Delivery Manager

4 days ago

Chennai, Tamil Nadu, India Zensar Technologies Full time ₹ 12,00,000 - ₹ 36,00,000 per year

What's this role about? We are looking for a Service Delivery Manager to lead Product Sustenance / Application Management Services using an AI-first Next Gen AMS model, integrating SRE and KTLO practices. The role focuses on delivering stable operations, Application support services (L2, L3 and Minor Enhancements) driving automation, and transformation by embedding...
SRE DevOps Engineer

2 days ago

Chennai, Tamil Nadu, India Hexaware Technologies Full time ₹ 15,00,000 - ₹ 25,00,000 per year

DevOps/SRE ensures reliability, scalability, and automation for cloud infrastructure and CI/CD pipelines on GCP Cloud. Core Responsibilities: Build and maintain CI/CD pipelines. Provision cloud resources using Terraform. Implement monitoring, alerting, and observability systems. Support incident management and root cause analysis. Ensure production reliability...
SRE - Software Engineer

3 days ago

Chennai, Tamil Nadu, India Ford Motor Full time ₹ 80,000 - ₹ 12,00,000 per year

Description: Enterprise Technology plays a critical part in shaping the future of mobility. If you're looking for the chance to leverage advanced technology to redefine the transportation landscape, enhance the customer experience and improve people's lives, this is the opportunity for you. Join us and challenge your IT expertise and analytical skills to help...
SRE Application Support Lead

2 weeks ago

Chennai, Tamil Nadu, India TransUnion Full time ₹ 12,00,000 - ₹ 24,00,000 per year

What We'll Bring: We are seeking a highly skilled and motivated SRE Application Support Lead / Sr. Lead to join our 24x7 support team. This role is critical to ensuring the stability, performance, and reliability of mission-critical applications deployed across modern platforms including Docker, Kubernetes, and cloud...
SRE Application Support Lead

2 weeks ago

Chennai, Tamil Nadu, India TransUnion Full time ₹ 8,00,000 - ₹ 12,00,000 per year

What We'll Bring: We are seeking a highly skilled and motivated SRE Application Support Lead / Sr. Lead to join our 24x7 support team. This role is critical to ensuring the stability, performance, and reliability of mission-critical applications deployed across modern platforms including Docker, Kubernetes, and cloud...
