SRE/AIOps Manager

1 day ago


Chennai, India DXC Technology Full time

Job Description: Site Reliability Engineering Manager - AI, Analytics & Automation Services

We are seeking an experienced Site Reliability Engineering (SRE) Manager to lead our reliability engineering efforts across our AI, Analytics, and Automation services portfolio. You will spearhead the development of comprehensive observability pipelines, intelligent monitoring systems, and automated resolution frameworks while building and managing a high-performing team of SREs. This role is critical to ensuring the reliability, scalability, and performance of our AI solutions, Databricks analytics platform, and UiPath automation infrastructure.

Key Responsibilities

Observability & Monitoring Excellence
- Design and implement end-to-end observability pipelines spanning AI solutions, data processing workflows, and automation execution environments
- Establish comprehensive monitoring strategies for AI model performance, drift detection, data quality, and service health across Databricks and UiPath platforms
- Build real-time dashboards and alerting systems that provide actionable insights into system performance, resource utilization, and service reliability
- Develop custom metrics and KPIs specific to AI/ML workloads, including model accuracy, latency, throughput, and resource consumption
- Implement distributed tracing and logging solutions to enable rapid troubleshooting across complex AI and automation pipelines

Automated Resolution & Self-Healing Systems
- Architect and deploy automated incident response systems that can detect, diagnose, and resolve common reliability issues without human intervention
- Build intelligent event-triggered runbook automation
- Implement chaos engineering practices to proactively identify and strengthen system weaknesses
- Develop automated remediation workflows for infrastructure issues, service degradations, and capacity constraints
- Create self-healing mechanisms for AI inference services, data pipeline failures, and automation workflow interruptions

Team Leadership & Development
- Build, mentor, and lead a team of Site Reliability Engineers with expertise in AI/ML operations, data platforms, and automation technologies
- Establish SRE best practices, standards, and processes tailored to AI and automation workloads
- Foster a culture of reliability engineering, continuous improvement, and data-driven decision making
- Conduct regular performance reviews, career development discussions, and technical skill assessments
- Collaborate with engineering teams to embed reliability principles into the software development lifecycle

Platform Reliability & Performance
- Ensure near zero downtime and optimal performance of AI solutions, Databricks analytics workloads, and UiPath automation processes
- Design and implement disaster recovery and business continuity plans for critical AI and automation services
- Optimize resource allocation and cost management across cloud infrastructure supporting AI, analytics, and automation workloads
- Establish and maintain service level objectives (SLOs) and error budgets for all managed services
- Drive capacity planning initiatives to support growing AI model deployment and automation scale requirements

Cross-Functional Collaboration
- Partner with AI/ML developers to integrate reliability considerations into AI solutions and deployment pipelines
- Work closely with data engineering teams to ensure robust, monitored data flows within Databricks environments
- Collaborate with automation developers to build resilient UiPath bot deployment and execution frameworks
- Interface with security teams to implement observability solutions that maintain compliance and data protection standards

Required Qualifications
- 7+ years of experience in Site Reliability Engineering, DevOps, or similar infrastructure roles
- 2+ years of management experience leading technical teams
- Hands-on experience with observability tools such as Prometheus, Grafana, ELK Stack, Datadog, or New Relic
- Proficiency in Infrastructure as Code (Terraform, CloudFormation, Ansible)
- Strong scripting and automation skills (Python, Go, Bash, PowerShell)
- Familiarity with Databricks platform administration, cluster management, and workflow orchestration
- Knowledge of UiPath platform architecture, orchestrator management, and bot deployment strategies
- Understanding of data pipeline monitoring, data quality validation, and ETL/ELT process reliability
- Experience with ML model monitoring, A/B testing infrastructure, and feature store management
- Proven track record of building and scaling high-performing engineering teams
- Strong analytical and problem-solving skills with ability to troubleshoot complex distributed systems
- Excellent communication skills with ability to present technical concepts to executive stakeholders
- Experience driving cross-functional initiatives and influencing without direct authority
- Demonstrated ability to balance operational excellence with strategic innovation

At DXC Technology, we believe strong connections and community are key to our success. Our work model prioritizes in-person collaboration while offering flexibility to support wellbeing, productivity, individual work styles, and life circumstances. We’re committed to fostering an inclusive environment where everyone can thrive.


  • AIOps Engineer

    1 day ago


    Chennai, India Virtusa Full time

    AIOps Engineer - CREQ Description: AIOps Engineer / DevOps – AIOps. Experience: 6 - 9 years. Roles and responsibilities: Create, define, build, and manage the AIOps platform. Build customer-specific solutions and use cases, with the ability to present solutions and demos. Good understanding of AIOps, AI/ML technologies, ITOM, ITSM, and monitoring tools. Have an in-depth...

  • SRE/AIOps Manager

    1 week ago


    Chennai, Tamil Nadu, India DXC Technology Full time ₹ 20,00,000 - ₹ 25,00,000 per year

    Job Description: Site Reliability Engineering Manager - AI, Analytics & Automation Services. We are seeking an experienced Site Reliability Engineering (SRE) Manager to lead our reliability engineering efforts across our AI, Analytics, and Automation services portfolio. You will spearhead the development of comprehensive observability pipelines, intelligent...

  • SRE/AIOps Manager

    1 week ago


    Chennai, Tamil Nadu, India DXC Technologies Full time ₹ 12,00,000 - ₹ 36,00,000 per year

    Job Description: Site Reliability Engineering Manager - AI, Analytics & Automation Services. We are seeking an experienced Site Reliability Engineering (SRE) Manager to lead our reliability engineering efforts across our AI, Analytics, and Automation services portfolio. You will spearhead the development of comprehensive observability pipelines,...

  • SRE Lead Consultant

    1 week ago


    Bengaluru, Chennai, Hyderabad, India Krazy Mantra HR Solutions Pvt. Ltd Full time ₹ 20,00,000 - ₹ 25,00,000 per year

    We are looking for a skilled SRE Lead Consultant and SRE Principal Consultant with 8 to 10 years of experience. The ideal candidate should have expertise in SRE concepts such as SLO, SLI, and error budgeting; deployment experience with APM and cloud monitoring tools; Git and code-review systems; change management; Agile; ITIL concepts; SOP creation; and...


  • Bengaluru, Chennai, Hyderabad, India Krazy Mantra HR Solutions Pvt. Ltd Full time ₹ 15,00,000 - ₹ 25,00,000 per year

    We are looking for a skilled Solution Architect with expertise in AIOps to join our team in Bangalore, Hyderabad, Chennai, Pune, and Greater Noida. The ideal candidate will have 6+ years of hands-on delivery experience on products like RPA, Chat Bots, Orchestrators, and Application Performance Management tools. Roles and Responsibility: Design and implement...


  • Chennai, Tamil Nadu, India Ford Global Career Site Full time ₹ 15,00,000 - ₹ 25,00,000 per year

    Enterprise Technology plays a critical part in shaping the future of mobility. If you're looking for the chance to leverage advanced technology to redefine the transportation landscape, enhance the customer experience and improve people's lives, this is the opportunity for you. Join us and challenge your IT expertise and analytical skills to help create...

  • Cloud SRE

    1 week ago


    Chennai, Tamil Nadu, India Ford Motor Company Full time ₹ 12,00,000 - ₹ 36,00,000 per year

    Be at the Forefront of Mobility's Future: Join Ford as a Site Reliability Engineer. Enterprise Technology is the engine driving the future of transportation, and we're looking for a talented Site Reliability Engineer (SRE) to help us redefine mobility. In this role, you'll leverage cutting-edge technology to enhance customer experiences, improve lives, and...

  • Mid-Level SRE

    1 week ago


    Chennai, Tamil Nadu, India Suzva Software Technologies Full time ₹ 9,00,000 - ₹ 12,00,000 per year

    Mid-Level SRE/DevOps Engineer (C2H) | Onsite - Coimbatore. Azure DevOps. Automate infra with Terraform (IaC). Monitor & optimize systems using Datadog, Prometheus, Grafana. Position: Mid-Level SRE/DevOps Engineer. Experience: 5-6 Years. Openings: 3. Location: Coimbatore (Onsite). Engagement Type: Contract-to-Hire (C2H). Contract Duration: 6 months to 1 year (based on...


  • Chennai, Tamil Nadu, India Ford Motor Company Full time ₹ 10,00,000 - ₹ 25,00,000 per year

    Enterprise Technology plays a critical part in shaping the future of mobility. If you're looking for the chance to leverage advanced technology to redefine the transportation landscape, enhance the customer experience and improve people's lives, this is the opportunity for you. Join us and challenge your IT expertise and analytical skills to help create...


  • Chennai, Tamil Nadu, India Ford Motor Company Full time ₹ 1,20,000 - ₹ 1,50,000 per year

    Enterprise Technology plays a critical part in shaping the future of mobility. If you're looking for the chance to leverage advanced technology to redefine the transportation landscape, enhance the customer experience and improve people's lives, this is the opportunity for you. Join us and challenge your IT expertise and analytical skills to help create...