Site Reliability Engineering Manager
2 weeks ago
SRE CORE Manager/Lead/Director
Key Responsibilities:
Strategic Leadership & Vision:
· Lead and manage the Software Release Management function for all Data and AI products.
· Establish a centralized release management framework for AI and data products that scales with the growing product portfolio.
· Form and lead a high-performing Site Reliability Engineering (SRE) team to ensure the operational stability and performance of all AI and data-driven applications post-release.
· Collaborate with Product, Engineering and Operations teams to align release and SRE strategies with business objectives.
Release Planning & Coordination:
· Oversee the full lifecycle of software and AI model releases, from planning and coordination to post-release evaluation.
· Develop and maintain a detailed release calendar that aligns with the timelines and priorities of various product teams.
· Coordinate release activities with multiple cross-functional teams, ensuring transparent communication of dependencies, risks, and milestones.
· Ensure that all releases are integrated seamlessly into production, minimizing downtime and disruptions to end users.
Site Reliability Engineering (SRE) Team Formation:
· Hire, build, and lead the SRE team responsible for maintaining the reliability, scalability, and performance of all Data and AI products in production.
· Define the roles and responsibilities of the SRE team, ensuring clear alignment with the goals of product engineering and release management.
· Develop and implement SRE best practices, including incident response, root cause analysis, and proactive performance monitoring.
· Establish SLAs, SLOs, and SLIs (Service Level Agreements/Objectives/Indicators) to track and measure the reliability and performance of all services post-release.
· Collaborate with DevOps to ensure that automated CI/CD pipelines integrate seamlessly with SRE processes and monitoring systems.
Process Optimization & Automation:
· Lead the automation of software release processes, with an emphasis on CI/CD pipelines for AI models, data pipelines, and cloud-based AI products.
· Develop infrastructure-as-code practices to improve the scalability and reliability of AI and data systems across production environments.
· Introduce tools for version control, model governance, and monitoring for MLOps and AI model management in production.
· Continuously improve operational procedures to reduce the number of incidents and optimize recovery time.
Risk & Quality Management:
· Implement comprehensive quality assurance and validation processes to ensure that all AI models, data products, and software releases meet security, performance, and compliance requirements.
· Proactively identify and mitigate risks related to releases, AI model performance, and operational stability in production.
· Conduct post-release reviews and retrospectives to continuously improve both the release process and the reliability of products.
Collaboration & Stakeholder Management:
· Serve as the central point of contact for release management and SRE-related matters, ensuring consistent communication between engineering, product teams, and key stakeholders.
· Facilitate cross-functional collaboration to ensure that releases and operational reliability goals are met efficiently and effectively.
· Provide regular updates on release progress, system reliability, and any potential risks to executives and product leadership.
Innovation & Continuous Improvement:
· Stay up to date with the latest trends in SRE, DevOps, AI/ML, and cloud operations, incorporating new tools and practices to improve the overall reliability and release processes.
· Drive the adoption of cutting-edge tools in MLOps, AI model deployment, and automated incident resolution to continuously optimize operations and model lifecycle management.
· Foster a culture of continuous improvement by encouraging feedback loops and metrics-driven decision-making across both the release management and SRE teams.
---
Qualifications:
· Bachelor’s or Master’s degree in Computer Science, Data Engineering, AI/ML, or a related field.
· 10+ years of experience in software release management, with at least 3-5 years in SRE or DevOps environments, preferably in AI or data-driven applications.
· Proven experience building and managing both release management and SRE teams in complex, multi-product environments.
· Strong knowledge of AI/ML operations (MLOps), data pipeline management, and cloud-based AI product deployments.
· Expertise in release management tools (Jenkins, GitLab, Git, Jira) and SRE tools such as Prometheus, Grafana, Datadog, or similar monitoring systems.
· Experience with cloud platforms (AWS, GCP, Azure), containerization (Kubernetes, Docker), and infrastructure automation tools (Terraform, Ansible).
· Excellent problem-solving, organizational, and leadership skills, with a strong track record of driving continuous improvement in both release and operational reliability processes.
Preferred Qualifications:
· Experience deploying and maintaining large-scale AI/ML models in production environments, including monitoring, retraining, and operationalization.
· Familiarity with ITIL, MLOps, or DevOps frameworks and best practices.
· Knowledge of cloud-based services and frameworks designed for AI/ML (e.g., AWS SageMaker, TensorFlow, PyTorch).
· Demonstrated ability to manage incident response and root cause analysis in complex software ecosystems.
-
Site Reliability Engineer
2 weeks ago
Tamil Nadu, India Tata Consultancy Services Full time
TCS has been a great pioneer in feeding the fire of young techies like you. We are a global leader in the technology arena and there’s nothing that can stop us from growing together. What we are looking for Role: Site Reliability Engineer Experience Range: 8 – 12 Years Location: Pune & Chennai, Bangalore, Delhi Must-Have: Essential: Exceptional skills in...
-
Site Reliability Engineering Manager
2 weeks ago
Tamil Nadu, India Centific Full time
Centific is a Seattle-based tech company pioneering the future of AI one breakthrough at a time. Learn how we’re transforming the world through safe and scalable AI and empowering businesses to unlock the full potential of their data. SRE CORE Manager/Lead/Director Key Responsibilities: Strategic Leadership & Vision: · Lead and manage the Software...
-
Site Reliability Engineer
2 weeks ago
Tamil Nadu, India Viasat Full time
About us: One team. Global challenges. Infinite opportunities. At Viasat, we’re on a mission to deliver connections with the capacity to change the world. For more than 35 years, Viasat has helped shape how consumers, businesses, governments and militaries around the globe communicate. We’re looking for people who think big, act fearlessly, and create an...
-
Senior Platform Manager
4 weeks ago
Chennai/Tamil Nadu, India MX Build Technologies Full time
About the Role: We are seeking a seasoned Senior Manager of Platform Automation and Site Reliability Engineering (SRE) to join our team at MX Build Technologies. As a key member of our technology organization, you will be responsible for leading and scaling our platform automation and SRE initiatives. Key Responsibilities: Lead a team of platform engineers,...
-
Site Reliability Engineer
3 weeks ago
Tamil Nadu, India Reflections Info Systems Full time
Introduction: We are looking for candidates with 3+ years of experience for this role. Responsibilities include: - Work closely with the application support team. - Monitor critical applications and services to minimize downtime and ensure their availability. - Collaborate with DevOps teams to maintain and monitor CI/CD pipelines. - Deploy new versions to production...
-
Tamil Nadu, India Centific Full time
Centific is a Seattle-based tech company pioneering the future of AI one breakthrough at a time. Learn how we’re transforming the world through safe and scalable AI and empowering businesses to unlock the full potential of their data. Head / Director SRE Key Responsibilities: Strategic Leadership & Vision: · Lead and manage the Software Release Management...
-
Site Supervisor
6 months ago
Chennai, Tamil Nadu, India Buildfic Engineering Private ltd Full time
**Site supervisor Responsibilities**: - Inspecting construction sites regularly to identify and eliminate potential safety hazards. - Supervising and instructing the construction team as well as subcontractors. - Educating site workers on construction safety regulations and accident protocol. - Enforcing site safety rules to minimize work-related accidents...
-
Site Reliability Engineer
2 months ago
Chennai, Tamil Nadu, India Spruce IT Pvt. Ltd. Full time
Job Opportunity: SRE Engineer at Spruce InfoTech, Inc. Position: SRE Engineer. Location: Chennai. Experience: 8+ years. Job Type: Long-term C2H (Contract-to-Hire). Job Description: Spruce InfoTech, Inc. is seeking a skilled SRE Engineer to join our team in Chennai. - The ideal candidate will have extensive experience with OpenShift Cluster, Linux,...
-
Site Supervisor
3 months ago
Oragadam, Chennai, Tamil Nadu, India Zoe Engineering Full time
**Site Supervisor Roles**: - **On-Site Management**: Directly oversee daily operations on the construction site. - **Team Leadership**: Manage and guide the construction crew and subcontractors. - **Quality and Safety Oversight**: Ensure that work is completed to quality standards and safety regulations. **Responsibilities**: - **Project Oversight**: -...
-
Chennai/Tamil Nadu, Tamil Nadu, India MX Build Technologies Full time
Life at MX: We are driven by our moral imperative to advance mankind - and it all starts with our people, product and purpose. We always carry a deep sense of drive and passion with us. If you thrive in a challenging work environment, surrounded by incredible team members who will help you grow, MX is the right place for you. Come build with us and be part of...
-
Site Civil Engineer
2 weeks ago
Tamil Nadu, India MERIT LEAGUE (Civil and Interior Contractors) Full time
About Us: Merit League Civil and Interior Contractors is a leading construction company committed to delivering high-quality infrastructure projects. Our team is dedicated to excellence, innovation, and sustainability. We pride ourselves on our expertise in both civil and interior contracting, ensuring comprehensive solutions for our clients. For more...
-
Site Foreman
5 months ago
Palavakkam, Chennai, Tamil Nadu, India Mascons Engineering & Full time
**Requirement**: Experience: Minimum 8 Years in same field **Site Foreman Responsibilities**: **Site Management**: - Overseeing all aspects of the construction site, including safety, quality, and productivity **Work Crew Supervision**: - Managing construction workers, subcontractors, and laborers, assigning tasks, and ensuring that they follow safety...
-
Site Civil Engineer
2 weeks ago
Tamil Nadu, India Sepam Full time
Established in 1976, SEPAM is an expert, global engineering and full service project management firm. SEPAM forged its reputation for excellence in the heavy industrial and Oil & Gas engineering sectors. Today it has evolved into a world-class provider of solutions and services to the Energy, ICT, Life Sciences and Advanced Technology sectors (among others)...
-
Site Reliability Engineer
3 weeks ago
Tamil Nadu, India Altimetrik Full time
Job Description: Around 3-8 years of hands-on SRE exposure with troubleshooting and triaging, development, SRE toolsets and automation. Key Responsibilities: Mandatory Skills - Should have a development (Java, .Net or Python) background with strong code handling capabilities, or at least should have supporting experience (L3) of applications and scaled infrastructure...