Manager, Site Reliability Engineering

3 weeks ago

Kanpur, Uttar Pradesh, India Palo Alto Networks Full time

Our Mission

At Palo Alto Networks everything starts and ends with our mission:

Being the cybersecurity partner of choice, protecting our digital way of life.

Our vision is a world where each day is safer and more secure than the one before. We are a company built on the foundation of challenging and disrupting the way things are done, and we're looking for innovators who are as committed to shaping the future of cybersecurity as we are.

Who We Are

We take our mission of protecting the digital way of life seriously. We are relentless in protecting our customers, and we believe that the unique ideas of every member of our team contribute to our collective success. Our values were crowdsourced by employees and are brought to life through each of us every day, from disruptive innovation and collaboration to execution, and from showing up for each other with integrity to creating an environment where we all feel included.

As a member of our team, you will be shaping the future of cybersecurity. We work fast, value ongoing learning, and we respect each employee as a unique individual. Knowing we all have different needs, our development and personal wellbeing programs are designed to give you choice in how you are supported. This includes our FLEXBenefits wellbeing spending account with over 1,000 eligible items selected by employees, our mental and financial health resources, and our personalized learning opportunities - just to name a few.

At Palo Alto Networks, we believe in the power of collaboration and value in-person interactions. This is why our employees generally work full time from our office with flexibility offered where needed. This setup fosters casual conversations, problem-solving, and trusted relationships. Our goal is to create an environment where we all win with precision.

Job Description

Your Career

We're seeking an experienced Cloud SRE leader to drive high-severity incident and problem management across our GCP-centric platforms. This role combines deep technical troubleshooting with process ownership, ensuring rapid recovery, root cause elimination, and long-term reliability improvements. You will own L3 on-call responsibilities, drive post-incident learning, and champion automation and operational excellence.

Your Impact

In your technical and leadership capacity, you will contribute to seamless production site reliability operations, partnering closely with regional and global SRE counterparts, with special attention to the following:
Incident Analysis & Problem Management: Implement and lead post-mortem processes within SLAs, identify root causes, and drive corrective actions to reduce repeat incidents. Establish and maintain a problem backlog, ensuring timely resolution and continuous process improvement.
Troubleshooting: Rapidly diagnose and resolve failures across Kubernetes, Terraform, and GCP using advanced troubleshooting frameworks.
Preventative Measures: Implement automation and enhanced monitoring to proactively detect issues and reduce incident frequency.
Stakeholder Communication: Work with GCP/AWS TAMs and other vendors to request new features and follow up on updates.
Mentorship: Coach and elevate SRE and DevOps teams, promoting best practices in reliability and incident/problem management.
Documentation: Establish and maintain a problem backlog, ensuring timely resolution and continuous process improvement.
Envision the future of SRE with AI/ML: Envision how a modern SRE team should operate by leveraging AI/ML.

Qualifications

Your Experience

12+ years of experience in SRE/DevOps/Infrastructure roles, with a strong foundation in cloud-based environments.
5+ years of proven experience managing SRE/DevOps teams, preferably with a strong focus on Google Cloud Platform (GCP).
Deep hands-on knowledge of Terraform, Kubernetes (GKE), GitLab CI/CD, and modern observability practices (e.g., Prometheus, OpenTelemetry).
Strong experience in managing incident response and postmortems, reducing MTTR, and driving proactive reliability improvements.
Proficiency with cloud platforms such as GCP & AWS.
Solid grasp of Infrastructure as Code, container orchestration, and scalable cloud architectures.
Track record of building tools for system reliability, automated remediation, and performance tuning.
Experience leveraging AI/ML-based operations tools for automation, anomaly detection, and predictive alerting is a plus.
Expertise in SLI/SLO/SLA design and implementation, and driving operational maturity through data.
Strong interpersonal and leadership skills, with a demonstrated ability to coach, mentor, and inspire teams.
Effective communicator, capable of translating complex technical concepts to non-technical stakeholders.
Committed to inclusion, collaboration, and creating a culture where every voice is heard and respected.

Additional Information

The Team

To stay ahead of the curve, it's critical to know where the curve is, and how to anticipate the changes we're facing. For the fastest-growing cybersecurity company, the curve is the evolution of cyberattacks and access technology and the products and services dedicated to addressing them. Our engineering team is at the core of our products – connected directly to the mission of preventing cyberattacks and enabling secure access to all on-prem and cloud applications. They are constantly innovating – challenging the way we, and the industry, think about access and security. These engineers aren't shy about building products to solve problems no one has pursued before. They define the industry, instead of waiting for directions. We need individuals who feel comfortable in ambiguity, excited by the prospect of challenge, and empowered by the unknown risks facing our everyday lives that are only enabled by a secure digital environment.

Our engineering team is provided with an unrivaled chance to create the products and practices that will support our company growth over the next decade, defining the cybersecurity industry as we know it. If you see the potential of how incredible people and products can transform a business, this is the team for you. If the prospect of affecting tens of millions of people, enabling them to work remotely securely and easily in ways never done before, thrills you - you belong with us.

Our Commitment

We're problem solvers that take risks and challenge cybersecurity's status quo. It's simple: we can't accomplish our mission without diverse teams innovating, together.

We are committed to providing reasonable accommodations for all qualified individuals with a disability. If you require assistance or accommodation due to a disability or special need, please contact us at accommodations@paloaltonetworks.com.

Palo Alto Networks is an equal opportunity employer. We celebrate diversity in our workplace, and all qualified applicants will receive consideration for employment without regard to age, ancestry, color, family or medical care leave, gender identity or expression, genetic information, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran status, race, religion, sex (including pregnancy), sexual orientation, or other legally protected characteristics.

All your information will be kept confidential according to EEO guidelines.
