Site Reliability Engineer II
15 hours ago
Data is at the core of modern business, yet many teams struggle with its overwhelming volume and complexity. At Atlan, we're changing that. As the world's first active metadata platform, we help organisations transform data chaos into clarity and seamless collaboration.
From Fortune 500 leaders to hyper-growth startups, from automotive innovators redefining mobility to healthcare organisations saving lives, and from Wall Street powerhouses to Silicon Valley trailblazers — we empower ambitious teams across industries to unlock the full potential of their data.
Recognised as leaders by Gartner and Forrester and backed by Insight Partners, Atlan is at the forefront of reimagining how humans and data work together. Joining us means becoming part of a movement to shape a future where data drives extraordinary outcomes.
Why this role matters
As a key member of Atlan's Platform & Reliability Engineering Team, your core responsibility will be to strengthen our alert management and incident response capabilities, ensuring every customer experience remains fast, reliable, and uninterrupted.
Whether you're handling production incidents, automating operational workflows, or enhancing observability and monitoring, your work will directly contribute to Atlan's mission of empowering modern data teams with a resilient and seamless platform.
At Atlan, we're building high-performance, reliability-driven engineering teams across every function — and this role is foundational. We're looking for curious, self-driven engineers who thrive under pressure, love solving real-world reliability challenges, and are passionate about keeping systems stable as we scale globally.
We value engineers who use data, automation, and deep systems thinking to make reliability a core part of how we build and operate: not just a function, but a culture.
Your Mission at Atlan
Own and operate end-to-end reliability for critical systems, from alert triage and incident resolution to long-term preventive improvements.
Proactively manage incidents within defined SLAs (60 mins for Critical, 180 mins for High) and ensure smooth collaboration across teams during resolution.
Enhance observability by improving monitoring systems, refining alerts, and reducing noise to focus on what truly matters.
Automate operations and incident workflows to eliminate manual toil, improving speed, consistency, and reliability.
Collaborate across teams — work with Platform, Observability, and Product Engineering teams to strengthen uptime and service stability.
Contribute to documentation and playbooks, ensuring that every incident drives learning, process improvement, and team efficiency.
Proven experience managing alerts, incidents, and root cause analyses in production environments.
Hands-on knowledge of cloud platforms (AWS, GCP, or Azure) and Kubernetes — including networking, deployments, and troubleshooting.
Familiarity with monitoring and observability tools such as Prometheus, Grafana, ELK/EFK, or Datadog.
Ability to automate repetitive operational tasks using scripting (Python, Bash, or Shell).
Strong communication and collaboration skills — especially in distributed or remote-first teams.
A mindset of ownership, curiosity, and calm under pressure — you thrive in incident response and turn challenges into learning opportunities.
Real impact from Day 1: Your work directly shapes reliability for thousands of users across the globe.
Modern tech stack: Work with cutting-edge tools — Kubernetes, Terraform, Prometheus, Datadog, and more.
Learning culture: Collaborate with world-class platform engineers and senior SREs who believe in mentorship and continuous growth.
Autonomy & trust: Freedom to experiment, improve, and own your work end-to-end.
Clear growth path: Grow from SRE II → Senior SRE → Senior SRE II → Staff SRE → Principal SRE as you expand your technical depth and ownership scope.
Help build the backbone of Atlan's global data platform.
Turn reactive operations into proactive reliability.
Be part of a culture that treats reliability not as a checklist — but as a craft.
Why Atlan for You?
At Atlan, we believe the future belongs to the humans of data. From curing diseases to advancing space exploration, data teams are powering humanity's greatest achievements. Yet, working with data can be chaotic—our mission is to transform that experience. We're reimagining how data teams collaborate by building the home they deserve, enabling them to create winning data cultures and drive meaningful progress.
Joining Atlan means:
Ownership from Day One: Whether you're an intern or a full-time teammate, you'll own impactful projects, chart your growth, and collaborate with some of the best minds in the industry.
Limitless Opportunities: At Atlan, your growth has no boundaries. If you're ready to take initiative, the sky's the limit.
A Global Data Community: We're deeply embedded in the modern data stack, contributing to open-source projects, sponsoring meet-ups, and empowering team members to grow through conferences and learning opportunities.
As a fast-growing, fully remote company trusted by global leaders like Cisco, Nasdaq, and HubSpot, we're creating a category-defining platform for data and AI governance. Backed by top investors, we've achieved 7X revenue growth in two years and are building a talented team spanning 15+ countries.
If you're ready to do your life's best work and help shape the future of data collaboration, join Atlan and become part of a mission to empower the humans of data to achieve more, together.
We are an equal opportunity employer
At Atlan, we're committed to helping data teams do their lives' best work. We believe that diversity and authenticity are the cornerstones of innovation, and by embracing varied perspectives and experiences, we can create a workplace where everyone thrives. Atlan is proud to be an equal opportunity employer and does not discriminate based on race, color, religion, national origin, age, disability, sex, gender identity or expression, sexual orientation, marital status, military or veteran status, or any other characteristic protected by law.