
Senior Cloud Reliability Engineer
17 hours ago
We are seeking an experienced Senior Reliability Engineer to join our team. The successful candidate will be responsible for ensuring the high availability, low latency, and optimal performance of our SaaS platform on Azure.
About the Role
As a Senior Reliability Engineer, you will play a key role in defining and enforcing reliability standards, leading high-impact projects, mentoring engineers, and eliminating toil at scale. You will work closely with the Director of SRE to achieve these goals.
Key Responsibilities
- Define Customer-Centric SLIs/SLOs: Develop and publish customer-centric Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for Tier-0/Tier-1 services, review quarterly, and align teams to them.
- Run Error Budget Policy: Implement a multi-window, multi-burn-rate error budget policy with clear runbooks and paging thresholds.
- Gate Changes: Integrate change gates into CI/CD pipelines, freezing or relaxing release rules as needed.
- Maintain SLO/EB Dashboards: Maintain Service Level Objective (SLO) and error budget (EB) dashboards using Azure Monitor, Grafana/Prometheus, and App Insights. Conduct weekly SLO reviews with engineering and product teams.
- Drive Roadmap Tradeoffs: Make data-driven decisions when error budgets are at risk, and land reliability epics.
- Incident Management: Lead SEV1/SEV2 incidents without drama, own comms, run blameless postmortems, and ensure corrective actions stick.
- Engineer Resiliency: Design and implement multi-AZ/region patterns (active-active/DR), PDBs/Pod Topology Spread, HPA/VPA/KEDA, resilient rollout/rollback strategies.
- Azure Kubernetes Service (AKS): Harden clusters (network, identity, policy), optimize node/pod density, and configure ingress (AGIC/NGINX) and, optionally, a service mesh.
- Observability: Implement metrics/traces/logs using Azure Monitor/App Insights, Log Analytics, Prometheus/Grafana, and OpenTelemetry. Alert on symptoms, not noise.
- IaC & Policy: Implement Terraform/Bicep modules, GitOps (Flux/Argo), policy-as-code (Azure Policy/OPA Gatekeeper) solutions, and eliminate snowflakes.
- CI/CD Reliability: Implement Azure DevOps/GitHub Actions with canary/blue-green, progressive delivery, auto-rollback, and Key Vault-backed secrets.
- Capacity & Performance: Partner with FinOps to right-size resources, autoscale, conduct load testing, and reduce spend without hurting SLOs.
- Disaster Recovery: Define RTO/RPO, test backups/restore, run game days/chaos drills, and validate ASR and multi-region failover.
- Security: Implement Entra ID (Azure AD), managed identities, Key Vault rotation, VNets/NSGs/Private Link, and shift-left checks in CI.
- Toil Reduction: Automate recurring ops, build self-service runbooks/chatops, and publish golden paths for product teams.
- Customer Escalations: Be the technical owner on calls; communicate tradeoffs and recovery plans with authority.
- Documentation: Maintain up-to-date architectures, runbooks, postmortems, SLIs/SLOs, and other critical documents.
Join our team and become part of a dynamic group of professionals who are passionate about delivering high-quality solutions. Enjoy a collaborative and supportive environment that fosters growth and development.
Requirements
To be successful in this role, you will need:
- Strong Technical Skills: In-depth knowledge of cloud computing, containerization, and orchestration technologies.
- Excellent Communication Skills: Ability to effectively communicate complex technical concepts to both technical and non-technical stakeholders.
- Leadership Skills: Proven ability to lead high-impact projects, mentor engineers, and drive business outcomes.
If you are a motivated and detail-oriented individual with a passion for delivering high-quality solutions, we encourage you to apply for this exciting opportunity.
-
Cloud Site Reliability Engineer
5 days ago
Hyderabad, Telangana, India Careernet Full time ₹ 1,04,000 - ₹ 1,30,878 per year
Key Skills: Cloud, Kubernetes, Python, Jenkins, OpenTelemetry, AppDynamics, Site Reliability Engineer.
Roles & Responsibilities:
- Design, implement, and manage cloud infrastructure to ensure high availability and reliability.
- Utilize Kubernetes for container orchestration and management.
- Develop and maintain monitoring solutions using OpenTelemetry and...
-
Senior Site Reliability Engineer
6 days ago
Hyderabad, Telangana, India Microsoft Full time
The Windows Cloud division is looking for a Senior Site Reliability Engineer who will help us take the Windows Cloud platform, as well as the Windows 365 Cloud PC and Azure Virtual Desktop business, to the next level. Windows 365 Cloud PC (W365) and Azure Virtual Desktop (AVD) have recently been recognized as leaders in the Gartner Magic Quadrant™ for...
-
Cloud Reliability Specialist
1 week ago
Hyderabad, Telangana, India beBeeAzureSre Full time ₹ 15,00,000 - ₹ 25,00,000
Reliable Cloud Engineer Role
This is a key role that ensures the reliability, scalability, and security of cloud services.
Responsibilities:
- Monitor and troubleshoot cloud infrastructure and applications
- Collaborate with cross-functional teams to resolve issues and implement improvements
- Develop and maintain cloud resources and automation scripts
- Perform capacity...
-
Reliability Engineering Manager
2 days ago
Hyderabad, Telangana, India beBeeEngineering Full time ₹ 15,00,000 - ₹ 20,00,000
Job Title: Reliability Engineering Manager
This senior-level position oversees the establishment and implementation of organizational reliability strategies, aligning Service Level Agreements (SLAs), Service Level Objectives (SLOs), and Error Budgets with business goals and customer expectations. The ideal candidate will lead technical reviews for...
-
Hyderabad, Telangana, India beBeeReliability Full time US$ 1,03,633 - US$ 1,56,993
Job Title: Senior Reliability Engineering Specialist
Job Summary: We are seeking a senior reliability engineering specialist to join our observability team. As a generalist, you will collaborate with product development teams, cloud infrastructure, and other SRE teams to ensure effective observability and improve the reliability of our products and...
-
Senior Cloud Engineer
5 days ago
Hyderabad, Telangana, India ConnectedX Inc. Full time ₹ 1,04,000 - ₹ 1,30,878 per year
We're Hiring at ConnectedX Inc. - Senior Cloud Engineer
Are you passionate about cloud-native applications, automation, and cutting-edge infrastructure? We're looking for a Senior Cloud Engineer to join our team and drive innovation in our Enterprise Platform & Automation domain.
Location: Hyderabad
What you'll do: Implement strategic technology roadmaps &...
-
Senior Site Reliability Engineer
1 week ago
Hyderabad, Telangana, India Microsoft Full time ₹ 9,00,000 - ₹ 12,00,000 per year
The Windows Cloud division is looking for a Senior Site Reliability Engineer who will help us take the Windows Cloud platform, as well as the Windows 365 Cloud PC and Azure Virtual Desktop business, to the next level. Windows 365 Cloud PC (W365) and Azure Virtual Desktop (AVD) have recently been recognized as leaders in the Gartner Magic Quadrant for Desktop...
-
Cloud-Native Reliability Engineer
1 week ago
Hyderabad, Telangana, India beBeeResilience Full time ₹ 2,00,00,000 - ₹ 2,50,00,000
Reliability Engineering Lead
Drive the development of resilient systems and processes that deliver high-quality experiences for users. As a key member of our global SRE practice, you will support 260+ cloud-native applications across diverse functions.
Prevent incidents before they occur, ensure rapid recovery when they do, and build scalable systems that...
-
Site Reliability Engineer
2 days ago
Hyderabad, Telangana, India Jigya Software Services Full time ₹ 1,50,000 - ₹ 28,00,000 per year
Job Title: Senior Site Reliability Engineer (SRE) - AWS/Kubernetes
Location: Hyderabad - Onsite
Job Type: Full-Time
About the Role: We are looking for a highly skilled and motivated Site Reliability Engineer to design, build, and maintain our high-performance, scalable cloud infrastructure. You will play a critical role in ensuring the reliability, performance, and...
-
Senior Site Reliability Engineer
3 days ago
Hyderabad, Telangana, India CloudHire Full time ₹ 7,00,000 - ₹ 12,00,000 per year
Job Summary
The Technical Manager for Site Reliability Engineering (SRE) will lead a remote team of Site Reliability Engineers, ensuring operational excellence and fostering a high-performing team culture. Reporting to the US-based Director of Systems and Security, this role is responsible for overseeing day-to-day operations, technical mentorship, and...