
Experienced Cloud Reliability Engineer
2 days ago
**About the Role:**
We're looking for a seasoned Reliability Expert to take ownership of our cloud infrastructure's performance, efficiency, and availability.
The ideal candidate will define and implement reliability standards, lead high-impact projects, mentor engineers, and streamline operations to reduce toil.
This is an exciting opportunity to work at the forefront of cloud reliability and make a significant impact on our organization's success.
Main Responsibilities:
- Service Level Agreements (SLAs): Develop and maintain customer-centric SLAs for critical services, ensuring alignment across teams.
- Error Budgeting: Establish and manage error budgets, implementing multi-window alerts, clear runbooks, and secure CI/CD gating.
- SLO/EB Dashboards: Design and maintain comprehensive dashboards in Azure Monitor, Grafana/Prometheus, and App Insights for SLO reviews with engineering/product teams.
- Roadmap Tradeoffs: Drive informed roadmap decisions when error budgets are at risk, and land reliability epics.
- Incident Management: Lead SEV1/SEV2 incidents, own communications, conduct blameless postmortems, and ensure corrective actions are implemented.
- Cloud Reliability Engineering: Implement multi-AZ/region patterns, PDBs/Pod Topology Spread, HPA/VPA/KEDA, and resilient rollouts/rollbacks.
- Azure Kubernetes Service (AKS) Optimization: Harden clusters, optimize node/pod density, and enhance ingress security.
- Observability: Implement metrics/traces/logs with Azure Monitor/App Insights, Log Analytics, Prometheus/Grafana, and OpenTelemetry, focusing on symptom-based alerting.
- Infrastructure as Code (IaC)/Automation: Use Terraform/Bicep modules, GitOps (Flux/Argo), and policy-as-code (Azure Policy/OPA Gatekeeper) for secure and scalable infrastructure management.
- CI/CD Reliability: Ensure reliable pipelines in Azure DevOps or GitHub Actions, incorporating canary/blue-green deployments, progressive delivery, auto-rollback, and Key Vault-backed secrets.
- Capacity/Performance Optimization:
- Data Protection: Define RTO/RPO, test backup/restore, conduct game days/chaos drills, and validate Azure Site Recovery (ASR) and multi-region failover.
- Security: Implement Entra ID (Azure AD), managed identities, Key Vault rotation, VNets/NSGs/Private Link, and shift-left checks in CI for secure-by-default practices.
- Toil Reduction: Automate recurring operations, build self-service runbooks/chatops, and publish golden paths for product teams.
- Customer Escalations: Serve as technical owner on calls, communicating tradeoffs and recovery plans with authority.
- Documentation: Maintain up-to-date architectures, runbooks, postmortems, and SLIs/SLOs to support scalability and knowledge sharing.
Requirements:
- Bachelor's degree in Computer Science/Engineering or equivalent experience.
- 12+ years of production ops/platform/SRE experience, including 5+ years on Azure.
- PostgreSQL: Deep operational expertise in HA/DR, logical/physical replication, performance tuning, autovacuum strategy, partitioning, backup/restore testing, and connection pooling.
- Azure core: AKS (must-have); Front Door/App Gateway, API Management, VNets/NSGs/Private Link, Storage, Key Vault, Redis, Service Bus/Event Hubs.
- Observability: Azure Monitor/App Insights, Log Analytics, Prometheus/Grafana; SLO design and error-budget operations.
- IaC/automation: Terraform and/or Bicep; PowerShell and Python; GitOps (Flux/Argo). Pipelines in Azure DevOps or GitHub Actions.
- Proven incident leadership at scale, blameless postmortems, and SLO/error-budget governance with change gating.
- Mentorship and crisp written/verbal communication.
Preferred Qualifications:
- Apache NiFi, Flink, Kafka or Redpanda (self-managed on AKS or managed equivalents); schema management, exactly-once semantics, backpressure, dead-letter/replay patterns.
- Azure Solutions Architect Expert, CKA/CKAD.
- ITSM (ServiceNow), on-call tooling (PagerDuty/Opsgenie).
- Compliance/SecOps (SOC 2, ISO 27001), policy-as-code, workload identity.
- OpenTelemetry, eBPF tooling, or service mesh.
- Multi-tenant SaaS and cost optimization at scale.