Senior Reliability Engineer

2 days ago


Ahmedabad, Gujarat, India beBeeReliability Full time US$ 2,00,000 - US$ 2,50,000

We are seeking an experienced Senior Reliability Engineer to join our team. In this role, you will be responsible for ensuring the high availability, performance, and efficiency of our SaaS application on Azure. This includes defining and enforcing reliability standards, leading high-impact projects, mentoring engineers, and eliminating toil at scale.

Key Responsibilities:

  • Reliability Standards: Define customer-centric SLIs/SLOs for Tier-0/Tier-1 services and publish quarterly reviews. Align teams with these standards to ensure everyone is working towards the same goals.
  • Error Budgeting: Implement an error-budget policy with multi-window, multi-burn-rate alerts, clear runbooks, and paging thresholds. Gate changes by budget status (freeze/relax rules) wired into CI/CD.
  • SLO/EB Dashboards: Maintain SLO/EB dashboards using Azure Monitor, Grafana/Prometheus, and App Insights. Run weekly SLO reviews with engineering/product teams.
  • Incident Response: Lead SEV1/SEV2 incidents without drama. Own comms, run blameless postmortems, and make corrective actions stick.
  • Engineering Reliability: Engineer reliability in multi-AZ/region patterns (active-active/DR), PDBs/Pod Topology Spread, HPA/VPA/KEDA, resilient rollout/rollback.
  • Azure Expertise: Harden AKS clusters (network, identity, policy); optimize node/pod density and ingress (AGIC/NGINX); service mesh optional.
  • Observability: Metrics/traces/logs with Azure Monitor/App Insights, Log Analytics, Prometheus/Grafana, OpenTelemetry. Alert on symptoms, not noise.
  • IaC & Policy: Terraform/Bicep modules, GitOps (Flux/Argo), policy-as-code (Azure Policy/OPA Gatekeeper). No snowflakes.
  • CI/CD Reliability: Azure DevOps/GitHub Actions with canary/blue-green, progressive delivery, auto-rollback, Key Vault-backed secrets.
  • Capacity & Performance: Load testing, right-sizing, autoscaling; partner with FinOps to reduce spend without hurting SLOs.
  • Disaster Recovery: Define RTO/RPO, test backups/restore, run game days/chaos drills, validate ASR and multi-region failover.
  • Security: Entra ID (Azure AD), managed identities, Key Vault rotation, VNets/NSGs/Private Link, shift-left checks in CI.
  • Toil Reduction: Automate recurring ops, build self-service runbooks/chatops, publish golden paths for product teams.
  • Customer Escalations: Be the technical owner on calls; communicate tradeoffs and recovery plans with authority.
  • Documentation: Architectures, runbooks, postmortems, SLIs/SLOs—kept current and discoverable.
  • Data Reliability: Apply SRE practices (SLOs, backpressure, idempotency, replay) to NiFi/Flink/Kafka/Redpanda data flows.
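
For candidates less familiar with the terminology, the multi-window, multi-burn-rate alerting and budget-based change gating named under Error Budgeting can be sketched roughly as below. This is an illustrative Python sketch only, not a description of our actual tooling: the 99.9% SLO target, the window pairs, the 14.4x/6x/1x thresholds (common SRE-practice defaults for a 30-day window), and every function and class name are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    """One alert rule: fire only if BOTH windows burn faster than the threshold."""
    long_window: str
    short_window: str
    threshold: float  # multiple of the "exactly on budget" burn rate
    action: str       # hypothetical routing label: "page" or "ticket"


# Assumed defaults for a 30-day SLO window; tune per service.
RULES = (
    BurnRateRule("1h", "5m", 14.4, "page"),
    BurnRateRule("6h", "30m", 6.0, "page"),
    BurnRateRule("3d", "6h", 1.0, "ticket"),
)


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error budget (1 - SLO target)."""
    return error_ratio / (1.0 - slo_target)


def evaluate(error_ratios: dict[str, float], slo_target: float = 0.999) -> list[str]:
    """Return the action of every rule whose long AND short windows both exceed it."""
    actions = []
    for rule in RULES:
        long_rate = burn_rate(error_ratios[rule.long_window], slo_target)
        short_rate = burn_rate(error_ratios[rule.short_window], slo_target)
        if long_rate >= rule.threshold and short_rate >= rule.threshold:
            actions.append(rule.action)
    return actions


def deploys_allowed(budget_remaining: float) -> bool:
    """Hypothetical CI/CD gate: freeze risky changes once the budget is spent."""
    return budget_remaining > 0.0
```

Requiring the short window to agree with the long one keeps an alert from continuing to fire long after the error rate has recovered. The gate function stands in for whatever freeze/relax rule a pipeline would actually query.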

Requirements:

  • Bachelor's degree in Computer Science or related field.
  • At least 12 years of experience in production operations, platform engineering, or SRE, including 5+ years on Azure.
  • Deep operational expertise in PostgreSQL, including HA/DR, logical/physical replication, performance tuning, autovacuum strategy, partitioning, backup/restore testing, and connection pooling.
  • Azure core skills: AKS, Front Door/App Gateway, API Management, VNets/NSGs/Private Link, Storage, Key Vault, Redis, Service Bus/Event Hubs.
  • Observability skills: Azure Monitor/App Insights, Log Analytics, Prometheus/Grafana, SLO design, error-budget operations.
  • IaC/automation skills: Terraform and/or Bicep, PowerShell and Python, GitOps (Flux/Argo). Pipelines in Azure DevOps or GitHub Actions.
  • Proven incident leadership at scale, blameless postmortems, SLO/error-budget governance with change gating.
  • Mentorship and crisp written/verbal communication.

Preferred Skills:

  • Apache NiFi, Apache Flink, Apache Kafka or Redpanda; schema management, exactly-once semantics, backpressure, dead-letter/replay patterns.
  • Azure Solutions Architect Expert, CKA/CKAD.
  • ITSM (ServiceNow), on-call tooling (PagerDuty/Opsgenie).
  • Compliance/SecOps (SOC 2, ISO 27001), policy-as-code, workload identity.
  • OpenTelemetry, eBPF tooling, or service mesh.
  • Multi-tenant SaaS and cost optimization at scale.

