Site Reliability Engineer II
2 days ago
Senior Site Reliability Engineer (SRE II)
Own availability, latency, performance, and efficiency for Zafin's SaaS on Azure. You'll define and enforce reliability standards, lead high-impact projects, mentor engineers, and eliminate toil at scale.
Reports to the Director of SRE.
What you'll do
- SLIs/SLOs & contracts: Define customer-centric SLIs/SLOs for Tier-0/Tier-1 services. Publish, review quarterly, and align teams to them.
- Error budgeting (policy & tooling):
  - Run the error-budget policy with multi-window, multi-burn-rate alerts; clear runbooks and paging thresholds.
  - Gate changes by budget status (freeze/relax rules) wired into CI/CD.
  - Maintain SLO/EB dashboards (Azure Monitor, Grafana/Prometheus, App Insights). Run weekly SLO reviews with engineering/product.
  - Drive roadmap tradeoffs when budgets are at risk; land reliability epics.
- Incidents without drama: Lead SEV1/SEV2, own comms, run blameless postmortems, and make corrective actions stick.
- Engineer reliability in: Multi-AZ/region patterns (active-active/DR), PDBs/Pod Topology Spread, HPA/VPA/KEDA, resilient rollout/rollback.
- AKS at scale: Harden clusters (network, identity, policy), optimize node/pod density, ingress (AGIC/Nginx); mesh optional.
- Observability that works: Metrics/traces/logs with Azure Monitor/App Insights, Log Analytics, Prometheus/Grafana, OpenTelemetry. Alert on symptoms, not noise.
- IaC & policy: Terraform/Bicep modules, GitOps (Flux/Argo), policy-as-code (Azure Policy/OPA Gatekeeper). No snowflakes.
- CI/CD reliability: Azure DevOps/GitHub Actions with canary/blue-green, progressive delivery, auto-rollback, Key Vault-backed secrets.
- Capacity & performance: Load testing, right-sizing, autoscaling; partner with FinOps to reduce spend without hurting SLOs.
- DR you can trust: Define RTO/RPO, test backups/restore, run game days/chaos drills, validate ASR and multi-region failover.
- Secure by default: Entra ID (Azure AD), managed identities, Key Vault rotation, VNets/NSGs/Private Link, shift-left checks in CI.
- Reduce toil: Automate recurring ops, build self-service runbooks/chatops, publish golden paths for product teams.
- Customer escalations: Be the technical owner on calls; communicate tradeoffs and recovery plans with authority.
- Document to scale: Architectures, runbooks, postmortems, SLIs/SLOs, all kept current and discoverable.
- (If applicable) Streaming/ETL reliability: Apply SRE practices (SLOs, backpressure, idempotency, replay) to NiFi/Flink/Kafka/Redpanda data flows.
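For candidates less familiar with the error-budgeting items above, here is a minimal Python sketch of multi-window, multi-burn-rate alerting and a budget-based change gate. It is illustrative only and does not describe Zafin's actual policy or tooling. The window pairs and thresholds (14.4, 6, 1) are a commonly cited scheme for a 30-day SLO window and are assumptions here, as are all function and variable names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRule:
    long_window: str   # catches sustained burn
    short_window: str  # confirms the burn is still happening now
    threshold: float   # burn-rate multiple that triggers the action
    action: str        # "page" or "ticket"


# Assumed thresholds for a 30-day SLO window (hypothetical policy).
RULES = (
    BurnRule("1h", "5m", 14.4, "page"),
    BurnRule("6h", "30m", 6.0, "page"),
    BurnRule("3d", "6h", 1.0, "ticket"),
)


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Error ratio expressed as a multiple of the allowed error ratio."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return error_ratio / (1.0 - slo_target)


def evaluate(error_ratios: dict[str, float], slo_target: float) -> list[str]:
    """Return the action of every rule whose long AND short windows both burn too fast."""
    actions = []
    for rule in RULES:
        long_burn = burn_rate(error_ratios[rule.long_window], slo_target)
        short_burn = burn_rate(error_ratios[rule.short_window], slo_target)
        if long_burn >= rule.threshold and short_burn >= rule.threshold:
            actions.append(rule.action)
    return actions


def budget_remaining(total: int, failed: int, slo_target: float) -> float:
    """Fraction of the error budget left; negative means the budget is overspent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failures = total * (1.0 - slo_target)
    return 1.0 - failed / allowed_failures


def deploy_allowed(remaining: float, freeze_below: float = 0.0) -> bool:
    """Change gate: freeze risky deploys once the remaining budget hits the floor."""
    return remaining > freeze_below


if __name__ == "__main__":
    # Observed error ratios per window for a 99.9% SLO (made-up numbers).
    ratios = {"5m": 0.03, "30m": 0.01, "1h": 0.02, "6h": 0.004, "3d": 0.002}
    print(evaluate(ratios, slo_target=0.999))
    print(deploy_allowed(budget_remaining(1_000_000, 400, 0.999)))
```

Requiring both windows to exceed the threshold keeps a brief spike from paging anyone and lets the alert clear soon after the burn stops. In a CI/CD pipeline, a check like `deploy_allowed` would typically run as a pre-deploy step that fails the job while the budget is frozen.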
Minimum qualifications
- Bachelor's in CS/Engineering (or equivalent experience).
- 12+ years in production ops/platform/SRE, including 5+ years on Azure.
- PostgreSQL (must-have): Deep operational expertise incl. HA/DR, logical/physical replication, performance tuning (indexes/EXPLAIN/ANALYZE, pg_stat_statements), autovacuum strategy, partitioning, backup/restore testing, and connection pooling (PgBouncer). Prefer experience with Azure Database for PostgreSQL – Flexible Server.
- Azure core: AKS (must-have); Front Door/App Gateway, API Management, VNets/NSGs/Private Link, Storage, Key Vault, Redis, Service Bus/Event Hubs.
- Observability: Azure Monitor/App Insights, Log Analytics, Prometheus/Grafana; SLO design and error-budget operations.
- IaC/automation: Terraform and/or Bicep; PowerShell and Python; GitOps (Flux/Argo). Pipelines in Azure DevOps or GitHub Actions.
- Proven incident leadership at scale, blameless postmortems, and SLO/error-budget governance with change gating.
- Mentorship and crisp written/verbal communication.
Preferred (nice to have)
- Apache NiFi, Apache Flink, Apache Kafka or Redpanda (self-managed on AKS or managed equivalents); schema management, exactly-once semantics, backpressure, dead-letter/replay patterns.
- Azure Solutions Architect Expert, CKA/CKAD.
- ITSM (ServiceNow), on-call tooling (PagerDuty/Opsgenie).
- Compliance/SecOps (SOC 2, ISO 27001), policy-as-code, workload identity.
- OpenTelemetry, eBPF tooling, or service mesh.
- Multi-tenant SaaS and cost optimization at scale.
-
Senior Site Reliability Engineer
4 weeks ago
Thiruvananthapuram, Kerala, India Zafin Full time Job Summary: Zafin is seeking a Cloud Site Reliability Engineer II (CSRE II) to lead strategic initiatives in ensuring the reliability, scalability, and performance of our cloud infrastructure and applications. This advanced role requires mastery in cloud technologies, strategic planning, and incident management to drive innovative solutions and operational...
-
Senior Site Reliability Engineer
3 weeks ago
Thiruvananthapuram, Kerala, India Zafin Full time Job Summary: Zafin is seeking a Cloud Site Reliability Engineer II (CSRE II) to lead strategic initiatives in ensuring the reliability, scalability, and performance of our cloud infrastructure and applications. This advanced role requires mastery in cloud technologies, strategic planning, and incident management to drive innovative solutions and operational...
-
Senior Site Reliability Engineer
5 days ago
Thiruvananthapuram, Kerala, India Zafin Full time US$ 1,50,000 - US$ 2,00,000 per year Job Summary: Zafin is seeking a Cloud Site Reliability Engineer II (CSRE II) to lead strategic initiatives in ensuring the reliability, scalability, and performance of our cloud infrastructure and applications. This advanced role requires mastery in cloud technologies, strategic planning, and incident management to drive innovative solutions and operational...
-
Site Reliability Engineer
4 days ago
Thiruvananthapuram, Kerala, India UST Full time US$ 90,000 - US$ 1,20,000 per year 5 - 7 Years 5 Openings Trivandrum Role description: UST Global is seeking a highly skilled Site Reliability Engineer (SRE) to work with one of the leading financial services organizations in the US. This role involves managing the end-to-end application and system stack, ensuring high reliability, scalability, and performance of distributed systems. As an SRE,...
-
Site Reliability Engineer
3 hours ago
Thiruvananthapuram, Kerala, India Apexsync Technologies Full time Hello Everyone, We're looking for an experienced Site Reliability Engineer who excels in automation, cloud infrastructure, and observability solutions. The right candidate will combine technical depth with a proactive mindset to drive system reliability and performance. Location: Hyderabad (Hybrid Role, 2-3 days in office) Experience level: Senior (7 years...
-
Site Reliability Engineer
2 days ago
Thiruvananthapuram, Kerala, India Equifax Full time ₹ 5,00,000 - ₹ 8,00,000 per year Site Reliability Engineering (SRE) at Equifax is a discipline that combines software and systems engineering for building and running large-scale, distributed, fault-tolerant systems. SRE ensures that internal and external services meet or exceed reliability and performance expectations while adhering to Equifax engineering principles. SRE is also an...
-
Reliable Software Engineer
15 hours ago
Thiruvananthapuram, Kerala, India beBeesre Full time ₹ 20,00,000 - ₹ 25,00,000 Senior Site Reliability Engineer: We are seeking a skilled Senior Site Reliability Engineer to join our team. As a critical member of our platform engineering group, you will play a key role in ensuring the reliability and scalability of our SaaS real estate platform.
-
Senior DevOps/Site Reliability Engineer
6 days ago
Thiruvananthapuram, Kerala, India Scoop Technologies Pvt Ltd Full time Job Title: Senior DevOps Engineer / Site Reliability Engineer (SRE) Experience: 5 to 8 Years Location: Thiruvananthapuram (TVM), Kochi, Chennai Job Overview: We are seeking a highly skilled Senior DevOps Engineer / Site Reliability Engineer (SRE) with 5 to 8 years of experience to join our fast-paced and technology-driven environment. The ideal candidate will...
-
Senior Site Reliability Expert
2 weeks ago
Thiruvananthapuram, Kerala, India beBeeTechnical Full time ₹ 20,00,000 - ₹ 25,00,000 Job Title: Site Reliability Engineer - Technical Leader and Problem Solver Key Responsibilities: Investigate and resolve high-impact production issues across infrastructure and applications. Collaborate with development teams to improve performance, reliability, and architecture of systems. Participate in incident response efforts as a technical expert. Develop...