 
Site Reliability Engineer II
4 weeks ago
Senior Site Reliability Engineer (SRE II)

Own availability, latency, performance, and efficiency for Zafin’s SaaS on Azure. You’ll define and enforce reliability standards, lead high-impact projects, mentor engineers, and eliminate toil at scale. Reports to the Director of SRE.

What you’ll do

SLIs/SLOs & contracts: Define customer-centric SLIs/SLOs for Tier-0/Tier-1 services. Publish, review quarterly, and align teams to them.
Error budgeting (policy & tooling):
    Run the error-budget policy with multi-window, multi-burn-rate alerts; clear runbooks and paging thresholds.
    Gate changes by budget status (freeze/relax rules) wired into CI/CD.
    Maintain SLO/EB dashboards (Azure Monitor, Grafana/Prometheus, App Insights). Run weekly SLO reviews with engineering/product.
    Drive roadmap tradeoffs when budgets are at risk; land reliability epics.
Incidents without drama: Lead SEV1/SEV2, own comms, run blameless postmortems, and make corrective actions stick.
Engineer reliability in: Multi-AZ/region patterns (active-active/DR), PDBs/Pod Topology Spread, HPA/VPA/KEDA, resilient rollout/rollback.
AKS at scale: Harden clusters (network, identity, policy), optimize node/pod density, ingress (AGIC/Nginx); mesh optional.
Observability that works: Metrics/traces/logs with Azure Monitor/App Insights, Log Analytics, Prometheus/Grafana, OpenTelemetry. Alert on symptoms, not noise.
IaC & policy: Terraform/Bicep modules, GitOps (Flux/Argo), policy-as-code (Azure Policy/OPA Gatekeeper). No snowflakes.
CI/CD reliability: Azure DevOps/GitHub Actions with canary/blue-green, progressive delivery, auto-rollback, Key Vault-backed secrets.
Capacity & performance: Load testing, right-sizing, autoscaling; partner with FinOps to reduce spend without hurting SLOs.
DR you can trust: Define RTO/RPO, test backups/restore, run game days/chaos drills, validate ASR and multi-region failover.
Secure by default: Entra ID (Azure AD), managed identities, Key Vault rotation, VNets/NSGs/Private Link, shift-left checks in CI.
Reduce toil: Automate recurring ops, build self-service runbooks/chatops, publish golden paths for product teams.
Customer escalations: Be the technical owner on calls; communicate tradeoffs and recovery plans with authority.
Document to scale: Architectures, runbooks, postmortems, SLIs/SLOs, kept current and discoverable.
(If applicable) Streaming/ETL reliability: Apply SRE practices (SLOs, backpressure, idempotency, replay) to NiFi/Flink/Kafka/Redpanda data flows.

Minimum qualifications

Bachelor’s in CS/Engineering (or equivalent experience).
12+ years in production ops/platform/SRE, including 5+ years on Azure.
PostgreSQL (must-have): Deep operational expertise incl. HA/DR, logical/physical replication, performance tuning (indexes/EXPLAIN/ANALYZE, pg_stat_statements), autovacuum strategy, partitioning, backup/restore testing, and connection pooling (PgBouncer). Prefer experience with Azure Database for PostgreSQL - Flexible Server.
Azure core: AKS (must-have); Front Door/App Gateway, API Management, VNets/NSGs/Private Link, Storage, Key Vault, Redis, Service Bus/Event Hubs.
Observability: Azure Monitor/App Insights, Log Analytics, Prometheus/Grafana; SLO design and error-budget operations.
IaC/automation: Terraform and/or Bicep; PowerShell and Python; GitOps (Flux/Argo). Pipelines in Azure DevOps or GitHub Actions.
Proven incident leadership at scale, blameless postmortems, and SLO/error-budget governance with change gating.
Mentorship and crisp written/verbal communication.

Preferred (nice to have)

Apache NiFi, Apache Flink, Apache Kafka or Redpanda (self-managed on AKS or managed equivalents); schema management, exactly-once semantics, backpressure, dead-letter/replay patterns.
Azure Solutions Architect Expert, CKA/CKAD.
ITSM (ServiceNow), on-call tooling (PagerDuty/Opsgenie).
Compliance/SecOps (SOC 2, ISO 27001), policy-as-code, workload identity.
OpenTelemetry, eBPF tooling, or service mesh.
Multi-tenant SaaS and cost optimization at scale.
- 
Site Reliability Engineer II
3 weeks ago
Thiruvananthapuram, India | Zafin | Full time
Senior Site Reliability Engineer (SRE II) Own availability, latency, performance, and efficiency for Zafin’s SaaS on Azure. You’ll define and enforce reliability standards, lead high-impact projects, mentor engineers, and eliminate toil at scale. Reports to the Director of SRE. What you’ll do SLIs/SLOs & contracts: Define customer-centric...
- 
Site Reliability Engineer II
18 hours ago
Thiruvananthapuram, Kerala, India | Zafin | Full time | ₹ 12,00,000 - ₹ 36,00,000 per year
Senior Site Reliability Engineer (SRE II) Own availability, latency, performance, and efficiency for Zafin's SaaS on Azure. You'll define and enforce reliability standards, lead high-impact projects, mentor engineers, and eliminate toil at scale. Reports to the Director of SRE. What you'll do SLIs/SLOs & contracts: Define customer-centric SLIs/SLOs for...
- 
Site Reliability Engineer
3 days ago
Thiruvananthapuram, Kerala, India | UST | Full time | ₹ 12,00,000 - ₹ 36,00,000 per year | 5 - 7 Years | 5 Openings | Trivandrum
Role description: UST Global is seeking a highly skilled Site Reliability Engineer (SRE) to work with one of the leading financial services organizations in the US. This role involves managing the end-to-end application and system stack, ensuring high reliability, scalability, and performance of distributed systems. As an SRE,...
- 
Site Reliability Engineer
17 hours ago
Thiruvananthapuram, Kerala, India | Equifax | Full time | ₹ 12,00,000 - ₹ 36,00,000 per year
Site Reliability Engineering (SRE) at Equifax is a discipline that combines software and systems engineering for building and running large-scale, distributed, fault-tolerant systems. SRE ensures that internal and external services meet or exceed reliability and performance expectations while adhering to Equifax engineering principles. SRE is also an...
- 
Site Reliability Engineer – Technical Architect
3 weeks ago
Thiruvananthapuram, India | Tata Elxsi | Full time
Site Reliability Engineer – Technical Architect. We are looking for experienced professionals to join us as Site Reliability Engineer. If you know someone who fits the bill, refer them to join our growing team. Key Skills & Responsibilities: Proficiency in one or more high-level programming languages: Python, Java, C/C++, Ruby, JavaScript. Experience...
- 
Senior Site Reliability Engineer
2 weeks ago
Thiruvananthapuram, Kerala, India | Equifax | Full time | ₹ 5,00,000 - ₹ 15,00,000 per year
Site Reliability Engineering (SRE) at Equifax is a discipline that combines software and systems engineering for building and running large-scale, distributed, fault-tolerant systems. SRE ensures that internal and external services meet or exceed reliability and performance expectations while adhering to Equifax engineering principles. SRE is also an...
- 
Site Reliability Engineer
4 weeks ago
Thiruvananthapuram / Trivandrum, India | Reflections Info Systems | Full time
Job Description: As a Site Reliability Engineer (SRE), you will be responsible for improving the overall reliability of applications by ensuring their availability, performance, and scalability. You should be able to gather the technical requirements from the DevOps team and the operational requirements from the Application Support team. With the Site Reliability...
- 
DevOps Site Reliability Engineer
4 weeks ago
Thiruvananthapuram, Kerala, India | UST | Full time
Key Skills: DevOps, SRE, any cloud (AWS, Azure, GCP), IaC, Kubernetes. Exp: 7-11 years. Loc: Noida, Pune, Bangalore, Hyderabad, Chennai, Kochi, Trivandrum. Job Summary: The Site Reliability Engineer (SRE) ensures the reliability, availability, and performance of critical systems and services. This role bridges the gap between development and operations teams ...
- 
Senior Site Reliability Engineer
2 weeks ago
Thiruvananthapuram / Trivandrum, India | Equifax | Full time
Job Description: Site Reliability Engineering (SRE) at Equifax is a discipline that combines software and systems engineering for building and running large-scale, distributed, fault-tolerant systems. SRE ensures that internal and external services meet or exceed reliability and performance expectations while adhering to Equifax engineering principles. SRE is...