
Site Reliability Engineer II
3 days ago
Senior Site Reliability Engineer (SRE II)
Own availability, latency, performance, and efficiency for Zafin’s SaaS on Azure. You’ll define and enforce reliability standards, lead high-impact projects, mentor engineers, and eliminate toil at scale. Reports to the Director of SRE.
What you’ll do
- SLIs/SLOs & contracts: Define customer-centric SLIs/SLOs for Tier-0/Tier-1 services. Publish them, review them quarterly, and align teams to them.
- Error budgeting (policy & tooling):
  - Run the error-budget policy with multi-window, multi-burn-rate alerts; clear runbooks and paging thresholds.
  - Gate changes by budget status (freeze/relax rules) wired into CI/CD.
  - Maintain SLO/EB dashboards (Azure Monitor, Grafana/Prometheus, App Insights). Run weekly SLO reviews with engineering/product.
  - Drive roadmap tradeoffs when budgets are at risk; land reliability epics.
- Incidents without drama: Lead SEV1/SEV2, own comms, run blameless postmortems, and make corrective actions stick.
- Engineer reliability in: Multi-AZ/region patterns (active-active/DR), PDBs/Pod Topology Spread, HPA/VPA/KEDA, resilient rollout/rollback.
- AKS at scale: Harden clusters (network, identity, policy); optimize node/pod density and ingress (AGIC/NGINX); service mesh optional.
- Observability that works: Metrics/traces/logs with Azure Monitor/App Insights, Log Analytics, Prometheus/Grafana, OpenTelemetry. Alert on symptoms, not noise.
- IaC & policy: Terraform/Bicep modules, GitOps (Flux/Argo), policy-as-code (Azure Policy/OPA Gatekeeper). No snowflakes.
- CI/CD reliability: Azure DevOps/GitHub Actions with canary/blue-green, progressive delivery, auto-rollback, Key Vault-backed secrets.
- Capacity & performance: Load testing, right-sizing, autoscaling; partner with FinOps to reduce spend without hurting SLOs.
- DR you can trust: Define RTO/RPO, test backups/restore, run game days/chaos drills, validate ASR and multi-region failover.
- Secure by default: Entra ID (Azure AD), managed identities, Key Vault rotation, VNets/NSGs/Private Link, shift-left checks in CI.
- Reduce toil: Automate recurring ops, build self-service runbooks/chatops, publish golden paths for product teams.
- Customer escalations: Be the technical owner on calls; communicate tradeoffs and recovery plans with authority.
- Document to scale: Architectures, runbooks, postmortems, SLIs/SLOs—kept current and discoverable.
- (If applicable) Streaming/ETL reliability: Apply SRE practices (SLOs, backpressure, idempotency, replay) to NiFi/Flink/Kafka/Redpanda data flows.
Minimum qualifications
- Bachelor’s in CS/Engineering (or equivalent experience).
- 12+ years in production ops/platform/SRE, including 5+ years on Azure.
- PostgreSQL (must-have): Deep operational expertise incl. HA/DR, logical/physical replication, performance tuning (indexes/EXPLAIN/ANALYZE, pg_stat_statements), autovacuum strategy, partitioning, backup/restore testing, and connection pooling (pgBouncer). Prefer experience with Azure Database for PostgreSQL – Flexible Server.
- Azure core: AKS (must-have); Front Door/App Gateway, API Management, VNets/NSGs/Private Link, Storage, Key Vault, Redis, Service Bus/Event Hubs.
- Observability: Azure Monitor/App Insights, Log Analytics, Prometheus/Grafana; SLO design and error-budget operations.
- IaC/automation: Terraform and/or Bicep; PowerShell and Python; GitOps (Flux/Argo). Pipelines in Azure DevOps or GitHub Actions.
- Proven incident leadership at scale, blameless postmortems, and SLO/error-budget governance with change gating.
- Mentorship and crisp written/verbal communication.
Preferred (nice to have)
- Apache NiFi, Apache Flink, Apache Kafka or Redpanda (self-managed on AKS or managed equivalents); schema management, exactly-once semantics, backpressure, dead-letter/replay patterns.
- Azure Solutions Architect Expert, CKA/CKAD.
- ITSM (ServiceNow), on-call tooling (PagerDuty/Opsgenie).
- Compliance/SecOps (SOC 2, ISO 27001), policy-as-code, workload identity.
- OpenTelemetry, eBPF tooling, or service mesh.
- Multi-tenant SaaS and cost optimization at scale.