Senior Site Reliability Engineer
2 days ago
About the Role
We are looking for a Senior Site Reliability Engineer (SRE) to lead the reliability strategy of our mission-critical HealthTech SaaS platform. This role is designed for a hands-on engineer who can architect and operate large-scale, high-availability systems, establish a 24×7 SRE practice, and enforce reliability standards through SLAs, SLOs, and error budgets. You will be responsible for ensuring uptime, performance, observability, and seamless deployments for a system serving hospitals, clinicians, and critical healthcare operations.
Key Responsibilities
1. Build & Lead the SRE Practice (24×7 Model)
● Establish a round-the-clock SRE operation with robust on-call processes.
● Define escalation paths, runbooks, SOPs, and reliability governance.
● Mentor and onboard SRE team members to build a high-performing reliability culture.
2. Reliability & Performance Engineering
● Own service uptime, latency, and error rate metrics; ensure adherence to defined SLAs/SLOs.
● Create and manage error budgets, and use them to drive conversations with engineering about maintaining reliability.
● Conduct capacity planning, load forecasting, and performance tuning.
3. Observability & Monitoring (Hands-on with RUM/APM)
● Implement and manage tools such as:
○ Real User Monitoring (RUM)
○ APM tools (New Relic, Grafana Tempo, Dynatrace, Datadog, AppDynamics, etc.)
○ Infrastructure monitoring (Prometheus, Grafana, ELK/EFK, CloudWatch/Stackdriver)
● Build dashboards, alerts, tracing flows, synthetic monitoring, and anomaly detection systems.
4. Incident Management & Root Cause Analysis
● Lead major incidents and outages with calm, structured execution.
● Drive after-action reviews using 5-Whys and fishbone analysis, and document the results in RCA reports.
● Collaborate with engineering and DevOps teams to implement preventive fixes.
5. Deployment, Automation & Reliability Tooling
● Improve CI/CD pipelines to ensure safe, predictable deployments.
● Implement:
○ Canary deployments
○ Blue/green deployments
○ Auto-remediation scripting
○ Chaos engineering practice (preferred)
● Automate repeatable operational tasks to reduce toil.
6. Infrastructure & System Architecture
● Work with cloud platforms (AWS/GCP/Azure) to optimize performance and cost.
● Manage:
○ Kubernetes clusters
○ Service meshes
○ Distributed systems
○ Database reliability
● Ensure zero-downtime releases and robust failover strategies.
Required Skills & Experience
Technical Skills
● 8–12 years of SRE/DevOps/Production Engineering experience.
● Strong hands-on experience with RUM & APM tools.
● Deep understanding of:
○ Distributed systems
○ Microservices
○ Containers & Kubernetes
○ Networking fundamentals
○ Load balancers, CDNs, caching layers
● Strong scripting skills (Python, Bash, Go preferred).
● Experience with SQL/NoSQL databases and performance tuning.
● Expertise in observability stacks (Prometheus, Grafana, Loki, Jaeger, Kibana).
SRE Practice Skills
● Proven ability to define and enforce SLA, SLO, SLI frameworks.
● Experience building or scaling 24×7 support models.
● Strong grounding in incident management, change management, and release processes.
● Understanding of security, compliance, and audit readiness, which are important for healthcare (HIPAA/NDHM awareness is a plus).
Soft Skills
● Excellent communication skills; ability to simplify technical issues for leadership.
● Strong ownership, accountability, and customer-centric thinking.
● Ability to coordinate across engineering, DevOps, product, and infrastructure teams.
Nice-to-Have Skills
● Experience with healthcare SaaS or critical systems.
● Knowledge of OTEL (OpenTelemetry) instrumentation.
● Chaos engineering tools (LitmusChaos, Gremlin).
● Experience with automation frameworks for alert triage.
What Success Looks Like
● 99.9%+ uptime with measurable SLO tracking.
● Full 24×7 SRE team established with rotation and playbooks.
● Reduction in Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR).
● Predictable and low-risk production deployments.
● Highly observable system with actionable monitoring and automated alerts.
Job Type: Full-time
Pay: From ₹900,000.00 per year
Benefits:
- Paid time off
Education:
- Bachelor's (Preferred)
Experience:
- Production Engineering: 8 years (Preferred)
- RUM & APM tools: 8 years (Preferred)
- Python: 8 years (Preferred)
- SQL/NoSQL databases: 8 years (Preferred)
- Performance tuning: 8 years (Preferred)
- Observability stacks: 8 years (Preferred)
Work Location: In person