Senior Site Reliability Engineer

2 days ago


Thaltej, Ahmedabad, Gujarat, India Artem HealthTech Private Limited Full time

About the Role

We are looking for a Senior Site Reliability Engineer (SRE) to lead the reliability strategy of our mission-critical HealthTech SaaS platform. This role is designed for a hands-on engineer who can architect and operate large-scale, high-availability systems, establish a 24×7 SRE practice, and enforce reliability standards through SLAs, SLOs, and error budgets. You will be responsible for ensuring uptime, performance, observability, and seamless deployments for a system serving hospitals, clinicians, and critical healthcare operations.

Key Responsibilities

1. Build & Lead the SRE Practice (24×7 Model)

● Establish a round-the-clock SRE operation with robust on-call processes.

● Define escalation paths, runbooks, SOPs, and reliability governance.

● Mentor and onboard SRE team members to build a high-performing reliability culture.

2. Reliability & Performance Engineering

● Own service uptime, latency, and error rate metrics; ensure adherence to defined SLAs/SLOs.

● Create and manage error budgets, and drive conversations with engineering to maintain reliability.

● Conduct capacity planning, load forecasting, and performance tuning.

3. Observability & Monitoring (Hands-on with RUM/APM)

● Implement and manage tools such as:

○ Real User Monitoring (RUM)

○ APM tools (New Relic, Grafana Tempo, Dynatrace, Datadog, AppDynamics, etc.)

○ Infrastructure monitoring (Prometheus, Grafana, ELK/EFK, CloudWatch/Stackdriver)

● Build dashboards, alerts, tracing flows, synthetic monitoring, and anomaly detection systems.

4. Incident Management & Root Cause Analysis

● Lead major incidents and outages with calm, structured execution.

● Drive after-action reviews using 5 Whys, fishbone diagrams, and RCA documents.

● Collaborate with engineering and DevOps teams to implement preventive fixes.

5. Deployment, Automation & Reliability Tooling

● Improve CI/CD pipelines to ensure safe, predictable deployments.

● Implement:

○ Canary deployments

○ Blue/green deployments

○ Auto-remediation scripting

○ Chaos engineering practice (preferred)

● Automate repeatable operational tasks to reduce toil.

6. Infrastructure & System Architecture

● Work with cloud platforms (AWS/GCP/Azure) to optimize performance and cost.

● Manage:

○ Kubernetes clusters

○ Service meshes

○ Distributed systems

○ Database reliability

● Ensure zero-downtime releases and robust failover strategies.

Required Skills & Experience

Technical Skills

● 8–12 years of SRE/DevOps/Production Engineering experience.

● Strong hands-on experience with RUM & APM tools.

● Deep understanding of:

○ Distributed systems

○ Microservices

○ Containers & Kubernetes

○ Networking fundamentals

○ Load balancers, CDNs, caching layers

● Strong scripting skills (Python, Bash, Go preferred).

● Experience with SQL/NoSQL databases and performance tuning.

● Expertise in observability stacks (Prometheus, Grafana, Loki, Jaeger, Kibana).

SRE Practice Skills

● Proven ability to define and enforce SLA, SLO, SLI frameworks.

● Experience building or scaling 24×7 support models.

● Strong grounding in incident management, change management, and release processes.

● Understanding of security, compliance, and audit readiness, which are important in healthcare (HIPAA/NDHM awareness is a plus).

Soft Skills

● Excellent communication skills; ability to simplify technical issues for leadership.

● Strong ownership, accountability, and customer-centric thinking.

● Ability to coordinate across engineering, DevOps, product, and infrastructure teams.

Nice-to-Have Skills

● Experience with healthcare SaaS or critical systems.

● Knowledge of OpenTelemetry (OTel) instrumentation.

● Chaos engineering tools (LitmusChaos, Gremlin).

● Experience with automation frameworks for alert triage.

What Success Looks Like

● 99.9%+ uptime with measurable SLO tracking.

● Full 24×7 SRE team established with rotation and playbooks.

● Reduction in Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR).

● Predictable and low-risk production deployments.

● Highly observable system with actionable monitoring and automated alerts.

Job Type: Full-time

Pay: From ₹900,000.00 per year

Benefits:

  • Paid time off

Education:

  • Bachelor's (Preferred)

Experience:

  • Production Engineering: 8 years (Preferred)
  • RUM & APM tools: 8 years (Preferred)
  • Python: 8 years (Preferred)
  • SQL/NoSQL databases: 8 years (Preferred)
  • Performance tuning: 8 years (Preferred)
  • Observability stacks: 8 years (Preferred)

Work Location: In person


