
Site Reliability Engineer
1 week ago
About Us: Kshema General Insurance is a leading innovator in crop insurance. We are building scalable, reliable, and high-performance cloud-native applications on Microsoft Azure. We are seeking a talented and passionate Site Reliability Engineer (SRE) to join our team, focusing on establishing robust observability with OpenTelemetry and driving operational excellence across our Azure infrastructure.
Role Overview: As an SRE with OpenTelemetry and Azure expertise, you will play a critical role in ensuring the availability, performance, and scalability of our production systems. You will be responsible for designing, implementing, and maintaining our observability stack using OpenTelemetry standards, integrating it seamlessly with Azure services, and applying SRE principles to build resilient and efficient systems. You will work closely with development teams to embed reliability from the ground up, automate operational tasks, and respond to incidents with speed and precision.
Key Responsibilities:
OTEL Monitoring Setup & Observability:
- Design, implement, and manage a comprehensive observability platform using OpenTelemetry for distributed tracing, metrics, and logs across our microservices and applications.
- Ensure full instrumentation of applications (e.g., Java, Python, Node.js) to capture end-to-end telemetry data.
- Configure and optimize OpenTelemetry Collectors to receive, process, and export telemetry data to various backends (e.g., Prometheus, Grafana, Application Insights, Jaeger, Loki, Tempo, and Azure Monitor).
- Develop custom instrumentation and semantic conventions to enhance monitoring capabilities and provide deeper insights into application behavior.
- Establish robust alerting and anomaly detection based on OpenTelemetry signals, utilizing tools like Azure Monitor, Prometheus Alertmanager, or similar.
- Create informative and actionable dashboards (e.g., Grafana, Azure Dashboards) for real-time system insights, performance monitoring, and incident response.
- Continuously evaluate and integrate new OpenTelemetry features and best practices to improve our observability posture.
Azure SRE Capabilities:
- Reliability & Performance Engineering: Monitor system performance, reliability, and availability metrics across Azure services. Identify bottlenecks, anticipate scaling needs, and implement strategies to reduce downtime and improve performance.
- Incident Management & Response: Participate in on-call rotations, lead incident response efforts, conduct thorough root cause analysis (RCA), and implement preventative measures to minimize recurrence. Develop and maintain runbooks and playbooks for effective incident resolution.
- Automation & Infrastructure as Code (IaC): Automate repetitive operational tasks, deployments, and infrastructure provisioning using Azure DevOps, Terraform, Azure Bicep, PowerShell, or Bash scripting.
- CI/CD Integration: Integrate observability checks and validation steps into CI/CD pipelines to ensure the reliability and performance of new releases.
- Capacity Planning & Cost Optimization: Conduct capacity planning, analyze usage patterns, and optimize Azure resources for cost efficiency, performance, and scalability.
- Security & Compliance: Implement and enforce security best practices within Azure environments, collaborate with security teams, and ensure adherence to relevant compliance standards.
- Collaboration & Mentorship: Work closely with development teams to foster a culture of reliability, provide guidance on observability best practices, and share knowledge across the organization.
Required Skills and Experience:
- 5+ years of experience in a Site Reliability Engineering (SRE), DevOps, or similar infrastructure-focused role.
- Deep practical experience with OpenTelemetry (OTEL) for instrumenting, collecting, processing, and exporting traces, metrics, and logs.
- Strong proficiency in Azure cloud services and their monitoring capabilities (Azure Monitor, Log Analytics, Application Insights).
- Hands-on experience with Infrastructure as Code (IaC) tools such as Terraform, Azure Bicep, or ARM templates.
- Solid scripting and automation skills (e.g., Python, PowerShell, Bash).
- Experience with containerization technologies (Docker) and orchestration platforms (Kubernetes/AKS).
- Expertise with various observability backends like Grafana, Alloy, Loki, Tempo, Prometheus, Jaeger.
- Strong understanding of distributed systems, microservices architectures, and cloud-native principles.
- Excellent problem-solving, analytical, and troubleshooting skills.
- Strong communication and collaboration abilities.
Preferred Qualifications:
- Azure certifications (e.g., AZ-104 Azure Administrator, AZ-400 Azure DevOps Engineer Expert).
- Experience with chaos engineering practices.
- Understanding of Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets.
- Familiarity with database monitoring (e.g., PostgreSQL, Azure SQL).
- Experience in a high-availability, regulated, or customer-facing environment.
Education:
Bachelor's degree in Computer Science, Information Technology, or a related technical field, or equivalent practical experience.
-
Site Reliability Engineer
2 days ago
Hyderabad, Telangana, India; Talent Worx; Full time; ₹ 9,00,000 - ₹ 12,00,000 per year
Site Reliability Engineer (SRE). At Talent Worx, we are looking for a dedicated Site Reliability Engineer (SRE) to join our team. This role involves maintaining high availability and reliability of our services through the application of software engineering practices and systems administration skills. The ideal candidate will bridge the gap between...
-
Site Reliability Engineer
3 weeks ago
Hyderabad, Telangana, India; Talent Worx; Full time
Talent Worx is seeking a talented SRE (Site Reliability Engineer) to enhance our technology team. In this role, you will be pivotal in ensuring the reliability, performance, and availability of our applications and services. Your work will involve both software engineering and systems operations as you strive to improve customer experiences and operational...
-
Site Reliability Engineer
3 days ago
Hyderabad, Telangana, India; IntraEdge; Full time
Site Reliability Engineer. Experience: 7+ Years. Location: Hyderabad. Hybrid: 4 days in office and 1 day remote. Skills for Principal: Strong leadership and people management skills. Exceptional technical proficiency in Pearson's technology stack. Advanced project management capabilities. Excellent communication and collaboration skills. Adept at risk assessment and crisis...
-
Site Reliability Engineer
4 weeks ago
Hyderabad, Telangana, India; IntraEdge; Full time
Position: SRE (Site Reliability Engineer). Experience: 5+ Years. Location: Hyderabad. Skills for Principal: Strong leadership and people management skills. Exceptional technical proficiency in Pearson's technology stack. Advanced project management capabilities. Excellent communication and collaboration skills. Adept at risk assessment and crisis...
-
Site Reliability Engineer
1 day ago
Hyderabad, Telangana, India; IntraEdge; Full time
Site Reliability Engineer. Experience: 7+ Years. Location: Hyderabad. Hybrid: 4 days in office and 1 day remote. Skills for Principal: Strong leadership and people management skills. Exceptional technical proficiency in Pearson's technology stack. Advanced project management capabilities. Excellent communication and collaboration skills. Adept at risk assessment and...
-
SRE (Site Reliability Engineer)
2 days ago
Hyderabad, Telangana, India; Talent Worx; Full time; ₹ 15,00,000 - ₹ 20,00,000 per year
SRE (Site Reliability Engineer). Talent Worx is seeking a talented SRE (Site Reliability Engineer) to enhance our technology team. In this role, you will be pivotal in ensuring the reliability, performance, and availability of our applications and services. Your work will involve both software engineering and systems operations as you strive to improve...
-
Site Reliability Engineer
4 weeks ago
Hyderabad, Telangana, India; VXI Global Solutions; Full time
We are seeking a skilled Site Reliability Engineer with 4 to 8 years of experience to design, implement, and manage robust observability solutions across our cloud infrastructure and applications. The ideal candidate will have hands-on experience with Prometheus, Grafana, Google Cloud Monitoring, and OpenTelemetry, along with exposure to SolarWinds. You...
-
Lead Site Reliability Engineer
3 days ago
Hyderabad, Telangana, India; JP Morgan Chase & Co.; Full time
Job Description: Assume a critical role in defining the future of a globally recognized firm and have a direct and significant effect in a realm tailored for top achievers in site reliability. As a Lead Site Reliability Engineer at JPMorgan Chase within the Consumer & Community Banking Team, you will take the lead in conducting resiliency design reviews, break...
-
Site Reliability Engineer III
13 hours ago
Hyderabad, Telangana, India; Chase Bank; Full time
Job Description: There's nothing more exciting than being at the center of a rapidly growing field in technology and applying your skillsets to drive innovation and modernize the world's most complex and mission-critical systems. As a Site Reliability Engineer III at JPMorgan Chase within Consumer and Community Banking, you will solve complex and broad...
-
Site Reliability Engineer
3 weeks ago
Hyderabad, Telangana, India; Kshema General Insurance Limited; Full time
About Us: Kshema General Insurance is a leading innovator in crop insurance. We are building scalable, reliable, and high-performance cloud-native applications on Microsoft Azure. We are seeking a talented and passionate Site Reliability Engineer (SRE) to join our team, focusing on establishing robust observability with OpenTelemetry and driving operational...