Site Reliability Engineer

3 weeks ago

Hyderabad, India Talentiser Full time

YOUR IMPACT: Reliability, Automation, and Observability

As a hybrid Site Reliability Engineer/DevOps Engineer, you'll be a key driver in ensuring the stability, performance, and scalability of our mission-critical SaaS platform. You'll apply engineering principles to operational challenges, constantly striving to eliminate toil through automation.

Operational Excellence & Reliability
● Provide day-to-day management of system alerts, check system health, and escalate issues as necessary to maintain high availability.
● Actively participate in a 24x7 on-call rotation for critical SaaS platform incidents, and be available in case of emergencies.
● Lead the incident response process, ensuring fast and effective mitigation and resolution of production issues.
● Perform thorough Root Cause Analysis (RCA) and lead blameless post-mortems to identify systemic weaknesses and create a corrective action plan to prevent recurrence.
● Collaborate with engineering teams to set and enforce error budgets (derived from SLOs, or Service Level Objectives), ensuring a healthy balance between development speed and system stability.

Platform Automation & Infrastructure Development
● Automate routine operational tasks to reduce manual effort and "toil" and increase overall team efficiency.
● Design, deploy, and maintain cloud infrastructure using Infrastructure as Code (IaC), specifically leveraging Terraform and Helm for deployment to EKS/K8s clusters.
● Improve existing infrastructure health by developing and implementing checks and scripts to proactively correct known issues and self-heal the platform.
● Maintain, develop, and evolve our Continuous Integration/Continuous Delivery (CI/CD) deployment code and pipelines.
● Learn and maintain existing infrastructure running under Docker and Docker Swarm while driving migration strategies toward EKS/K8s.
● Implement and integrate new technologies and services into our Cloud Infrastructure to enhance platform capabilities and resilience.

Monitoring & Observability
● Design and implement comprehensive Observability strategies across all three pillars: Metrics, Logs, and Traces.
● Proactively create and refine robust monitoring and alerting configurations within the EKS/K8s ecosystem.
● Utilize and maintain our Observability platform, Datadog, to gather performance data, create complex synthetic tests, and visualize system health via dashboards.
● Leverage existing monitoring solutions such as Grafana and Prometheus while planning and executing the migration or integration of data into a unified platform.
● Document all issues, remediation steps, system architecture, and runbooks to facilitate knowledge transfer and rapid incident response.
● Collaborate closely with Support, Customer Success, Migration, and Professional Services teams to provide the highest level of SaaS service and minimize customer impact during changes.
● Apply a real customer focus when planning deployments/updates, always considering the impact on the end-user before making changes.

YOUR EXPERIENCE: Essential Skills and Qualifications
● Hands-on AWS Cloud Engineer experience, with expert working knowledge of the AWS Cloud ecosystem, including a good understanding of AWS IAM roles and policies.
● Proficiency with container orchestration technologies: EKS/Kubernetes (K8s).
● Demonstrable experience with Infrastructure as Code (IaC) tools, specifically Terraform and Helm.
● Working experience with Docker and maintaining systems using Docker Swarm.
● Expertise in setting up and managing logging and monitoring solutions. Direct experience with Datadog is highly preferred, with experience in setting up APM, infrastructure monitoring, and custom dashboards.
● Experience with existing monitoring solutions such as Grafana and Prometheus is required.
● Proficient in a Linux environment and strong skills in Bash and/or Python scripting for automation and troubleshooting.
● A strong understanding of web technologies, including REST APIs, Systems Architecture, Design, and Databases.
● Experience in Product/Application Support for high-availability SaaS-based products.
● Experience in designing, implementing, and operating in a DevSecOps environment.
● Excellent oral and written communication skills, with the ability to clearly explain complex technical issues and RCAs to both technical and customer-facing audiences.

Site Reliability Engineer

4 weeks ago

Hyderabad, India SID Global Solutions Full time

Job Role: Site Reliability Engineer (SRE) – GCP Experience: 3+ years Location: Hyderabad About SIDGS: SIDGS is a premium global systems integrator and global implementation partner of Google, providing Digital Solutions & Services to Fortune 500 companies. Our Digital solutions span the following domains: User Experience, CMS, API Management,...
Site Reliability Engineer

11 hours ago

Hyderabad, India Whatjobs IN C2 Full time

Job Title: Site Reliability Engineer (SRE) | Fintech | Kubernetes | Datadog | 24/7 Support Department: Site Reliability Engineering Location: Hyderabad, India Employment Type: Full-Time Notice period: 0-15 Days We’re hiring a Site Reliability Engineer to join our SRE team focused on maintaining the performance, reliability, and availability of our fintech...
Site Reliability Engineer

1 week ago

Hyderabad, India Whatjobs IN C2 Full time

Job Role: Site Reliability Engineer (SRE) – GCP Experience: 3+ years Location: Hyderabad About SIDGS: SIDGS is a premium global systems integrator and global implementation partner of Google, providing Digital Solutions & Services to Fortune 500 companies. Our Digital solutions span the following domains: User Experience, CMS, API Management,...
Engineer, Site Reliability

4 weeks ago

Hyderabad, India ANSR Full time

ANSR is hiring for one of its clients. About T-Mobile: T-Mobile US, Inc. (NASDAQ: TMUS), headquartered in Bellevue, Washington, is America’s supercharged Un-carrier, connecting millions through its strong nationwide network and flagship brands, T-Mobile and Metro by T-Mobile. Customers benefit from an unmatched combination of value, quality, and...
Engineer, Site Reliability

4 weeks ago

Hyderabad, India ANSR Full time

About T-Mobile: T-Mobile US, Inc. (NASDAQ: TMUS), headquartered in Bellevue, Washington, is America’s supercharged Un-carrier, connecting millions through its strong nationwide network and flagship brands, T-Mobile and Metro by T-Mobile. Customers benefit from an unmatched combination of value, quality, and exceptional service experience. About TMUS Global...
Lead Site Reliability Engineer

4 days ago

Hyderabad, India Whatjobs IN C2 Full time

Job Description: We are seeking a highly skilled and motivated Site Reliability Engineer (SRE) to join our growing team. Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience. 7+ years’ experience in Site Reliability, successfully deploying and managing large-scale distributed systems. Understanding of SRE concepts (error...
Site Reliability Engineer

2 weeks ago

Hyderabad, India Whatjobs IN C2 Full time

About the Role We are seeking an experienced Site Reliability / Azure DevOps Engineer with Dynatrace Experience to join our engineering team and contribute to scalable CI/CD practices, infrastructure automation, and cloud operations. The ideal candidate will have deep expertise in Azure DevOps, Infrastructure as Code (IaC), Azure services, and modern DevOps...
Site Reliability Engineer

1 week ago

Hyderabad, India SID Global Solutions Full time

Job Role: Site Reliability Engineer (SRE) – GCP Experience: 3+ years Location: Hyderabad About SIDGS: SIDGS is a premium global systems integrator and global implementation partner of Google corporation, providing Digital Solutions & Services to Fortune 500 companies. Our Digital solutions go across following domains: User Experience, CMS, API Management,...
Lead Site Reliability Engineer

4 weeks ago

Hyderabad, India S&P Global Full time

Job Description About The Role Grade Level (for internal use): 11 Job Title: Senior Site Reliability Engineer Role Overview As a Site Reliability Engineer at ChartIQ, you'll play a critical role not only in building, maintaining, and scaling the infrastructure that supports our Development and QA needs, but also in driving new, exciting...
