Site Reliability Engineer
3 weeks ago
YOUR IMPACT: Reliability, Automation, and Observability

As a hybrid Site Reliability Engineer/DevOps Engineer, you'll be a key driver in ensuring the stability, performance, and scalability of our mission-critical SaaS platform. You'll apply engineering principles to operational challenges, constantly striving to eliminate toil through automation.

Operational Excellence & Reliability
● Provide day-to-day management of system alerts, check system health, and escalate issues as necessary to maintain high availability.
● Actively participate in a 24x7 on-call rotation for critical SaaS platform incidents, and be available in case of emergencies.
● Lead the incident response process, ensuring fast and effective mitigation and resolution of production issues.
● Perform thorough Root Cause Analysis (RCA) and lead blameless post-mortems to identify systemic weaknesses and create a corrective action plan to prevent recurrence.
● Collaborate with engineering teams to set and enforce error budgets (derived from SLOs, or Service Level Objectives), ensuring a healthy balance between development speed and system stability.

Platform Automation & Infrastructure Development
● Automate routine operational tasks to reduce manual effort and "toil" and increase overall team efficiency.
● Design, deploy, and maintain cloud infrastructure using Infrastructure as Code (IaC), specifically leveraging Terraform and Helm for deployment to EKS/K8s clusters.
● Improve existing infrastructure health by developing and implementing checks and scripts to proactively correct known issues and self-heal the platform.
● Maintain, develop, and evolve our Continuous Integration/Continuous Delivery (CI/CD) deployment code and pipelines.
● Learn and maintain existing infrastructure running under Docker and Docker Swarm while driving migration strategies toward EKS/K8s.
● Implement and integrate new technologies and services into our Cloud Infrastructure to enhance platform capabilities and resilience.

Monitoring & Observability
● Design and implement comprehensive Observability strategies across all three pillars: Metrics, Logs, and Traces.
● Proactively create and refine robust monitoring and alerting configurations within the EKS/K8s ecosystem.
● Utilize and maintain our Observability platform, Datadog, to gather performance data, create complex synthetic tests, and visualize system health via dashboards.
● Leverage existing monitoring solutions such as Grafana and Prometheus while planning and executing the migration or integration of data into a unified platform.
● Document all issues, remediation steps, system architecture, and runbooks to facilitate knowledge transfer and rapid incident response.
● Collaborate closely with Support, Customer Success, Migration, and Professional Services teams to provide the highest level of SaaS service and minimize customer impact during changes.
● Apply a real customer focus when planning deployments/updates, always considering the impact on the end-user before making changes.

YOUR EXPERIENCE: Essential Skills and Qualifications
● Hands-on AWS Cloud Engineer experience, with expert working knowledge of the AWS Cloud ecosystem, including a good understanding of AWS IAM roles and policies.
● Proficiency with container orchestration technologies: EKS/Kubernetes (K8s).
● Demonstrable experience with Infrastructure as Code (IaC) tools, specifically Terraform and Helm.
● Working experience with Docker and maintaining systems using Docker Swarm.
● Expertise in setting up and managing logging and monitoring solutions. Direct experience with Datadog is highly preferred, with experience in setting up APM, infrastructure monitoring, and custom dashboards.
● Experience with existing monitoring solutions such as Grafana and Prometheus is required.
● Proficient in a Linux environment and strong skills in Bash and/or Python scripting for automation and troubleshooting.
● A strong understanding of web technologies, including REST APIs, Systems Architecture, Design, and Databases.
● Experience in Product/Application Support for high-availability SaaS-based products.
● Experience in designing, implementing, and operating in a DevSecOps environment.
● Excellent oral and written communication skills, with the ability to clearly explain complex technical issues and RCAs to both technical and customer-facing audiences.
-
Site Reliability Engineer
3 weeks ago
Hyderabad, Telangana, India | Sonata Software | Full time
Role: Site Reliability Engineer. Location: Hyderabad. Notice Period: Immediate to 20 Days. Employment Type: Full Time. Experience: 7–12 years in site reliability, cloud-based data infrastructure, data pipeline observability, automation, and high-availability engineering within EdTech platforms (2U). Primary Skills (Must-Have): AWS, CI/CD, Jenkins, IAAC, Terraform,...
-
Site Reliability Engineer
3 weeks ago
Hyderabad, Telangana, India | Sonata Software | Full time
Role: Site Reliability Engineer (SRE) III – Data Engineering. Location: Hyderabad. Employment Type: Full Time. Experience: 7–12 years in site reliability, cloud-based data infrastructure, data pipeline observability, automation, and high-availability engineering within EdTech platforms (2U). Primary Skills (Must-Have): AWS, CI/CD, Jenkins, IAAC,...
-
Site Reliability Engineer
3 weeks ago
Hyderabad, Telangana, India | Insight Global | Full time
Job Description: Title: Site Reliability Engineer. Location: Hyderabad (4 days onsite and 1 day remote). Required Skills & Experience: Bachelor's degree in Computer Science, Engineering, or related field; 5+ years of experience in SRE or related roles; proficiency in Python and experience with Kubernetes and Kafka; experience with Ignition SCADA and RESTful APIs; Strong...
-
Site Reliability Engineer
6 days ago
Hyderabad, Telangana, India | Oracle Financial Services Software Ltd | Full time | ₹ 12,00,000 - ₹ 36,00,000 per year
Principal Site Reliability Engineer. Oracle is seeking a motivated Principal Site Reliability Engineer who thrives in a fast-paced, rapidly evolving technology environment. This position requires broad knowledge of Linux administration, AI technologies, software development, cloud computing, networking, cloud security, performance analysis and...
-
Site Reliability Engineer
2 hours ago
Hyderabad, Telangana, India | Oracle Financial Services Software Ltd | Full time | ₹ 12,00,000 - ₹ 36,00,000 per year
Principal Site Reliability Engineer. Oracle is seeking a motivated Principal Site Reliability Engineer who thrives in a fast-paced, rapidly evolving technology environment. This position requires broad knowledge of Mainframe zLinux, DB2, zVM, and AIX. The Site Reliability Engineer is expected to work with multiple service and product development teams,...
-
Site Reliability Engineer
2 weeks ago
Hyderabad, Telangana, India | Jigya Software Services | Full time | ₹ 1,50,000 - ₹ 28,00,000 per year
Job Title: Senior Site Reliability Engineer (SRE) - AWS/Kubernetes. Location: Hyderabad - Onsite. Job Type: Full-Time. About the Role: We are looking for a highly skilled and motivated Site Reliability Engineer to design, build, and maintain our high-performance, scalable cloud infrastructure. You will play a critical role in ensuring the reliability, performance, and...
-
Site Reliability Engineer
3 hours ago
Hyderabad, Telangana, India | Technology Next | Full time | ₹ 20,00,000 - ₹ 30,00,000 per year
Urgently hiring for Site Reliability Engineer (SRE) / Chaos Engineer. Location: Hyderabad. Job Type: Full-time, Permanent. Job Description: We are looking for an experienced Site Reliability Engineer (SRE) with strong Python automation skills (Boto3 required) and hands-on experience in chaos engineering to improve system reliability and resilience. The ideal...
-
Site Reliability Engineer
6 days ago
Hyderabad, Telangana, India | SMARTWORK IT SERVICES | Full time | ₹ 12,00,000 - ₹ 24,00,000 per year
Description: Role: Site Reliability Engineer (SRE). Location: Hyderabad. Experience: 10 to 15 Years. Job Summary: The Site Reliability Engineer (SRE) will play a critical role in ensuring the reliability, scalability, and performance of Citizens Bank's enterprise systems and cloud environments. The ideal candidate brings deep technical...
-
Site Reliability Engineer
18 minutes ago
Hyderabad, Telangana, India | Apple | Full time | ₹ 15,00,000 - ₹ 25,00,000 per year
Imagine what you could do here. Apple is a place where extraordinary people gather to do their best work. Together we craft products and experiences people once couldn't have imagined, and now can't imagine living without. If you're motivated by the idea of making a real impact, and joining a team where we pride ourselves in being one of the most diverse...
-
Site Reliability Engineer
2 weeks ago
Hyderabad, Telangana, India | Evalify-IQ | Full time | ₹ 6,00,000 - ₹ 18,00,000 per year
Skills Required: AWS, Azure, Terraform, CloudFormation, Pulumi, CICD, GitHub Actions, GitLab CI, Jenkins, ArgoCD, Prometheus, Splunk, Grafana, Cloudwatch, Datadog, SRE, Site Reliability, Python, Powershell, Shell, Go, Kubernetes, Docker, Performance Tuning, Performance Enhancements, Performance Enhancement, Performance. Experience Range: 2 - 5...