
MindCraft Software
2 days ago
SRE (Site Reliability Engineer)
Exp: 5-7 years
Location: Thane

- 5+ years in SRE or DevOps roles supporting high-scale platforms (fintech, OTT, ecommerce, net banking).
- Expertise in maintaining uptime and troubleshooting distributed systems (Redis, Golang, DocDB).
- Strong networking skills, including network and DNS troubleshooting.
- Experience with monitoring/APM tools (Kibana, Grafana, Instana, Dynatrace).
- Hands-on experience with container orchestration on AWS EKS and Red Hat OpenShift.
- Proficiency in CI/CD, cloud infrastructure (AWS/Azure), and infrastructure automation.

Preferred:
- Experience operating highly available, scalable platforms.
- Relevant AWS/Azure or SRE certifications.

We are looking for a proactive Site Reliability Engineer (SRE) to ensure 99.99% uptime for our scalable, multi-tier microservices platform. You will troubleshoot both networking and application uptime issues, supporting seamless service delivery.

Key Responsibilities:
- Maintain strict SLOs (99.99% uptime) across distributed systems including Redis, Golang services, and DocDB.
- Diagnose and resolve complex application and network issues, including DNS failures and network latency.
- Use monitoring and observability tools such as Kibana, Grafana, Instana, and Dynatrace for proactive incident detection.
- Automate infrastructure and workflows with Python, Bash, Terraform, and Ansible.
- Manage container orchestration on AWS Elastic Kubernetes Service (EKS) and Red Hat OpenShift, ensuring high availability and scalability.
- Collaborate with development and QA teams to embed reliability best practices and improve system observability.
- Participate in on-call rotations, incident response, and blameless postmortems.
- Document runbooks and mentor junior engineers on SRE and networking fundamentals.

(ref:hirist.tech)