Site Reliability Engineering Manager

2 weeks ago


Mumbai, India Netcore Cloud Full time

Job Title: Manager of SRE (Site Reliability Engineering) & Application Support

Location: Thane

Reports to: Sr. VP Delivery Head

Department: Engineering; Full-Time


About us:

At Netcore, innovation isn't just a buzzword; it's the core of everything we do. As the pioneering force behind the first and leading AI/ML-powered Customer Engagement and Experience (CEE) Platform, we're dedicated to revolutionizing how B2C brands interact with their customers. Our state-of-the-art SaaS products are designed to foster personalized engagement throughout the entire customer journey, creating remarkable digital experiences for businesses of all sizes.

Engineering at Netcore: Dive into a world where your work directly impacts engagement, conversions, revenue, and customer retention. Our engineering team tackles the complex challenges that come with scaling high-performance systems. We thrive on versatility and speed, employing technologies such as Kafka, Storm, RabbitMQ, Celery, RedisQ, and Golang, all hosted on AWS and GCP. At Netcore, you're not just solving technical problems; you're setting industry benchmarks.


Job Summary:

We are seeking a seasoned leader for our SRE & Application Support division to oversee the reliability, scalability, and efficient operation of our martech tools built on open-source frameworks. This role will play a key part in maintaining the operational stability of our products on Netcore Cloud's infrastructure, ensuring 24/7 availability, and driving incident management.

The ideal candidate will combine strong leadership abilities with a deep understanding of site reliability, automation, performance monitoring, and application support, delivering world-class service to our clients and partners.


Key Responsibilities:

SRE Leadership & Strategy:

- Lead the Site Reliability Engineering (SRE) team to design and implement robust systems ensuring uptime, scalability, and security.

- Develop and maintain strategies for high availability, disaster recovery, and capacity planning for all martech tools.

- Advocate and apply the principles of automation to eliminate repetitive tasks and improve efficiency.

- Establish and refine Service Level Objectives (SLOs) and Service Level Agreements (SLAs) in collaboration with product and engineering teams.

Application Support:

- Oversee and lead the Application Support Team responsible for maintaining the health and performance of customer-facing applications built on the Netcore Cloud platform.

- Develop processes and debugging procedures to ensure quick resolution of technical issues, and serve as an escalation point for critical incidents.

- Ensure all incidents are triaged and handled efficiently, with proper root cause analysis and follow-up post-mortems for critical incidents.

- Manage the implementation of monitoring tools and log management systems to detect, alert, and respond to potential issues proactively.

Collaboration and Cross-Functional Leadership:

- Work closely with Sales, CSM, Customer Support, development, QA, and DevOps teams.

- Collaborate with stakeholders to drive a culture of continuous improvement by identifying and eliminating potential risks and issues in the system.

- Be involved in PI (Program Increment) planning to align with product roadmaps, making sure reliability is factored into new feature development.

Team Management & Development:

- Recruit, mentor, and manage the SRE and Application Support Team, fostering a high-performance and collaborative environment.

- Conduct regular performance reviews, provide feedback, and support professional development within the team.

Innovation and Open-Source Contribution:

- Lead initiatives to improve the open-source frameworks utilized in the martech stack, contributing to the open-source community as needed.

- Stay current with emerging technologies, tools, and best practices in site reliability, automation, and application support.


Requirements:

Experience:

- 8+ years of experience in SRE, DevOps, or Application Support roles, with at least 3 years in a leadership position.

- Proven track record of managing systems built on open-source frameworks and cloud platforms such as Netcore Cloud or similar.

- Demonstrated expertise in incident management, post-mortem analysis, and improving mean time to recovery (MTTR).

- Strong experience with monitoring tools (Prometheus, Grafana, or similar), logging frameworks, and automation tools (Terraform, Ansible).

Technical Skills:

- Hands-on experience with Linux/Unix environments and cloud services (AWS, GCP, Netcore Cloud).

- Proficiency in scripting and coding (Python, PHP, Golang, Java, or similar languages) for automation purposes.

- Solid understanding of CI/CD pipelines, version control (Git), and alerting and application monitoring tools.

Leadership & Soft Skills:

- Proven leadership skills, with experience in team building, mentorship, and fostering a culture of accountability.

- Strong interpersonal and communication skills, with the ability to interface effectively with technical and non-technical stakeholders.

- Ability to manage multiple projects simultaneously, prioritize tasks, and work under pressure to meet deadlines.


Preferred Qualifications:

- Experience in the martech or digital marketing domain, or with large-scale, customer-facing SaaS applications.

- Certification in SRE, DevOps, or cloud platforms (AWS, GCP).

- Strong application debugging skills and a good understanding of product features.


Why Join Us?

- Be a part of an innovative and forward-thinking organization that values technology and continuous improvement.

- Work with cutting-edge open-source frameworks, cloud technologies, and a SaaS product.

- Leadership opportunities with a direct impact on our customers and product success.


Let's start a conversation and make magic happen together.



