Site Reliability Engineer
4 weeks ago
Job Description

Siemens Digital Industries Software is a leading provider of solutions for the design, simulation, and manufacture of products across many different industries. Formula 1 cars, skyscrapers, ships, space exploration vehicles, and many of the objects we see in our daily lives are being conceived and manufactured using our Product Lifecycle Management (PLM) software.

The DISW SRE organization is dedicated to enhancing service and application availability, optimizing processes by automating manual and repetitive tasks, and addressing complex technical challenges in a dynamic, collaborative, inclusive, and iterative environment. This position plays a crucial role in developing automated solutions and processes that support and sustain best-in-class cloud-based applications. The candidate will support the Siemens Xcelerator platform and will be responsible for coordinating major incident response, maintaining partner communication during service-impacting events, and facilitating resolution in compliance with service level agreements (SLAs). Strong communication and coordination skills are necessary to support core objectives. This role's success will be defined by product teams within DISW business units meeting their SLAs.

Key Responsibilities:
- Incident Management: Act as the primary point of contact and leader during major incidents, coordinating the response, communication, and resolution efforts across all involved teams.
- Incident Response: Quickly assess the severity of incidents, determine the impact, and drive the appropriate response to restore services as quickly as possible.
- Communication: Ensure clear, concise, and timely communication with stakeholders, including technical teams, management, and customers, throughout the incident lifecycle.
- Post-Incident Analysis: Lead post-incident reviews to identify root causes, drive improvements, and implement preventive measures to reduce the likelihood of recurrence.
- Collaboration: Work closely with SRE, DevOps, Development, and other relevant teams to ensure that incident management processes are well-defined and continuously improved.
- Training & Preparedness: Conduct regular incident response drills, train teams on incident management processes, and ensure readiness for handling high-severity incidents.
- Documentation: Maintain and update incident management documentation, ensuring that all procedures are up-to-date and accessible to all relevant teams.
- Monitoring & Alerts: Collaborate with SRE and monitoring teams to define and refine alerting criteria, ensuring that incidents are detected and escalated promptly.
- Continuous Improvement: Find opportunities to improve system reliability, scalability, and performance based on lessons learned from incidents.
- 24x7 On-call rotation: Participate in a 24x7 on-call rotation.

Qualifications:
- Technical Skills: Familiar with cloud infrastructure (AWS, GCP, Azure) and containerization (Docker, Kubernetes).
- Certifications: Relevant certifications (e.g., AWS Certified Solutions Architect, Certified Kubernetes Administrator) are a plus.
- Automation: Experience with automation tools and scripting languages (e.g., Python, Bash) to streamline incident response and remediation.
- Stakeholder Management: Experience aligning with cross-functional teams, including business and product stakeholders, during and after incidents.
- Metrics Ownership: Ability to define and track incident-related critical metrics (e.g., MTTR, MTTD) to drive accountability and improvement.
- Experience: Background in enterprise IT with distributed environments.
- Communication: Outstanding English communication skills, both verbal and written, as well as listening and synthesis skills.
- Incident Response: Quickly assess the severity of incidents, determine the impact, and drive the appropriate response to restore services as quickly as possible.
- Problem-Solving: Excellent troubleshooting and problem-solving skills, with the ability to quickly analyze complex systems.
- Calm Under Pressure: Ability to remain calm, focused, and effective in high-pressure situations, and to make quick, confident decisions.
- Leadership: Demonstrated experience in leading incident response efforts and managing cross-functional teams during critical situations.
- Technical Skills: Familiar with Jira Service Management (or equivalent, e.g., ServiceNow), Datadog (or equivalent, e.g., Grafana), PagerDuty (or equivalent), and Atlassian Statuspage (or equivalent).
- Driven Learner: Highly motivated and driven to learn new technologies, skills, and methodologies, continuously seeking to expand your knowledge and adapt to evolving industry trends.
- Must be willing and available to work the core hours required.

A collection of over 377,000 minds building the future, one day at a time in over 200 countries. We're dedicated to equality, and we welcome applications that reflect the diversity of the communities we work in. All employment decisions at Siemens are based on qualifications, merit, and business need. Bring your curiosity and creativity and help us shape tomorrow.

We offer a comprehensive reward package which includes a competitive basic salary, bonus scheme, generous holiday allowance, pension, and private healthcare.

Transform the everyday
Accelerate transformation
#SWSaaS
-
Site Reliability Engineer
4 days ago
Pune, India UBS Full time Job Description: Job Reference # 326131BR. Job Type: Full Time. Your role: We are seeking a highly experienced Site Reliability Engineer (SRE) to join our technology team in a mission-critical financial environment. This role is ideal for someone who has a proven track record of building and operating reliable, scalable systems in regulated industries such as...
-
Site Reliability Engineer
3 days ago
India Pagos Consultants Full time We are looking for experienced site reliability engineers to join a founding team of startup-minded individuals that will lay the groundwork for our new fintech offering. This team will play a pivotal role in spearheading innovation. As such, you will have the opportunity to shape the early architecture and design of the system and set the trajectory for its...
-
Site Reliability Engineer
2 weeks ago
Pune, India emagine Full time Job Description: Job Overview: As a Site Reliability Engineer (SRE) working in a 24/7 shift rotation, you will be responsible for ensuring the reliability, availability, and performance of critical systems and services. You will combine strong technical skills with operational excellence to proactively monitor, troubleshoot, and resolve issues. Your expertise...
-
Site Reliability Engineer
1 week ago
Pune, India NR Consulting Full time Job Description: About the Company: We are seeking a highly skilled Site Reliability Engineer (SRE) with strong expertise in Google Cloud Platform (GCP) and CI/CD automation to lead cloud infrastructure initiatives. The ideal candidate will design and implement robust CI/CD pipelines, automate deployments, ensure platform reliability, and drive...
-
Site Reliability Engineer
2 weeks ago
Pune, India Talent Worx Full time Site Reliability Engineer (SRE): At Talent Worx, we are looking for a dedicated Site Reliability Engineer (SRE) to join our team. This role involves maintaining high availability and reliability of our services through the application of software engineering practices and systems administration skills. The ideal candidate will bridge the gap between...
-
Site Reliability Engineer
1 week ago
Pune, India Siemens Digital Industries Software Full time Job Description: Siemens Digital Industries Software is a leading provider of solutions for the design, simulation, and manufacture of products across many different industries. Formula 1 cars, skyscrapers, ships, space exploration vehicles, and many of the objects we see in our daily lives are being conceived and manufactured using our Product Lifecycle...
-
Site Reliability Engineer
1 week ago
India Datum Technologies Group Full time Job Title: Site Reliability Engineer (SRE) – AWS. Experience: 8+ years. Location: Chennai / Mumbai. Work Mode: Hybrid. Key Skills: AWS, Terraform, Kubernetes, Docker, Grafana, Prometheus, Datadog. Job Summary: We are looking for a skilled Site Reliability Engineer (SRE) with strong AWS experience and a solid background in DevOps, automation, observability, and...