Site Reliability Engineer

2 days ago

Bengaluru, Karnataka, India Walmart Full time ₹ 12,00,000 - ₹ 36,00,000 per year

About Team:

Transactional System provides core transactional systems to enable segment and technology partners in creating wonderful omni experiences with speed and leverage. We are a highly motivated group of engineers working in an agile group to solve sophisticated, high-impact problems. This role is part of the Cloud Powered Checkout (CPC) team and will build the next-generation multi-tenant, client-agnostic, highly scalable omnichannel checkout solution to enable a frictionless customer checkout experience across all sales channels globally. We process millions of orders daily through our high-performance checkout services running in Edge and Cloud.

As a Site Reliability Engineer on the CPC team, you will work with L2, other dependent applications, the platform team, DevOps, and engineering practitioners to proactively maintain the mission-critical infrastructure, cloud platforms, microservices, tools, and processes that ensure the highest levels of availability and reliability for CPC applications.

Our team works closely with our US stores and eCommerce business to better serve customers by empowering team members, stores, and merchants with technological innovation. From groceries and entertainment to sporting goods and crafts, Walmart U.S. offers an extensive selection that our customers value, whether they shop online, through one of our mobile apps, or in-store. Focus areas include customers, stores and employees, in-store service, merchant tools, merchant data science, and search and personalization.

What you'll do:

Incident triage, escalation and resolution: Triage site-impacting production issues by quantifying impact, severity and urgency, analyzing systems for quick remediation, engaging the right teams for recovery (reduce MTTE, Mean Time to Engage), and focusing on immediate restoration (reduce MTTR, Mean Time to Restore) of large-scale enterprise systems.
Alerts, monitoring, log analysis: Detect and analyze monitoring graphs and alerts to identify systems causing production impact (reduce MTTD, Mean Time to Detect), using tools such as Grafana, Prometheus, MMS, ServiceNow, JIRA, Dynatrace, and Splunk.
Enhance alerting solutions: Design and implement JavaScript to integrate alerting tools with service API endpoints for tools such as ServiceNow, Spotlight, Splunk, and xMatters. Requires knowledge of monitoring and alerting tools; monitoring metrics and key performance indicators (for example, availability, MTBF, MTTR); SLIs and SLOs (for example, request latency, availability, error rates, saturation); distributed tracing; and alerting logic. Demonstrates awareness of the metrics used to monitor software or system performance. Monitors current performance data to ensure adherence to defined SLOs and SLIs for simple applications/systems. Demonstrates awareness of the different types of alerts generated by the monitoring tools and of infrastructure and application metrics.
Disaster recovery planning: Requires knowledge of disaster recovery procedures and processes, and of enterprise disaster recovery systems. Works with business partners to identify and document critical applications.
Performance and optimization: Requires knowledge of Unix/Linux performance tuning; Java/Node.js/Tomcat/Apache tuning and optimization; and chaos tools. Uses established criteria (for example, probability of failure, frequency of failure) to measure site reliability. Monitors site reliability conditions and new reliability requirements.
Work on Product Enrichment & Content Services projects at Walmart: Develop enterprise monitoring and use tooling such as Grafana and Splunk to improve visibility, proactively detect issues, and restore system availability.
Develop tools and support: Design and develop solutions for widespread internal communications, for cloud application support, and for workflows around infrastructure availability issues across various internal applications, using programming languages such as Java, JavaScript (React, Node.js), Python, and shell, along with technologies such as Prometheus and database query languages. Design and develop a UI tool to display Item Content Quality data on a dashboard using AngularJS, ReactJS, HTML5 & CSS3, etc.
Create and maintain playbooks.
Document the steps to analyze issues correctly and engage the right teams for CPC, dependent downstream services, and platform teams.
Handle deployments: Streamline the deployment process and own it as a single team. Understand and explore post-deployment validations and back-out steps to make the application more resilient.
Coordinate with platform teams for non-app releases such as VM upgrades, DB maintenance, and other environment-related tasks.
Participate in rotating on-call duties and work across different time zones with a multinational team.
Responsible for timely root cause analysis (RCA) of production issues.
Develop reusable tooling and processes to drive and improve customer experience and lower operational costs.
Understand DevOps industry best practices.
Help teams build highly observable and resilient systems.
Collaborate with developers to capture requirements and understand pain points.
Build reusable tools, libraries, and dashboards that can be used across DevOps/SRE teams.

What you'll bring:

Bachelor's degree in Computer Science, Engineering or a related discipline.
5+ years of hands-on experience in Site Reliability Engineering, Operations & Development, with JavaScript, Java, RESTful services, Git, Maven, Jenkins, DevOps, containerization, Docker, Kubernetes, Azure, Google Cloud, Kafka, Azure Cosmos, Azure SQL, Mega cache, CI/CD, Prometheus, Grafana, Splunk, etc.
Automation and Self-healing: Demonstrate knowledge of scripting and software development for automation and self-healing of multi-cloud environments. Help enhance existing solutions by developing automation with Docker, Kubernetes and working with DevOps and Engineering partners.
Excellent end to end technical understanding of core infrastructure, cloud services, platforms, and micro-services.
Ability to triage effectively: detect issues and distinguish symptom from cause.
Identify and drive continuous improvement efforts to reduce waste (eliminate, automate or streamline).
Influence the design of system architecture and tactical solutions.
Familiar with log-centric tooling. Produce time-series data and reusable dashboards for use both during and after an event.

About Walmart Global Tech

Imagine working in an environment where one line of code can make life easier for hundreds of millions of people. That's what we do at Walmart Global Tech. We're a team of software engineers, data scientists, cybersecurity experts and service professionals within the world's leading retailer who make an epic impact and are at the forefront of the next retail disruption. People are why we innovate, and people power our innovations. We are people-led and tech-empowered.

We train our team in the skillsets of the future and bring in experts like you to help us grow. We have roles for those chasing their first opportunity as well as those looking for the opportunity that will define their career. Here, you can kickstart a great career in tech, gain new skills and experience for virtually every industry, or leverage your expertise to innovate at scale, impact millions and reimagine the future of retail.

Benefits

Beyond our great compensation package, you can receive incentive awards for your performance. Other great perks include a host of best-in-class benefits: maternity and parental leave, PTO, health benefits, and much more.

Belonging

We aim to create a culture where every associate feels valued for who they are, rooted in respect for the individual. Our goal is to foster a sense of belonging, to create opportunities for all our associates, customers and suppliers, and to be a Walmart for everyone.

At Walmart, our vision is "everyone included." By fostering a workplace culture where everyone is—and feels—included, everyone wins. Our associates and customers reflect the makeup of all 19 countries where we operate. By making Walmart a welcoming place where all people feel like they belong, we're able to engage associates, strengthen our business, improve our ability to serve customers, and support the communities where we operate.

Equal Opportunity Employer

Walmart, Inc., is an Equal Opportunities Employer – By Choice. We believe we are best equipped to help our associates, customers and the communities we serve live better when we really know them. That means understanding, respecting and valuing unique styles, experiences, identities, ideas and opinions – while being inclusive of all people.

Site Reliability Engineering

2 weeks ago

Bengaluru, Karnataka, India Thakral One Full time US$ 60,000 - US$ 1,20,000 per year

Company Description: Thakral One, headquartered in Singapore, is a technology consulting and services company with a strong presence across Asia. The company specializes in technology-driven consulting, custom solution development, data analytics, and leveraging cloud capabilities to deliver enhanced decision support and practical outcomes. Collaborating...
Site Reliability Engineering

2 weeks ago

Bengaluru, Karnataka, India Viraaj HR Solutions Private Limited Full time ₹ 12,00,000 - ₹ 36,00,000 per year

Site Reliability Engineer (SRE). About the Opportunity: A fast-growing organization in the Enterprise Cloud Infrastructure & SaaS sector delivering highly available, mission-critical services to enterprise customers. We are hiring an on-site Site Reliability Engineer in India to own reliability, automation, and operational excellence across cloud-native...
Site Reliability Engineer

5 days ago

Bengaluru, Karnataka, India super Full time ₹ 12,00,000 - ₹ 24,00,000 per year

Site Reliability Engineer (SRE) Level 3. Overview: A Site Reliability Engineer (SRE) Level 3 is a senior technical leadership role focused on designing, implementing, and maintaining large-scale, complex, and highly reliable systems. This role emphasizes a blend of software and systems engineering to ensure the availability, latency, performance, and capacity...
Site Reliability Engineer

3 days ago

Bengaluru, Karnataka, India eBay Full time ₹ 12,00,000 - ₹ 36,00,000 per year

At eBay, we're more than a global ecommerce leader; we're changing the way the world shops and sells. Our platform empowers millions of buyers and sellers in more than 190 markets around the world. We're committed to pushing boundaries and leaving our mark as we reinvent the future of ecommerce for enthusiasts. Our customers are our compass, authenticity...
Site Reliability Engineer

7 days ago

Bengaluru, Karnataka, India Zetamicron Full time ₹ 12,00,000 - ₹ 36,00,000 per year

Job Title: Site Reliability Engineer (SRE). About the Role: We are seeking a highly skilled and proactive Site Reliability Engineer (SRE) to ensure the stability, scalability, and reliability of our platform. The ideal candidate will have strong experience in managing production environments, automating operational processes, and enhancing system performance...
Site Reliability Engineer

2 days ago

Bengaluru, Karnataka, India Barycenter Technologies Full time ₹ 5,00,000 - ₹ 15,00,000 per year

Job Description: Site Reliability Engineer (SRE). Must-have skills: Kubernetes (networking, storage), Python, and Linux. Good-to-have skills: reporting and monitoring tools (Grafana, Loki, Dynatrace).
Site Reliability Engineer

1 week ago

Bengaluru, Karnataka, India Chevron Full time ₹ 20,00,000 - ₹ 25,00,000 per year

Total Number of Openings: 2. About the position: Come join our Subsurface Digital Platform, where we are driving continuous innovations to improve reliability, scalability and sustainability of Chevron business via Chevron's Digital Transformation. We are seeking a T-shaped dynamic Senior Site Reliability Engineer to lead and provide end-to-end solution support...
Site Reliability Engineer

5 days ago

Bengaluru, Karnataka, India Luxoft Full time ₹ 12,00,000 - ₹ 36,00,000 per year

Project description: Luxoft partners with a next-generation digital bank, built from the ground up to deliver seamless, secure, and scalable financial services. Our platform is cloud-native, API-first, and focused on reliability, speed, and security. We are growing fast and looking for top-tier Site Reliability / Ops Engineers to join our core team and help run...
Site Reliability Engineer

1 week ago

Bengaluru, Karnataka, India Empower Full time ₹ 12,00,000 - ₹ 36,00,000 per year

Our vision for the future is based on the idea that transforming financial lives starts by giving our people the freedom to transform their own. We have a flexible work environment, and fluid career paths. We not only encourage but celebrate internal mobility. We also recognize the importance of purpose, well-being, and work-life balance. Within Empower and...
Site Reliability Engineer

2 weeks ago

Bengaluru, Karnataka, India d416f97b-2589-437a-8e64-3348cfe4008b Full time ₹ 12,00,000 - ₹ 36,00,000 per year

Hiring Site Reliability Engineers. Exp: 2.5+ years (excluding internship). Location: Bangalore. Apply Here: The engineer will work in the Reliability and Productivity Engineering team and is responsible for building industry-standard, large-scale platforms to be utilised across FK that help to significantly improve the reliability of systems and bring...
