Site Reliability Engineer

4 weeks ago


Gurgaon, Haryana, India | McCain Foods | Full time

JOB RESPONSIBILITIES:

● Work with stakeholders such as product owners and engineering teams to define

service level objectives (SLOs) for system operations.

● Track performance against SLOs in partnership with monitoring teams or other

stakeholders, and ensure systems continue to meet SLOs over time.

● Create dashboards and reports to communicate key metrics.

● Create software to improve performance, scalability, and stability of systems.

● Collaborate with development teams to promote the concept of reliability

engineering during all phases of the software development lifecycle to detect and

correct performance issues and meet availability goals.

● Design, code, test, and deliver infrastructure software to automate manual

operational work (i.e., "toil").

● Participate in operational support and on-call rotation shifts for supported

systems and products.

● Conduct blameless postmortems to troubleshoot priority incidents.

● Perform analytics on previous incidents to understand root causes and better

predict and prevent future issues.

● Use automation to reduce the probability and/or impact of problem recurrence.

● Identify, evaluate, and recommend monitoring tools and diagnostic techniques to

improve system observability.

● Participate in system design consulting, platform management, capacity planning

and launch reviews.

● Collaborate and share lessons learned regarding performance and reliability

issues with all stakeholders including developers, other SREs, operations teams,

and project management teams.

● Participate in communities of practice to share knowledge and foster continuous

improvement.

● Remain current on site reliability engineering methods and trends such as

observability-driven development and chaos engineering.

● Drive continuous improvement in software quality and infrastructure reliability and

resilience.

● Oversee, design, implement, and manage DevOps capabilities using continuous

integration/continuous delivery toolsets and automation.

● The SRE will focus on Application Performance Monitoring (APM), including

design, solutioning, proofs of concept (POCs), and profiling and tuning of

application compute and data nodes and resources. Some key duties of this role are:

● Assist in defining the SRE and observability architecture and design.

● Analyze and implement new features of the SRE and observability platform.

● Full-stack monitoring across all layers

(Infrastructure/Network/Database/Application/Services/Third Party)

● Provide hands-on technical leadership in the selection and implementation of

commercial and open-source monitoring tools.

● Implement an SRE-driven automated workflow: incident detection -> automated

engagement -> triage/mitigation -> RCA/postmortems -> problem task remediation.

● AI-driven correlation, de-duplication, noise reduction, and auto-remediation.

● Provide weekly monitoring and alert analysis and continuous improvement

● Create a model of the run-time environment (discovery)

● Profile the performance and behavior of user-defined transactions

● Establish Performance metrics from each of the applications/systems technical

components (Webserver, App server, Database, etc.)

● Application performance management database

● APM tool Administration and Support

● Monitoring Tool design and implementation

● APM Setup/Usage policies and guidelines

● Capacity Planning and monitoring

● Monitor selected application performance

● Report vital statistics of application performance in production

● Make recommendations for improvements with Service Desk

● Make recommendations for adjustments to runtime resources to improve overall

performance profile

KEY QUALIFICATIONS & EXPERIENCE:

● Strong problem solving and analytical skills.

● Strong interpersonal and written and verbal communication skills.

● Highly adaptable to changing circumstances. Interest in continuously learning

new skills and technologies.

● Experience with programming and scripting languages (e.g. Java, C#, C++,

Python, Bash, PowerShell).

● Experience with incident and response management.

● Experience with Agile and DevOps development methodologies.

● Experience with container technologies and supporting tools (e.g. Docker

Swarm, Podman, Kubernetes, Mesos).

● Experience working in cloud ecosystems (Microsoft Azure, AWS, Google

Cloud Platform).

● Experience with monitoring and observability tools (e.g. Splunk, CloudWatch,

AppDynamics, New Relic, ELK, Prometheus, OpenTelemetry).

● Experience with configuration management systems (e.g. Puppet, Ansible, Chef,

Salt, Terraform).

● Experience working with continuous integration/continuous deployment tools

(e.g. Git, TeamCity, Jenkins, Artifactory).

● Experience with GitOps-based automation is a plus.

● Bachelor's degree (or equivalent years of experience).

● 5+ years of relevant work experience. SRE experience preferred.

● Background in manufacturing or platform/tech companies is preferred.

● Must have public cloud provider certifications (Azure, GCP, or AWS).

● CNCF certification is a plus.

OTHER INFORMATION:

Travel: as required.

The job is primarily performed in a hybrid office environment.


