System Reliability Engineer

3 days ago


India ANSR Full time
As a leading food and drug retailer in the United States, Albertsons Companies, Inc. operates over 2,200 stores across 35 states and the District of Columbia. We build and shape technology solutions that solve customers' problems every day, making things easier for them when they shop with us online or in a store. We have made bold, strategic moves to migrate and modernize our core foundational capabilities, positioning ourselves as the first fully cloud-based grocery tech company in the industry.

Our success is built on a one-team approach, driven by the desire to understand and enhance the customer experience. By constantly pushing the boundaries of retail, we are transforming shopping into an experience that is easy, efficient, fun and engaging.

At the Albertsons India Capability Center, we are raising the bar to grow across Technology & Engineering, AI, Digital and other company functions, and transform a 165-year-old American retailer. At Albertsons India Capability Center, associates collaborate directly with international teams, enhancing decision-making processes and organizational agility through exciting and pivotal projects. Your work will make history and help millions of lives each day come together around the joys of food and inspire their well-being.

This role is an individual contributor position responsible for building and fine-tuning the platform components for the Observability product. The candidate will work closely with the lead engineer and with the performance, data ingestion, platform DevOps, and data visualization teams under the Observability product, and with the associated monitoring technologies.

Experience in Observability and Monitoring initiatives as a Platform Engineer.

Development and implementation of build release pipelines with accountability for managing deployment schedules, issues, risks, and impediments.

Agile development experience with team member accountability for commitment and delivery each sprint.

Drive collaboration sessions among IT and business groups to facilitate optimal support and operation of the relevant applications

Provide Site Reliability Engineering techniques such as observability, alerting and performance tuning

Contribute to the design, implementation, and enhancement of critical applications

Perform proactive analysis and troubleshooting to predict and prevent production incidents

Define and contribute to monitoring capabilities for critical applications

Collaborate with key vendors on functional, performance and capacity improvements

Design and build tools to automate support and monitoring functions

Ensure that all implementations of observability meet the requirements prescribed by IT Services through the effective implementation or use of approved processes, methodologies, and deliverables.

Able to provide coding and technical direction to less experienced staff or develop highly complex original code.

Track infrastructure delivery and dependencies through to implementation.

Experience with gathering and organizing large volumes of data for instrumentation into an Enterprise Observability solution.

Experience with recommending baseline monitoring thresholds, and performance monitoring KPIs and SLAs.

Experience with installing agents, forwarders, APIs, performance monitoring alerts, dashboards, and data trend analysis.

Good knowledge and understanding of Azure foundation components, e.g. App GW, APIM, Virtual Network, NSG, Load Balancer, Azure VM, etc. Experience with databases such as Azure SQL, PostgreSQL, MySQL, MongoDB, TSDB, or similar.

Knowledge of monitoring tools such as Log Analytics, AppDynamics, Grafana, Prometheus, Splunk, and SiteScope

Hands-on Azure / GCP experience, including pulling observability data from managed services

Golang / Python coding experience or a solutioning background, with experience in SRE development and OpenTelemetry implementation

Design and develop standard Grafana dashboards for critical metrics for various Azure/GCP services using the observability data

Experience must include Java (required); Python and Golang are desired.

Experience in working with ServiceNow or similar Service Management tools

Familiarity with Cloud technologies in Azure, AWS, and Google Cloud

Experience with the PCF, Docker, and Kubernetes platforms is required.

Experience with DevOps and CI/CD tools and processes is required.

Experience in high-performance and high-frequency data streaming (using Kafka etc.) and handling large volumes of batch data is strongly preferred but not required.

Experience with Agile / Scrum methodologies is required.

We believe the successful candidate has these qualifications and experience:

4-year degree (Computer Science, Information Systems, or related functional field) and/or equivalent combination of education or work experience.

1-3+ years of experience in integration engineering related to Observability / Monitoring frameworks with open-source technologies such as Grafana, Mimir, Loki, Tempo, Fluent Bit, Vector, etc.

Hands-on experience with tools and technology is preferred.

1+ years of experience as a System Reliability Engineer is required.

Experience working with open-source platforms and OpenTelemetry libraries, e.g.
