Site Reliability Engineer

3 weeks ago

Pune, Maharashtra, India Uplers Full time

Job Description

Must-have skills:

Azure DevOps, SRE concepts, TerraData, CDC, CDC tool, NEWREL

Good-to-have skills:

AWS CloudWatch

Reflections Info Systems (one of Uplers' clients) is looking for:

A Site Reliability Engineer who is passionate about their work, eager to learn and grow, and committed to delivering exceptional results. If you are a team player with a positive attitude and a desire to make a difference, we want to hear from you.

Role Overview Description

As a Site Reliability Engineer (SRE), you will be responsible for improving the overall reliability of applications by ensuring their availability, performance, and scalability. You should be able to gather technical requirements from the DevOps team and operational requirements from the Application Support team. Because the SRE role is at the heart of solving production problems, you should take a holistic approach to troubleshooting while delving deeply into technical details. You must also acquire the domain knowledge needed to troubleshoot and recover from outages effectively, monitor applications in production, and build alerts as required.

Responsibilities include:

Work closely with the application support team.

Monitor critical applications and services to minimize downtime and ensure their availability.

Collaborate with DevOps teams to maintain and monitor CI/CD pipelines.

Deploy new versions to production environments.

Work with project teams to ensure the reliability and maintainability of new and modified releases.

Provide input to risk management practices that will anticipate reliability-related incidents that could adversely impact operations.

Document processes and monitor application performance metrics.

Continuously improve proactive monitoring, alert configuration, and incident response processes to increase reliability and reduce Mean Time to Recovery (MTTR).

Optimize performance and cost efficiency through continuous monitoring, trend analysis, and fine-tuning.

Monitor any abnormal usage that can impact the cost or performance and take corrective actions.

Proactively implement preventive measures to improve system reliability.

Maintain runbooks, Standard Operating Procedures (SOPs), diagrams, and documentation for swift incident response.

Conduct post-incident reviews to improve reliability and contribute to the development of resilience strategies.

Meet the Service Level Indicator (SLI) targets set to achieve reliability objectives.
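
For candidates less familiar with the metrics named above, here is a minimal sketch of how MTTR and an availability SLI are commonly computed. The function names, the incident timestamps, and the request counts are illustrative assumptions and do not come from the client's tooling.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average recovery time over (detected, resolved) timestamp pairs."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)


def availability_sli(good_requests, total_requests):
    """Fraction of requests served successfully (a common availability SLI)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


# Hypothetical sample data, for illustration only.
incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),  # 30 minutes
    (datetime(2024, 1, 5, 14, 0), datetime(2024, 1, 5, 15, 30)),  # 90 minutes
]
mttr = mean_time_to_recovery(incidents)
sli = availability_sli(999_500, 1_000_000)
```

With this sample data the two incidents average to one hour of recovery time, and the SLI works out to 99.95% of requests succeeding.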

Certifications:

Azure Solutions Architect Expert (Microsoft)

AWS Certified Solutions Architect (AWS)

Open Group Certified Enterprise Architect (TOGAF)

PMP or PRINCE2 in Project Management

Primary Skills:

Monitoring and Analysis

Continuously monitor CDC dashboards to track service performance and analyze reports.

Oversee production and DevOps infrastructure dashboards, ensuring system stability and identifying potential issues.

Observe alerts from New Relic and escalate them to the respective teams as needed.

Identify duplicated New Relic alerts and optimize alert configurations to reduce noise and improve efficiency.

Track daily alerts in production to enhance alert optimization strategies.

Maintain and update a list of dashboards monitored, including details such as widgets, metrics, and threshold values.

Create and manage dashboards for validating and monitoring CPU optimizations for Rapid and CDC services.

Perform sanity checks on Container Memory Utilization, Missing Pods, Container Restarts, Container CPU Utilization, Active Pods, Node Resource Consumption, and Pod Network Status to ensure system health.
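
One way to approach the duplicate-alert task above is to group alerts by a key and flag any key that fires more than once. The sketch below assumes alerts have been exported as plain dictionaries; the field names (`policy`, `condition`, `entity`) and the sample values are hypothetical and are not New Relic's actual export schema.

```python
from collections import Counter


def find_duplicate_alerts(alerts):
    """Return {(policy, condition, entity): count} for keys seen more than once."""
    counts = Counter((a["policy"], a["condition"], a["entity"]) for a in alerts)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical alert export, for illustration only.
alerts = [
    {"policy": "cdc-prod", "condition": "High CPU", "entity": "cdc-api"},
    {"policy": "cdc-prod", "condition": "High CPU", "entity": "cdc-api"},
    {"policy": "cdc-prod", "condition": "Pod restarts", "entity": "cdc-worker"},
]
duplicates = find_duplicate_alerts(alerts)
```

Here only the repeated "High CPU" alert on `cdc-api` is reported. In practice the grouping key would depend on how the team defines a duplicate, for example whether a time window is part of it.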

Release and Deployment Management

Coordinate and execute weekly production releases, ensuring services are deployed with optimized CPU values.

Update central repositories with the latest service configurations and CPU requests.

Perform post-deployment sanity checks to validate service stability after production releases.

Redeploy CDC services with optimized CPU values, ensuring system performance improvements.

Monitor new CPU optimizations for Rapid and CDC services, tracking performance improvements and resource utilization.

Incident Management and RCA Documentation

Conduct incident analysis, identifying root causes and documenting findings for continuous improvement.

Maintain detailed Root Cause Analysis (RCA) documentation to track incidents and resolutions.

Provide reports on incident trends, helping improve response times and preventive measures.

Collaboration and Communication

Participate in daily sync-ups and internal meetings to discuss ongoing tasks, challenges, and improvements.

Sync up with the Network Operations Center (NOC) team to align on monitoring strategies and escalations.

Collaborate with the Database (DB) team for performance tuning and issue resolution.

Conduct knowledge transfer (KT) sessions on Rapid Resource Optimization and related best practices.

Optimization and Continuous Improvement

Track CPU optimization efforts, ensuring proper resource allocation and utilization for Rapid and CDC services.

Analyze performance data to refine resource allocation strategies and improve system efficiency.

Identify and implement best practices for reducing alert noise and optimizing monitoring configurations.

Secondary Skills:

- Technical Knowledge
- Fluent in key AWS services (EBS, S3, AWS Compute, Storage, RDS, etc.).
- Expertise in Kubernetes or any container orchestration system.
- Knowledge of Infrastructure as Code.
- Linux system administration knowledge.
- Knowledge of RDBMS and document databases.
- Knowledge of monitoring tools, including AWS CloudWatch and New Relic.
- Additional certification in Microsoft, Linux, Cisco, AWS or similar technologies is a plus.

Specialist - Site Reliability Engineer

5 hours ago

Pune, Maharashtra, India Accelya Group Full time ₹ 20,00,000 - ₹ 25,00,000 per year

For more than 40 years, Accelya has been the industry's partner for change, simplifying airline financial and commercial processes and empowering the air transport community to take better control of the future. Whether partnering with IATA on industry-wide initiatives or enabling digital transformation to simplify airline processes, Accelya drives the...
Specialist - Site Reliability Engineer

6 hours ago

Pune, Maharashtra, India Accelya Group Full time ₹ 15,00,000 - ₹ 25,00,000 per year

For more than 40 years, Accelya has been the industry's partner for change, simplifying airline financial and commercial processes and empowering the air transport community to take better control of the future. Whether partnering with IATA on industry-wide initiatives or enabling digital transformation to simplify airline processes, Accelya drives the...
Site Reliability Engineer

2 weeks ago

Pune, Maharashtra, India ENGEL Full time ₹ 6,00,000 - ₹ 18,00,000 per year

Company Description: ENGEL is a global leader in the production of injection moulding machines and their automation. The company produces systems that manufacture plastic parts used in various industries such as automotive, packaging, and consumer goods. With nine production plants worldwide and subsidiaries and representatives in over 85 countries, ENGEL...
Site Reliability Engineer

4 hours ago

Pune, Maharashtra, India Idox Full time ₹ 9,00,000 - ₹ 12,00,000 per year

Site Reliability Engineer (AWS). Pune, India. About the role: We are seeking a driven and detail-oriented Site Reliability Engineer (SRE) with a strong passion for building resilient, scalable cloud infrastructure. This role offers an exciting opportunity for professionals with 2 to 4 years of experience in DevOps, Cloud, or Infrastructure to deepen their...
Site Reliability Engineer

3 weeks ago

Pune, Maharashtra, India Reveille Technologies Full time

Job Summary: We are seeking a skilled and proactive Site Reliability Engineer (SRE) with a strong DevOps mindset and hands-on experience in application troubleshooting. The ideal candidate will be responsible for ensuring the reliability, scalability, and performance of our applications and infrastructure. This role requires a blend of software engineering,...
Site Reliability Engineer

3 weeks ago

Pune, Maharashtra, India Allianz Full time

Site Reliability Engineer (SRE) - One Identity Access Management. The primary objective of the Site Reliability Engineer (SRE) specializing in One Identity Access Management is to ensure the seamless operation, reliability, and scalability of IAM systems within the organization. This role is critical in maintaining system integrity, optimizing performance, and...
Site Reliability Engineering

9 hours ago

Pune, Maharashtra, India Deutsche Bank Full time ₹ 10,00,000 - ₹ 25,00,000 per year

Site Reliability Engineering (SRE) Lead, VP. Job ID: R0402474. Full/Part-Time: Full-time. Regular/Temporary: Regular. Listed: Location: Pune. Position Overview. Job Title: Site Reliability Engineering (SRE) Lead. Corporate Title: Vice President. Location: Pune, India. Role Description: We are seeking an experienced and highly capable Site Reliability Engineering (SRE) Lead to...
Site Reliability Engineer

3 weeks ago

Pune, Maharashtra, India LanceSoft, Inc Full time

Role and Responsibilities : Reporting to Engineering, the Site Reliability Engineer will play a critical role in driving innovation and growth for the Banking Solutions, Payments, and Capital Markets business. In this role, the candidate will have the opportunity to make a lasting impact on the company's transformation journey, drive customer-centric...
Site Reliability Engineer

3 weeks ago

Pune, Maharashtra, India ZOOP Full time

Role : Site Reliability Engineer. Location : Pune (on-site). Experience : 3+ years. Someone who has experience setting up an in-house monitoring platform with 99.99% uptime SLA using Victoria Metrics & Prometheus in Multi Region. Site Reliability Engineer Zoop. The Opportunity : We're seeking a Senior Site Reliability Engineer to elevate and standardize our...
Site Reliability Engineer

2 weeks ago

Pune, Maharashtra, India Global Payments Inc. Full time US$ 80,000 - US$ 1,50,000 per year

Every day, Global Payments makes it possible for millions of people to move money between buyers and sellers using our payments solutions for credit, debit, prepaid and merchant services. Our worldwide team helps over 3 million companies, more than 1,300 financial institutions and over 600 million cardholders grow with confidence and achieve amazing...
