Site Reliability Engineer

4 days ago


Bangalore, Karnataka, India Visa Full time

Company Description

Visa is a world leader in payments and technology, with over 259 billion payments transactions flowing safely between consumers, merchants, financial institutions, and government entities in more than 200 countries and territories each year. Our mission is to connect the world through the most innovative, convenient, reliable, and secure payments network, enabling individuals, businesses, and economies to thrive while driven by a common purpose - to uplift everyone, everywhere by being the best way to pay and be paid.

Make an impact with a purpose-driven industry leader. Join us today and experience Life at Visa.

We are seeking an accomplished Site Reliability Engineer (SRE) - Sr. Consultant to join our dynamic Observability team. In this senior role, you will provide technical leadership in developing and maintaining reliable, secure, and cost-effective observability solutions that support our global operations.

As the Sr. Consultant SRE, you will serve as the strategic bridge between development and operations, ensuring all systems and services are efficient, highly available, resilient, and scalable. You will collaborate closely with software engineers, system administrators, and cross-functional stakeholders to drive automation, optimize performance, and enable seamless application delivery.

You will take end-to-end ownership of critical observability initiatives, with a strong focus on availability, performance, security, and reliability. You will lead the design and implementation of robust monitoring, alerting, and automation frameworks to minimize incidents and accelerate incident resolution. Your leadership will be instrumental in guiding and mentoring the team, ensuring best practices are consistently adopted and operational excellence is maintained.

Key responsibilities include driving continuous improvement across processes, tools, and technologies; leading root cause analysis; and developing preventive measures for production incidents. You will champion a culture of collaboration, innovation, and proactive problem-solving, supporting engineering teams with the technical expertise needed to meet demanding requirements. As an integral member and leader within our Agile Scrum teams, your technical acumen, leadership skills, and ability to mentor others will be central to delivering impactful, high-quality results.

Responsibilities

  • Lead SRE and DevOps operations during APAC hours, ensuring alignment with project objectives, delivery timelines, SLAs, and OLAs.
  • Act as the primary escalation point for complex technical issues and incidents, driving resolution and communicating status to leadership and stakeholders.
  • Provide strategic input and recommendations on SRE and DevOps initiatives to management, supporting roadmap planning and resource allocation.
  • Coordinate and manage relationships with multiple stakeholders, both internal and external, across various technology domains.
  • Analyze production defects, perform in-depth root cause analysis across code, data, and infrastructure, and champion the implementation of long-term preventative solutions.
  • Mentor, guide, and inspire team members through technical leadership, code reviews, pairing, and ongoing knowledge sharing.
  • Lead security and compliance efforts by ensuring timely application of security patches and hotfixes and adherence to cybersecurity best practices.
  • Oversee the design, deployment, and continuous improvement of monitoring, alerting, and logging instrumentation, ensuring comprehensive observability.
  • Architect and drive the development of automation frameworks to optimize operational efficiency, eliminate manual toil, and streamline system integration.
  • Manage and support observability platforms including Splunk, ClickHouse, Grafana, Prometheus, M3DB, OpenTelemetry, Fluent Bit, ElasticSearch, OpenSearch, and CloudWatch.
  • Collaborate with development and product teams to design and implement scalable monitoring solutions and support the creation of reliable environments across the SDLC.
  • Promote and enforce DevOps and SRE best practices, fostering a culture of automation, reliability, and continuous improvement across the organization.
  • Design, implement, and maintain robust CI/CD pipelines, enabling rapid, reliable, and automated software delivery.
  • Administer, optimize, and scale cloud infrastructure (AWS, GCP) to ensure high availability, performance, and security.
  • Lead the adoption and management of infrastructure as code practices using tools such as Terraform, Ansible, or CloudFormation.
  • Continuously monitor and analyze system health, proactively identifying and mitigating risks to reliability and performance.
  • Oversee deployment and management of containerization and orchestration solutions (Docker, Kubernetes) for modern application delivery.
  • Drive incident management processes, including leading post-incident reviews, facilitating blameless postmortems, and implementing actionable improvements.
  • Create, maintain, and improve detailed documentation for infrastructure, processes, runbooks, and standard operating procedures.
  • Provide advanced technical support and troubleshooting, guiding team members through complex infrastructure and deployment issues.
  • Identify, propose, and implement opportunities for process, tooling, and workflow automation to drive operational excellence.
  • Lead disaster recovery planning, capacity management, and business continuity initiatives in collaboration with cross-functional teams.
  • Evaluate, recommend, and drive the adoption of new technologies, tools, and practices that enhance reliability, scalability, and observability.
  • Present technical strategies, incident findings, and project updates to executive leadership and cross-functional stakeholders.
  • Foster an inclusive and collaborative team environment, supporting professional growth and the continuous development of SRE best practices.

Visa's Observability ecosystem includes over 2,000 platform nodes, utilizing approximately 15 different tools for logging, monitoring, and tracing, alongside 80,000 client agents. The system handles daily log ingestion exceeding 100TB and oversees hundreds of critical applications, supporting vital alerts, dashboards, and reports. To maintain this high level of performance and reliability, we need a Site Reliability Engineer - Sr. Consultant with comprehensive knowledge and practical experience. This position requires an I6 5-level engineer who can operate independently with minimal supervision.

About Visa's PRE Observability Team

Visa's Product Reliability Engineering (PRE) Observability team partners with Product Development as well as Operations Infrastructure teams to build and manage innovative, reliable, scalable, secure, and cost-effective observability platform solutions. We are looking for talented Senior Site Reliability Engineers to join our driven team, with a focus on maximizing system availability, performance, security, and reliability. This dynamic role requires technical leadership, strong problem-solving skills, and expertise in coding, testing, and debugging.

This is a hybrid position. Expectation of days in office will be confirmed by your hiring manager.

Qualifications

Basic Qualifications

  • Bachelor's degree with 10-14 years of relevant professional experience.

Preferred Qualifications

  • Extensive hands-on experience with observability tools such as Splunk, ClickHouse, Grafana, Prometheus, M3DB, OpenTelemetry, Fluent Bit, ElasticSearch, OpenSearch, and CloudWatch.
  • Proven ability to set up and manage exporters (e.g., Node Exporter, Cert Exporter, and others) for metrics collection.
  • Deep experience with containerization and orchestration platforms, including Docker and Kubernetes.
  • Strong background in CI/CD pipeline management using tools such as GitHub and Ansible.
  • Proficiency with Infrastructure as Code (IaC) technologies such as Terraform and configuration management practices like GitOps.
  • Advanced scripting skills in Python and/or Shell within Linux environments; experience with Unix scripting.
  • Working knowledge of query languages such as PromQL, MS SQL, or Splunk SPL is highly desirable.
  • Cloud certifications in AWS or GCP are a significant advantage.
  • Demonstrated ability to analyze complex technical problems and solutions, and to communicate effectively at the appropriate level of detail with both technical and non-technical stakeholders.
  • Exceptional communication, collaboration, and leadership skills, with a proven track record of leading and mentoring technical teams.
  • Strong organizational and problem-solving abilities, with an aptitude for driving process improvements and operational excellence.

Additional Information

Visa is an EEO Employer. Qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability, or protected veteran status. Visa will also consider for employment qualified applicants with criminal histories in a manner consistent with EEOC guidelines and applicable local law.



  • Bangalore, Karnataka, India NatWest Group Full time

    Join us as a Site Reliability Engineer. In this key role, you'll support the improvement of non-functional and operational characteristics such as availability, performance, efficiency, change management, monitoring, security, incident response, and capacity planning of our products and services. You'll enjoy significant stakeholder interaction, working in...


  • Bangalore, Karnataka, India JPMorgan Chase Full time

    Job Category: Software Engineering. There's nothing more exciting than being at the center of a rapidly growing field in technology and applying your skillsets to drive innovation and modernize the world's most complex and mission-critical systems. As a Site Reliability Engineer III at JPMorgan Chase within the Employee Platforms team, you will solve...


  • Bangalore, Karnataka, India Akamai Full time

    Job Category: Site Reliability. Do you like collaborating across teams to solve complex problems? Do you enjoy solving large-scale distributed content delivery challenges? Join our critical Platform and Reliability Engineering Team. The Platform Reliability Engineering team defines, measures, and optimizes key performance indicators for Akamai's global network. This...


  • Bangalore, Karnataka, India Deutsche Bank Full time

    Job Title: Site Reliability Engineer. Location: Bangalore, India. Corporate Title: Associate. Role Description: You will work closely with application teams to ensure stable, well-monitored applications that are resilient to faults. You will agree and review Service Level Objectives (SLOs) to achieve high availability for applications based on their criticality...


  • Bangalore, Karnataka, India NatWest Group Full time

    Join us as a Site Reliability Engineer. You'll manage the provision of stable, resilient, reliable applications with the end goal of minimising disruption to Customer Colleague Journeys (CCJ). We'll look to you to identify and automate manual tasks and implement observability solutions, ensuring a thorough understanding of CCJ across applications. This...


  • Bangalore, India ViewSonic Full time

    Job Requirements: Bachelor's degree in Computer Science, Engineering, or a related field. 3+ years of experience in a relevant role, such as Site Reliability Engineer, DevOps Engineer, or similar, is preferred but not mandatory. Basic understanding of AWS solutions including EC2, S3, CloudWatch, Lambda, and RDS. Interest and understanding of Platform...


  • Bangalore, India HDFC Limited Full time

    Hiring for Lead / Sr. Site Reliability Engineer for Mumbai & Bangalore locations. Experience: 8-14 years. Job Purpose: Analysing, troubleshooting, and designing vital services, platforms, and infrastructure on GCP while always thinking about reliability, scalability, resilience, security, and performance. Job Responsibilities: Help build a Site...


  • Bangalore, India WhiteLotus Talent Partners Full time

    We are looking for an L0 and L1 Site Reliability Engineer (SRE) Support to join our Krutrim Cloud Site Reliability operations team and ensure the smooth functioning of our cloud infrastructure powered by OpenStack and Kubernetes. In this role, you will focus on monitoring, basic troubleshooting, and incident response, helping to maintain high...