Site Reliability Professional

4 days ago


Bengaluru, Karnataka, India beBeeReliability Full time US$ 2,00,000 - US$ 2,50,000
Job Overview:
Palo Alto Networks is seeking a skilled Senior Site Reliability Engineer to join our Infrastructure & Cloud Operations team. In this role, you will work closely with our Network, Compute, Security, Database, Applications, and other teams to provide availability, reliability, and observability for our global IT infrastructure environments.

The ideal candidate will have hands-on experience with enterprise infrastructure and application monitoring and reporting tools, containers and orchestration, and Infrastructure as Code. They will also possess strong scripting skills and be fluent in Linux. Additional qualifications include proficiency in CI/CD platforms, excellent problem-solving skills, and the ability to work independently under pressure.

Responsibilities:
  • Implement and support the Linux infrastructure as code on which our globally distributed customer-facing platform runs.
  • Provision, configure, and support a resilient hybrid cloud deployment architecture using the automation framework, and make it more efficient.
  • Manage the Linux infrastructure CI/CD platform, work with other SREs to deploy and maintain the automation framework, contribute to capacity planning, and create and review PKI operational runbooks.
  • Manage scalability, capacity planning, redundancy, and resiliency.
  • Maintain service availability and performance SLAs based on business and product requirements.
  • Contribute to documentation related to design, deployment, validation, operations and DR/BCP.
  • Design proactive service monitoring, alerting and trend analysis of underlying infrastructure, and support the operations team in implementation.
  • Build and operate compute fabric for thousands of VMs and Kubernetes clusters. Develop scripts, build tools, and write code to automate routine tasks.
  • Provide technical support to platform users.
  • Support security implementation and respond to audits of the environment.
  • Plan maintenance windows, write up change requests, and present technical updates.
  • Participate in on-call support, including root cause analysis (RCA) as required.
  • Design and implement network, compute and application-level monitoring solutions.
  • Implement integrated and automated processes that drive operational excellence.
  • Advise on industry best practices as they relate to new product selection.
  • Drive operational cadences around business planning and performance management to ensure the efficient running of the IT org.

Requirements:
  • First-hand experience with Enterprise infrastructure and application monitoring and reporting tools.
  • Strong working experience and exposure to containers and orchestration (Docker, Kubernetes).
  • Infrastructure as Code knowledge - Terraform, Ansible, Git, Puppet.
  • Fluent scripting skills, preferably in Python, Shell, or Bash.
  • Exposure to public cloud platforms: GCP (Google Cloud) or AWS.
  • Proficient in CI/CD platforms like Jenkins, CircleCI, etc.
  • Excellent problem-solving skills; ability to multi-task and prioritize.
  • Ability to work independently; works well under pressure.
  • Solid communication skills and comfort working in a fast-paced technical environment.
  • Background knowledge of network and security technologies.
  • Strong hands-on Linux experience in managing and supporting Linux server infrastructure in CentOS/RHEL/Ubuntu.
  • Bachelor's/Master's degree in Computer Science, Information Technology, or a technical stream, with an equivalent combination of work experience required.
  • Design and performance tuning for Linux infrastructure and API, in-depth knowledge of multi-tier web applications.
  • Experience in developing and managing APIs, understanding of API infrastructure optimization and security.
  • In-depth knowledge of Certificate Lifecycle Management.
  • Fluent in Linux security and system hardening, vulnerability management, and the patching process. Familiarity with CIS compliance levels.
  • Must be comfortable with Ansible, Chef, or a similar configuration management tool to manage infrastructure as code, and with source code control systems such as Git or SVN.
  • Ability to work cross-functionally across multiple business units, such as product development and engineering.
  • Must be able to collaborate with a global team spread across multiple time zones.
  • Passion, drive, energy, a sense of humour and a great attitude.
  • 6+ years of relevant experience and a Bachelor's or Master's degree in Computer Science or a related technical field.
  • Experience with administration and orchestration of cloud computing (AWS, GCP, etc.) running virtual or container environments.
  • Good user and admin Linux skills (Ubuntu a plus). Experience with virtual networking.
  • Working experience with IaC tools like Terraform and Ansible. Knowledge of Python and shell scripting.
  • Experience with CI/CD development using platforms like Jenkins, Harness, and Artifactory.
  • Solid problem solving, troubleshooting, critical thinking, communication, and teamwork skills.
  • Passion for automation and monitoring instrumentation in the code.
  • Fluency in coding with one or more of Python, Go, and Java. You will have to take coding and design tests as required.
  • Experience in an Infrastructure as Code environment (Terraform, Ansible). You will be asked to write and troubleshoot IaC code during the interview.
  • Proficient in Kubernetes-based deployments and CI/CD platforms like Jenkins, Harness, etc.
  • Takes great care in documenting conceptual work and detailed design specifications, and can present ideas to engineers and engineering leaders.
  • Knowledge of AIOps and the application of machine learning/artificial intelligence in cloud infrastructure or IT operations.

Nice-to-Haves:
  • Development of self-healing infrastructure and applications.
  • Understanding of Big data, data analytics theory and application.
  • Exposure to Enterprise Business Applications, ITSM frameworks and tools.

What We Offer:
  • A competitive compensation package.
  • Opportunities for professional growth and development.
  • A collaborative and dynamic work environment.
  • A chance to work on challenging projects that impact the company's success.

