WCS SRE

2 days ago


Bengaluru, India | NuStar Technologies | Full time

Job Description

Location: Kolkata/Bangalore/Chennai

Experience: 8-12 years

Job Title: Site Reliability Engineer (SRE)

Position Overview: We are seeking a highly skilled and experienced Site Reliability Engineer (SRE) to join our IT Operations team. The SRE will be responsible for ensuring the reliability, availability, and performance of our applications and services. This role involves implementing and maintaining SRE best practices, developing automation tools, monitoring system health, and collaborating with development and operations teams to improve system resilience and efficiency. The ideal candidate will have a strong background in IBM WebSphere Commerce (WCS), cloud platforms, automation, and monitoring tools, along with an understanding of chaos engineering principles.

Key Responsibilities

- Reliability and Performance:
  - Ensure the reliability, availability, and performance of applications and services.
  - Implement and maintain SRE best practices to improve system resilience and efficiency.
  - Develop and maintain service level objectives (SLOs), service level indicators (SLIs), and service level agreements (SLAs).
- Automation and Tooling:
  - Develop and implement automation tools to streamline operations and reduce manual intervention.
  - Use Infrastructure as Code (IaC) tools (e.g., Terraform, Ansible) to manage and provision infrastructure.
  - Automate deployment, monitoring, and incident response processes.
- Monitoring and Incident Management:
  - Monitor system health and performance using monitoring tools (e.g., Prometheus, Grafana, ELK stack).
  - Identify and address potential issues before they impact users.
  - Develop and maintain incident response plans and ensure rapid response to incidents and outages.
- Chaos Engineering:
  - Implement chaos engineering practices to test and improve system resilience.
  - Design and execute controlled experiments to identify weaknesses and potential points of failure in the system.
  - Analyze the results of chaos experiments and implement improvements to enhance system reliability.
- Collaboration and Communication:
  - Collaborate with development and operations teams to ensure seamless integration and delivery of services.
  - Provide guidance and support to development teams on SRE best practices and tools.
  - Participate in project planning and provide input on infrastructure and operational requirements.
- Continuous Improvement:
  - Stay current with industry trends and emerging technologies, continuously seeking opportunities to improve processes and tools.
  - Foster a culture of continuous improvement within the team, encouraging innovation and the adoption of best practices.
  - Participate in training and development activities to enhance skills and knowledge in SRE and cloud technologies.
- Security and Compliance:
  - Ensure that security best practices are integrated into all aspects of the SRE processes.
  - Work closely with the security team to identify and mitigate potential vulnerabilities and ensure compliance with security policies and standards.

Qualifications

- 5+ years of IT experience with a strong IBM WebSphere Commerce (WCS) background.
- Experience with containerization technologies (e.g., Docker, Kubernetes).
- Proven experience as a Site Reliability Engineer (SRE) or similar role, with a strong background in cloud platforms (e.g., Azure, Google Cloud).
- In-depth knowledge of SRE principles and best practices.
- Bachelor's degree in Computer Science, Information Technology, or a related field.
- Experience with automation tools and frameworks (e.g., Terraform, Ansible).
- Strong understanding of monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
- Understanding of chaos engineering principles and experience implementing chaos engineering practices.
- Excellent problem-solving skills and the ability to work under pressure.
- Strong communication and interpersonal skills, with the ability to collaborate effectively with cross-functional teams.