Site Reliability Engineer

4 hours ago

Bengaluru, Karnataka, India | WhiteLotus Talent Partners | Full time

We are looking for L0 and L1 Site Reliability Engineer (SRE) Support staff to join our Krutrim Cloud Site Reliability operations team and keep our cloud infrastructure, which is powered by OpenStack and Kubernetes, running smoothly. In this role, you will focus on monitoring, basic troubleshooting, and incident response to help maintain high system availability, reliability, and performance. You will identify and resolve simple issues and escalate more complex problems to senior SREs when needed.

The ideal candidate should have a basic understanding of cloud infrastructure (especially OpenStack and Kubernetes), containerized environments, and system monitoring. This position offers an excellent opportunity for someone looking to grow into a more advanced SRE or DevOps role.

Key Responsibilities:

For L0 Support (Level 0):

Incident Monitoring & Triage:
Respond to system alerts and monitor infrastructure health for both OpenStack and Kubernetes using tools such as Prometheus, Grafana, and related observability tools.
Identify low-level issues and follow runbooks or predefined scripts to perform first-level triage.
Document and escalate unresolved incidents to L1 or L2 based on established escalation protocols.
System Health Checks:
Perform daily health checks for Kubernetes pods, nodes, and OpenStack instances.
Verify basic functionality of VMs, containers, and network services within the environment.
Basic Troubleshooting:
Resolve simple issues such as VM reboots, pod failures, and network connectivity issues within OpenStack or Kubernetes environments.
Follow predefined steps for basic troubleshooting tasks such as restarting services or clearing logs.
Ticket Management:
Log incidents and issues into a ticketing system (e.g., JIRA, ServiceNow) for tracking and escalation.
Update incident tickets and provide relevant information for ongoing resolution efforts.

=========================================================================================================

For L1 Support (Level 1):

Incident Resolution:
Investigate and resolve issues more complex than those handled at L0, such as Kubernetes pod crashes, network misconfigurations in OpenStack, and minor service disruptions.
Work with tools like kubectl to troubleshoot Kubernetes pods and nodes, and OpenStack CLI to diagnose problems with VMs, storage, and networks.
Automation & Scripting:
Automate routine tasks, such as VM provisioning, pod deployments, or status checks, using scripting languages such as Python or Bash.
Improve automation workflows based on feedback and frequently encountered issues.
Log Aggregation & Monitoring:
Review logs and metrics collected from the ELK Stack, Prometheus, Grafana, or other logging and monitoring tools to detect trends and potential issues.
Analyze logs and metrics from OpenStack and Kubernetes clusters to pinpoint underlying problems (e.g., high CPU usage, memory leaks).
Basic Network & Storage Management:
Investigate networking issues related to Neutron (for OpenStack) and CNI configurations (for Kubernetes).
Manage storage resources within OpenStack and Kubernetes (e.g., creating persistent volumes, debugging storage access issues).
Collaboration & Escalation:
Work closely with L2 and L3 engineers for complex troubleshooting or advanced system issues that require in-depth knowledge.
Share knowledge with the team and assist in creating new documentation or updating existing troubleshooting guides.
User and Permissions Management:
Perform basic user management tasks within OpenStack (e.g., creating and managing projects (tenants) and security groups).
Review and modify Kubernetes RBAC (Role-Based Access Control) settings based on user access needs.

Skills & Qualifications:

Required Skills:

Basic Cloud & Kubernetes Knowledge:
Familiarity with OpenStack architecture (e.g., Nova, Neutron, Cinder).
Basic understanding of Kubernetes components, including pods, services, deployments, and namespaces.
Systems & Networking:
Knowledge of Linux/Unix-based operating systems (e.g., Ubuntu, CentOS, Red Hat).
Understanding of networking concepts like DNS, IP routing, and VLANs in cloud environments.
Monitoring & Alerting Tools:
Familiarity with monitoring tools like Prometheus, Grafana, Zabbix, or CloudWatch for alert management and system health monitoring.
Troubleshooting & Incident Response:
Experience using log aggregation tools (ELK Stack, Splunk) and interpreting logs for incident detection.
Ability to perform basic troubleshooting steps (e.g., restarting services, running basic shell commands) to resolve issues.
Communication Skills:
Strong communication skills to collaborate effectively with senior SREs, developers, and other teams.
Ability to document incidents, solutions, and troubleshooting steps clearly.

Preferred Skills:

Basic Scripting & Automation:
Exposure to scripting languages such as Bash, Python, or Go to automate basic administrative tasks.
Cloud Platform Experience:
Familiarity with other cloud technologies such as AWS, Azure, or Google Cloud Platform.
Certifications:
Certifications such as CompTIA Linux+, AWS Certified Solutions Architect, Certified Kubernetes Administrator (CKA), or Certified OpenStack Administrator (COA) are a plus.
