Site Reliability Engineer

5 days ago

Bangalore district, India NR Consulting Full time

Total Experience - 7+ Years

Relevant Experience - 5+ Years

Must have at least 1 year of experience with GPUs

Notice Period - up to 30 Days

JD:

We are seeking a skilled DevOps and AI Cloud Infrastructure Engineer to provision, deploy, manage, and optimize our GPU-based compute environment, ensuring high availability, performance, and security for compute-intensive workloads. The ideal candidate will have expertise in Linux system administration, cloud platforms, containerization, GPU hardware management, and cluster computing, with a focus on supporting AI/ML and high-performance computing (HPC) workloads. In this role, you will also provide technical support to investigate and resolve customer-reported issues related to the GPU-based compute environment. You will work closely with architects, AI engineers, and software developers to ensure seamless deployment, scalability, and reliability of our cloud-based AI/ML pipelines and GPU-based compute environments.

Key Responsibilities

- Infrastructure Management: Provision, deploy, and maintain scalable, secure, and high-availability cloud infrastructure on platforms such as DigitalOcean to support AI workloads.

- Documentation: Maintain clear documentation for infrastructure setups and processes.

- System Management: Administer and maintain Linux-based servers and clusters optimized for GPU compute workloads, ensuring high availability and performance.

- GPU Infrastructure: Configure, monitor, and troubleshoot GPU hardware (e.g., NVIDIA GPUs) and related software stacks (e.g., CUDA, cuDNN) for optimal performance in AI/ML and HPC applications.

- Troubleshooting: Diagnose and resolve hardware and software issues related to GPU compute nodes and performance issues in GPU clusters.

- High-Speed Interconnects: Implement and manage high-speed networking technologies like RDMA over Converged Ethernet (RoCE) to support low-latency, high-bandwidth communication for GPU workloads.

- Automation: Develop and maintain Infrastructure as Code (IaC) using tools like Terraform and Ansible to automate provisioning and management of resources.

- CI/CD Pipelines: Build and optimize continuous integration and deployment (CI/CD) pipelines for testing GPU-based servers and managing deployments using tools like GitHub Actions.

- Containerization & Orchestration: Build and manage LXC-based containerized environments to support cloud infrastructure and provisioning toolchains.

- Monitoring & Performance: Set up and maintain monitoring, logging, and alerting systems (e.g., Prometheus, VictoriaMetrics, Grafana) to track system performance, GPU utilization, resource bottlenecks, and uptime of GPU resources.

- Security and Compliance: Implement network security measures, including firewalls, VLANs, VPNs, and intrusion detection systems, to protect the GPU compute environment and comply with standards like SOC 2 or ISO 27001.

- Cluster Support: Collaborate with other engineers to ensure seamless integration of networking with cluster management tools like Slurm or PBS Pro.

- Scalability: Optimize infrastructure for high-throughput AI workloads, including GPU and auto-scaling configurations.

- Collaboration: Work closely with architects and software engineers to streamline model deployment, optimize resource utilization, and troubleshoot infrastructure issues.

Required Qualifications

- Experience: 3+ years of experience in DevOps, Site Reliability Engineering (SRE), or cloud infrastructure management, with at least 1 year working on GPU-based compute environments in the cloud.

Site Reliability Engineer

2 weeks ago

Bangalore district, India ViewSonic Full time

Job Requirements: Bachelor's degree in Computer Science, Engineering, or a related field. 3+ years of experience in a relevant role, such as Site Reliability Engineer, DevOps Engineer, or similar, is preferred but not mandatory. Basic understanding of AWS solutions including EC2, S3, CloudWatch, Lambda, and RDS. Interest and understanding of Platform...
Site Reliability Engineer

7 days ago

Bangalore district, India HDFC Limited Full time

Hiring for Lead / Sr Site Reliability Engineer for Mumbai & Bangalore Location. Experience: 8 - 14 Years. Job Purpose: Analysing, troubleshooting, and designing vital services, platforms, and infrastructure on GCP while always thinking about reliability, scalability, resilience, security, and performance. Job Responsibilities: Help build a Site...
Site Reliability Engineer

1 week ago

Bangalore district, India Trantor Full time

Job Title: Site Reliability Engineer. Role: Contract (9 months, extendable). Exp: 5+ years. Loc: Bangalore (Hybrid). Notice: Immediate joiner only. Duties: Responsible for maintaining and scaling production services and servers across multiple data centers for complex and data-intensive cloud services. Improve scalability, service reliability, capacity,...
Site Reliability Engineer

2 weeks ago

Bangalore district, India Synechron Full time

We have an immediate opportunity for an SRE (Senior Site Reliability Engineer) with 5 to 9 years of experience. Synechron – Bangalore. Job Role: SRE (Senior Site Reliability Engineer). Job Location: Bangalore. Notice Period: Within 30 days. About Synechron: We began life in 2001 as a small, self-funded team of technology specialists. Since then, we’ve grown our...
Site reliability engineer

2 weeks ago

Bangalore, India ViewSonic Full time

Job Requirements: Bachelor's degree in Computer Science, Engineering, or a related field. 3+ years of experience in a relevant role, such as Site Reliability Engineer, DevOps Engineer, or similar, is preferred but not mandatory. Basic understanding of AWS solutions including EC2, S3, CloudWatch, Lambda, and RDS. Interest and understanding of Platform...
Site reliability engineer

2 days ago

Bangalore, India ViewSonic Full time

Job Requirements: Bachelor's degree in Computer Science, Engineering, or a related field. 3+ years of experience in a relevant role, such as Site Reliability Engineer, DevOps Engineer, or similar, is preferred but not mandatory. Basic understanding of AWS solutions including EC2, S3, CloudWatch, Lambda, and RDS. Interest and understanding of Platform...
Site reliability engineer

2 weeks ago

Bangalore, India HDFC Limited Full time

Hiring for Lead / Sr Site Reliability Engineer for Mumbai & Bangalore Location. Experience: 8 - 14 Years. Job Purpose: Analysing, troubleshooting, and designing vital services, platforms, and infrastructure on GCP while always thinking about reliability, scalability, resilience, security, and performance. Job Responsibilities: Help build a Site...
Site Reliability Engineer

7 days ago

Bangalore district, India LTIMindtree Full time

Hi, we are looking for SRE/DevOps. Job Title – Site Reliability Engineer. Job Location – Bangalore. Please find the JD below: • Hiring Location: Bangalore • Can we hire any LTI Location: Bangalore location • Grade looking for: 8 Yrs to 13 Yrs • Notice Period details: Immediate Joiner • Will there be a Client Round: No • Mandatory Technical...
Site reliability engineer

2 weeks ago

Bangalore, India WhiteLotus Talent Partners Full time

We are looking for L0 and L1 Site Reliability Engineer (SRE) Support to join our Krutrim Cloud Site Reliability operations team and ensure the smooth functioning of our cloud infrastructure powered by OpenStack and Kubernetes. In this role, you will focus on monitoring, basic troubleshooting, and incident response, helping to maintain high...
Senior Site Reliability Engineer

2 weeks ago

Bangalore district, India Allegion Full time

Allegion India is seeking a highly motivated Senior Site Reliability Engineer who will play a critical role in ensuring the reliability, scalability, and performance of our organization's systems and infrastructure, and who will work with a team of cross-functional product development engineers to design, implement, and maintain highly available and resilient...
