
AI Cloud Infrastructure Engineer
3 days ago
Total Experience: 7+ years
Relevant Experience: 5+ years
Must have: at least 1 year of GPU experience
Notice Period: up to 30 days
JD:
We are seeking a skilled DevOps and AI Cloud Infrastructure Engineer to provision, deploy, manage, and optimize our GPU-based compute environment, ensuring high availability, performance, and security for compute-intensive workloads. The ideal candidate will have expertise in Linux system administration, cloud platforms, containerization, GPU hardware management, and cluster computing, with a focus on supporting AI/ML and high-performance computing (HPC) workloads. In this role, you will also provide technical support to investigate and resolve customer-reported issues related to the GPU-based compute environment. You will work closely with architects, AI engineers, and software developers to ensure seamless deployment, scalability, and reliability of our cloud-based AI/ML pipelines and GPU-based compute environments.
Key Responsibilities
- Infrastructure Management: Provision, deploy, and maintain scalable, secure, and high-availability cloud infrastructure on platforms such as DigitalOcean to support AI workloads.
- Documentation: Maintain clear documentation for infrastructure setups and processes.
- System Management: Administer and maintain Linux-based servers and clusters optimized for GPU compute workloads, ensuring high availability and performance.
- GPU Infrastructure: Configure, monitor, and troubleshoot GPU hardware (e.g., NVIDIA GPUs) and related software stacks (e.g., CUDA, cuDNN) for optimal performance in AI/ML and HPC applications.
- Troubleshooting: Diagnose and resolve hardware and software issues on GPU compute nodes, as well as performance issues in GPU clusters.
- High-Speed Interconnects: Implement and manage high-speed networking technologies like RDMA over Converged Ethernet (RoCE) to support low-latency, high-bandwidth communication for GPU workloads.
- Automation: Develop and maintain Infrastructure as Code (IaC) using tools such as Terraform and Ansible to automate the provisioning and management of resources.
- CI/CD Pipelines: Build and optimize continuous integration and deployment (CI/CD) pipelines for testing GPU-based servers and managing deployments using tools like GitHub Actions.
- Containerization & Orchestration: Build and manage LXC-based containerized environments to support cloud infrastructure and provisioning toolchains.
- Monitoring & Performance: Set up and maintain monitoring, logging, and alerting systems (e.g., Prometheus, VictoriaMetrics, Grafana) to track system performance, GPU utilization, resource bottlenecks, and uptime of GPU resources.
- Security and Compliance: Implement network security measures, including firewalls, VLANs, VPNs, and intrusion detection systems, to protect the GPU compute environment and comply with standards like SOC 2 or ISO 27001.
- Cluster Support: Collaborate with other engineers to ensure seamless integration of networking with cluster management tools such as Slurm or PBS Pro.
- Scalability: Optimize infrastructure for high-throughput AI workloads, including GPU and auto-scaling configurations.
- Collaboration: Work closely with architects and software engineers to streamline model deployment, optimize resource utilization, and troubleshoot infrastructure issues.
Required Qualifications
- Experience: 3+ years of experience in DevOps, Site Reliability Engineering (SRE), or cloud infrastructure management, with at least 1 year working on GPU-based compute environments in the cloud.
-
Cloud AI Infrastructure Architect
2 days ago
Bengaluru, Karnataka, India | Infosys | Full time | ₹ 12,00,000 - ₹ 36,00,000 per year
Cloud AI Infrastructure Architect
• Proven experience designing, implementing, and managing cloud solutions on major cloud platforms (e.g., AWS, Azure, GCP).
• Strong understanding of cloud computing concepts, architectures, and services (IaaS, PaaS, SaaS).
• Hands-on experience with cloud automation and infrastructure-as-code tools (e.g., Terraform,...
-
Cloud AI Infrastructure Architect
3 weeks ago
Bengaluru, India | Infosys | Full time
Key Responsibilities: As a Cloud AI Infra Architect, you should have a minimum of 10 years of experience in managing cloud enterprise infrastructure projects and driving automation through Gen AI. Drive the adoption and optimization of our cloud infrastructure and services. Design, implement, and evolve highly available, scalable, and secure multi-cloud...
-
Cloud Infrastructure
1 week ago
Bengaluru, Karnataka, India | Oracle | Full time | ₹ 1,50,00,000 - ₹ 2,50,00,000 per year
Oracle Cloud Infrastructure blends the speed of a startup with the scale of an enterprise leader. Our Generative AI Solutions team builds advanced AI solutions that run on powerful cloud infrastructure, tackling real-world, global challenges. As part of this team, you'll contribute to large-scale cloud solutions utilizing cutting-edge machine learning and...
-
AI Software Engineer
2 days ago
Bengaluru, India | Blue Cloud Softech Solutions Limited | Full time
Job Title: AI Software Engineer (Generative AI & Cloud-Native Systems)
Location: Hybrid (Bangalore)
Experience: 3–4 years in software development + 1–2 years in AI (agentic/generative)
About the Role
We’re building scalable, production-grade full-stack AI applications where AI capabilities (like generative models and agentic workflows) are deeply...