Site Reliability Engineer

2 days ago


Bangalore, India AION Full time

About AION

AION is building a next-generation AI cloud platform, transforming the future of high-performance computing (HPC) through its decentralized AI cloud. Purpose-built for bare-metal performance, AION democratizes access to compute power for AI training, fine-tuning, inference, data labeling, and beyond.

By leveraging underutilized resources such as idle GPUs and data centers, AION provides a scalable, cost-effective, and sustainable solution tailored for developers, researchers, and enterprises. The platform's innovative Proof of Compute Contribution (PoCC) protocol rewards contributors based on performance, creating a transparent and efficient ecosystem.

Integrated with Tether (USD₮ & USD₮0) for stability and regulatory clarity, AION eliminates volatility, ensuring predictable costs and seamless transactions. With cutting-edge partnerships and a USD-backed economy, AION is pioneering the commoditization of high-performance compute, empowering global innovation and bridging the AI wealth gap.

Led by high-pedigree founders with previous exits, AION is well-funded by major VCs with strategic global partnerships. Headquartered in the US with a global presence, the company is building its initial core team in India.

Who you are

You are a reliability-focused engineer with deep expertise in cloud-native systems and infrastructure automation. You thrive on building robust monitoring solutions and creating self-healing infrastructure. You understand the challenges of maintaining high availability across distributed systems and have experience implementing SRE best practices. You're passionate about creating production-ready environments that can scale efficiently and recover automatically from failures.

Technical Skills & Experience

  • 3-8 years of experience in Site Reliability Engineering or DevOps (exceptional candidates with different experience profiles will be considered)
  • A Tier 1 college education or previous work experience at FAANG/top startups is preferred but not required
  • Cloud Platforms: Deep expertise with AWS, GCP, or Azure infrastructure services
  • Kubernetes: Advanced knowledge of Kubernetes operations, cluster management, and troubleshooting
  • Infrastructure as Code: Strong experience with Terraform, Pulumi, or similar IaC tools
  • Observability: Expertise implementing comprehensive monitoring using Prometheus, Grafana, and the ELK stack
  • Service Mesh: Experience with Istio, Linkerd, or similar service mesh technologies
  • Networking: Understanding of network architectures, DNS, load balancing, and security groups
  • CI/CD: Knowledge of automated deployment pipelines and GitOps workflows
  • Scripting: Proficiency in Bash, Python, or Go for automation scripts
  • Container Technologies: Deep understanding of Docker, containerd, and OCI specifications
  • Security: Knowledge of infrastructure security best practices and compliance requirements
  • Incident Management: Experience with incident response, post-mortems, and developing SOP documentation

Key Responsibilities

  • Design and implement comprehensive monitoring and alerting systems across all AION platforms.
  • Develop automation for infrastructure provisioning, scaling, and recovery using Terraform and Kubernetes.
  • Create and maintain runbooks and playbooks for handling common operational scenarios and incidents.
  • Implement service mesh solutions for observability, traffic management, and security.
  • Design and implement logging systems that provide visibility into complex distributed systems.
  • Own capacity planning and resource optimization across cloud environments.
  • Implement CI/CD pipelines for reliable and consistent deployments across all environments.
  • Design and build self-healing systems that automatically recover from common failure modes.
  • Develop infrastructure for both the compute platform and data annotation services with consistent reliability practices.
  • Design and implement disaster recovery strategies and testing procedures.
  • Create and maintain production, staging, and development environments with appropriate isolation.
  • Collaborate with security teams to implement infrastructure security best practices and compliance requirements.

Location

Individuals in this role are expected to relocate to Bangalore, though exceptions can be made. We offer a hybrid working setup with 3 days in the office. Employees have the flexibility to work from anywhere for a few months each year.

Why Join Us

  • Be part of a mission-driven team at the intersection of web3 and AI, tackling some of the most exciting challenges in the industry.
  • Join the ground floor of an AI startup, with the opportunity to make a significant impact on the company and the industry.
  • Collaborate with top-tier talent from the tech industry.
  • Competitive salary and benefits package.
  • Flexible work environment with opportunities for professional growth and development.

If you are a skilled and motivated Site Reliability Engineer with a passion for building reliable, scalable infrastructure for cutting-edge compute systems, we would love to hear from you.



  • Bangalore, India ViewSonic Full time

    Job Requirements: Bachelor's degree in Computer Science, Engineering, or a related field. 3+ years of experience in a relevant role, such as Site Reliability Engineer, DevOps Engineer, or similar, is preferred but not mandatory. Basic understanding of AWS solutions including EC2, S3, CloudWatch, Lambda, and RDS. Interest and understanding of Platform Engineering...


  • Bangalore, India Aqilea (formerly Soltia) Full time

    We are a consulting company with a bunch of technology-interested and happy people! We love technology, we love design and we love quality. Our diversity makes us unique and creates an inclusive and welcoming workplace where each individual is highly valued. With us, each individual is her/himself and respects others for who they are and we believe that when a...


  • Bangalore, India WhiteLotus Talent Partners Full time

    We are looking for an L0 and L1 Site Reliability Engineer (SRE) Support to join our Krutrim Cloud Site Reliability operations team and ensure the smooth functioning of our cloud infrastructure powered by OpenStack and Kubernetes. In this role, you will focus on monitoring, basic troubleshooting, and incident response, helping to maintain high system...


  • Bangalore, India Progress Full time

    We are Progress (Nasdaq: PRGS) - the trusted provider of software that enables our customers to develop, deploy and manage responsible, AI-powered applications and experience with agility and ease. We're proud to have a diverse, global team where we value the individual and enrich our culture by considering varied perspectives because we believe people power...


  • Bangalore, India JRD Systems Full time

    Site Reliability Engineer (Windows / Cloud / Automation) Job Summary: We are seeking an experienced Site Reliability Engineer with a strong background in managing Windows infrastructure and cloud environments. The ideal candidate will be responsible for designing, implementing, automating, and maintaining scalable infrastructure solutions across AWS, Azure,...


  • Bangalore, India Cyberhaven Full time

    About the role: We're looking for an experienced Site Reliability Engineer to make sure systems are reliable, scalable, and performing well, especially in production environments. Our technology is new and rapidly evolving; as an early member of the team, you'll play a key role in shaping the reliability architecture, building scalable infrastructure, and...


  • Bangalore, India Tata Consultancy Services Full time

    Role: Manager, Site Reliability Engineering. Required Technical Skill Set: Manager, Site Reliability Engineering. Desired Experience Range: 12 - 18 yrs. Notice Period: Immediate to 90 days only. Location of Requirement: Bangalore. We are currently planning to do a Virtual Interview. Job Description: Describe what the person will do in the role - how he/she will impact...

