Site Reliability Engineer
2 weeks ago
About AION
AION is building a next-generation AI cloud platform, transforming the future of high-performance computing (HPC) through its decentralized AI cloud. Purpose-built for bare-metal performance, AION democratizes access to compute power for AI training, fine-tuning, inference, data labeling, and beyond.
By leveraging underutilized resources such as idle GPUs and data centers, AION provides a scalable, cost-effective, and sustainable solution tailored for developers, researchers, and enterprises. The platform's innovative Proof of Compute Contribution (PoCC) protocol rewards contributors based on performance, creating a transparent and efficient ecosystem.
Integrated with Tether (USD₮ & USD₮0) for stability and regulatory clarity, AION eliminates volatility, ensuring predictable costs and seamless transactions. With cutting-edge partnerships and a USD-backed economy, AION is pioneering the commoditization of high-performance compute, empowering global innovation and bridging the AI wealth gap.
Led by high-pedigree founders with previous exits, AION is well-funded by major VCs with strategic global partnerships. Headquartered in the US with global presence, the company is building its initial core team in India.
Who you are
You are a reliability-focused engineer with deep expertise in cloud-native systems and infrastructure automation. You thrive on building robust monitoring solutions and creating self-healing infrastructure. You understand the challenges of maintaining high availability across distributed systems and have experience implementing SRE best practices. You're passionate about creating production-ready environments that can scale efficiently and recover automatically from failures.
Technical Skills & Experience
- 3-8 years of experience in Site Reliability Engineering or DevOps (exceptional candidates with different experience profiles will be considered)
- A Tier 1 college education or previous work experience at FAANG or top startups is preferred but not required
- Cloud Platforms: Deep expertise with AWS, GCP, or Azure infrastructure services
- Kubernetes: Advanced knowledge of Kubernetes operations, cluster management, and troubleshooting
- Infrastructure as Code: Strong experience with Terraform, Pulumi, or similar IaC tools
- Observability: Expertise in implementing comprehensive monitoring using Prometheus, Grafana, and the ELK stack
- Service Mesh: Experience with Istio, Linkerd, or similar service mesh technologies
- Networking: Understanding of network architectures, DNS, load balancing, and security groups
- CI/CD: Knowledge of automated deployment pipelines and GitOps workflows
- Scripting: Proficiency in Bash, Python, or Go for automation scripts
- Container Technologies: Deep understanding of Docker, containerd, and OCI specifications
- Security: Knowledge of infrastructure security best practices and compliance requirements
- Incident Management: Experience with incident response, post-mortems, and developing SOP documentation
Responsibilities
- Design and implement comprehensive monitoring and alerting systems across all AION platforms.
- Develop automation for infrastructure provisioning, scaling, and recovery using Terraform and Kubernetes.
- Create and maintain runbooks and playbooks for handling common operational scenarios and incidents.
- Implement service mesh solutions for observability, traffic management, and security.
- Design and implement logging systems that provide visibility into complex distributed systems.
- Plan capacity and optimize resources across cloud environments.
- Implement CI/CD pipelines for reliable and consistent deployments across all environments.
- Design and build self-healing systems that automatically recover from common failure modes.
- Develop infrastructure for both the compute platform and data annotation services with consistent reliability practices.
- Design and implement disaster recovery strategies and testing procedures.
- Create and maintain production, staging, and development environments with appropriate isolation.
- Collaborate with security teams to implement infrastructure security best practices and compliance requirements.
Individuals in this role are expected to relocate to Bangalore, though exceptions can be made. We offer a hybrid working setup with 3 days in the office. Employees have the flexibility to work from anywhere for a few months each year.
Why Join Us
- Be part of a mission-driven team at the intersection of web3 and AI, tackling some of the most exciting challenges in the industry.
- Join the ground floor of an AI startup, with the opportunity to make a significant impact on the company and the industry.
- Collaborate with top-tier talent from the tech industry.
- Competitive salary and benefits package.
- Flexible work environment with opportunities for professional growth and development.
If you are a skilled and motivated Site Reliability Engineer with a passion for building reliable, scalable infrastructure for cutting-edge compute systems, we would love to hear from you.
-
Site Reliability Engineering
1 week ago
Bengaluru, Karnataka, India Thakral One Full time US$ 60,000 - US$ 1,20,000 per year
Company Description: Thakral One, headquartered in Singapore, is a technology consulting and services company with a strong presence across Asia. The company specializes in technology-driven consulting, custom solution development, data analytics, and leveraging cloud capabilities to deliver enhanced decision support and practical outcomes. Collaborating...
-
Site Reliability Engineering
5 days ago
Bengaluru, Karnataka, India Viraaj HR Solutions Private Limited Full time ₹ 12,00,000 - ₹ 36,00,000 per year
Site Reliability Engineer (SRE)
About The Opportunity: A fast-growing organization in the Enterprise Cloud Infrastructure & SaaS sector delivering highly available, mission-critical services to enterprise customers. We are hiring an on-site Site Reliability Engineer in India to own reliability, automation, and operational excellence across cloud-native...
-
Site Reliability Engineer
4 hours ago
Bengaluru, Karnataka, India super Full time ₹ 12,00,000 - ₹ 24,00,000 per year
Site Reliability Engineer (SRE) Level 3
Overview: A Site Reliability Engineer (SRE) Level 3 is a senior technical leadership role focused on designing, implementing, and maintaining large-scale, complex, and highly reliable systems. This role emphasizes a blend of software and systems engineering to ensure the availability, latency, performance, and capacity...
-
Site Reliability Engineer
2 days ago
Bengaluru, Karnataka, India Zetamicron Full time ₹ 12,00,000 - ₹ 36,00,000 per year
Job Title: Site Reliability Engineer (SRE)
About the Role: We are seeking a highly skilled and proactive Site Reliability Engineer (SRE) to ensure the stability, scalability, and reliability of our platform. The ideal candidate will have strong experience in managing production environments, automating operational processes, and enhancing system performance...
-
Site Reliability Engineer
2 weeks ago
Bengaluru, Karnataka, India Oracle Full time ₹ 12,00,000 - ₹ 36,00,000 per year
This posting is for Site Reliability Engineer in the Oracle Analytics Warehouse product development organization. Fully handled Cloud service that provides customers a turn-key enterprise warehouse on the cloud for Fusion Applications. The service is being built on a sophisticated technology stack demonstrating a brand-new data integration platform and the...
-
Site Reliability Engineer
3 days ago
Bengaluru, Karnataka, India Chevron Full time ₹ 20,00,000 - ₹ 25,00,000 per year
Total Number of Openings: 2
About the position: Come join our Subsurface Digital Platform where we are driving continuous innovations to improve reliability, scalability and sustainability of Chevron business via Chevron's Digital Transformation. We are seeking a T-shaped dynamic Senior Site Reliability Engineer to lead and provide end-to-end solution support...
-
Site Reliability Engineer
2 weeks ago
Bengaluru, Karnataka, India Infogrowth Full time ₹ 15,00,000 - ₹ 25,00,000 per year
Role: SRE Engineer (Site Reliability Engineer)
Location: Marathali Bangalore
Work Mode: Hybrid Mode (Weekly 3 days)
Exp: 6 – 10 Years
Required Candidate profile
Skills: Python, AWS (EC2, IAM, Lambda, API Gateway, SNS, SQS, etc.), GitHub Actions, Service Management, Incident Management etc. & CAPAs.
Share resume on or
-
Site Reliability Engineer
4 days ago
Bengaluru, Karnataka, India Empower Full time ₹ 12,00,000 - ₹ 36,00,000 per year
Our vision for the future is based on the idea that transforming financial lives starts by giving our people the freedom to transform their own. We have a flexible work environment, and fluid career paths. We not only encourage but celebrate internal mobility. We also recognize the importance of purpose, well-being, and work-life balance. Within Empower and...
-
Site Reliability Engineer
1 week ago
Bengaluru, Karnataka, India d416f97b-2589-437a-8e64-3348cfe4008b Full time ₹ 12,00,000 - ₹ 36,00,000 per year
Hiring Site Reliability Engineers
Exp: 2.5+ years [Excluding internship]
Location: Bangalore
Apply Here :
The engineer will work in the Reliability and Productivity Engineering team and is responsible for building industry standard large scale platforms to be utilised across FK that helps to significantly improve the reliability of systems and bring...
-
Site Reliability Engineer
4 days ago
Bengaluru, Karnataka, India Progress Full time ₹ 12,00,000 - ₹ 36,00,000 per year
We are Progress (Nasdaq: PRGS) - the trusted provider of software that enables our customers to develop, deploy and manage responsible, AI-powered applications and experience with agility and ease. We're proud to have a diverse, global team where we value the individual and enrich our culture by considering varied perspectives because we believe people power...