Spydra - Site Reliability Engineer - DevOps

1 week ago


Hyderabad, India Spydra Full time

We are seeking a highly skilled and self-driven Site Reliability Engineer to join our dynamic team.

This role is ideal for someone with a strong foundation in Kubernetes, DevOps, and observability who can also support machine learning infrastructure, GPU optimization, and Big Data ecosystems.

You will play a pivotal role in ensuring the reliability, scalability, and performance of our production systems, while also enabling innovation across ML and data teams.


Key Responsibilities :


Automation & Reliability :


- Design, build, and maintain Kubernetes clusters across hybrid or cloud environments (e.g., EKS, GKE, AKS).

- Implement and optimize CI/CD pipelines using tools like Jenkins, ArgoCD, and GitHub Actions.

- Develop and maintain Infrastructure as Code (IaC) using tools such as Ansible and Terraform.


Observability :


- Deploy and maintain monitoring, logging, and tracing tools (e.g., Thanos, Prometheus, Grafana, Loki, Jaeger).

- Establish proactive alerting and observability practices to identify and address issues before they impact users.


ML Ops & GPU Optimization :


- Support and scale ML workflows using tools like Kubeflow, MLflow, and TensorFlow Serving.

- Work with data scientists to ensure efficient use of GPU resources, optimizing training and inference.


Incident Management :


- Lead root cause analysis for infrastructure and application-level incidents.

- Participate in the on-call rotation and improve incident response.


Automation :


- Automate operational tasks and service deployment using Python, Shell, Groovy, or Ansible.

- Write reusable scripts and tools to improve team productivity and reduce manual effort.


Learning & Collaboration :


- Stay up-to-date with emerging technologies in SRE, ML Ops, and observability.

- Collaborate with cross-functional teams, including engineering, data science, and security, to ensure system integrity.


Requirements :


- 3+ years of experience as an SRE, DevOps Engineer, or equivalent role.

- Strong experience with the Kubernetes ecosystem and container orchestration.

- Proficiency in DevOps tooling including Jenkins, ArgoCD, and GitOps workflows.

- Deep understanding of observability tools, with hands-on experience using the Thanos and Prometheus stack.

- Experience with ML platforms (MLflow, Kubeflow) and supporting GPU workloads.

- Strong scripting skills in Python, Shell, or Ansible.


Preferred Qualifications :


- CKS (Certified Kubernetes Security Specialist) certification.

- Exposure to Big Data platforms (e.g., Spark, Kafka, Hadoop).

- Experience with cloud-native environments (AWS, GCP, or Azure).

- Background in infrastructure security and compliance.


(ref:hirist.tech)
