Principal Engineer, Site Reliability

3 days ago


New Delhi, India ANSR Full time

About T-Mobile

T-Mobile US, Inc. (NASDAQ: TMUS), headquartered in Bellevue, Washington, is America’s supercharged Un-carrier, connecting millions through its strong nationwide network and flagship brands, T-Mobile and Metro by T-Mobile. Customers benefit from an unmatched combination of value, quality, and exceptional service experience.

TMUS Global Solutions

TMUS Global Solutions is a world-class technology powerhouse accelerating the company’s global digital transformation. With a culture built on growth, inclusivity, and global collaboration, the teams here drive innovation at scale, powered by bold thinking.

TMUS India Private Limited operates as TMUS Global Solutions.

About the Role

As a Principal SRE, you will be a key member of the CFL Platform Engineering and Operations team. You will lead reliability engineering for AI-powered platforms supporting LLM applications, AI gateways, and enterprise-scale services across finance, credit, collections, and document systems. You will design and implement observability and incident response frameworks, scale high-performance infrastructure, and champion SRE best practices to support secure, automated, and resilient systems.

What You’ll Do

- Architect observability and incident response pipelines for LLM, API, and backend systems
- Define SLAs, SLIs, alerts, and dashboards for latency, throughput, and availability
- Lead high-severity incident response, root cause analysis, and system recovery
- Collaborate with AI, Platform, and Security teams to enforce operational guardrails
- Implement automation-first strategies using GitLab CI/CD, Terraform, and deployment tooling
- Guide infrastructure tuning, capacity planning, and cost optimization
- Drive monitoring across hybrid clouds using Prometheus, Grafana, Splunk, and OpenTelemetry
- Support AIOps, model observability, policy enforcement, and audit readiness
- Mentor senior SREs and foster a culture of high ownership and technical excellence

What You’ll Bring

- Bachelor’s or Master’s in Computer Science, Engineering, or a related field
- 7-12 years in SRE, infrastructure, or platform roles in distributed systems
- Strong experience in incident management, AI/ML observability, and performance engineering
- Hands-on expertise with OpenAI APIs, inference systems, AI gateways, and secure APIs
- Proficiency in Python, Java, Bash/PowerShell, and YAML
- Deep knowledge of CI/CD workflows, GitLab pipelines, and SDLC processes
- Experience with Kafka, HAProxy, RabbitMQ, Oracle DB, and MongoDB
- Proven success in scaling cloud-native platforms on Azure, AWS, GCP, or OCI
- Familiarity with AIOps, latency scoring, policy validation, and secure AI operations
- Background in compliance, governance, and enterprise risk management for AI systems
- Advanced debugging skills across data, infrastructure, networking, and app layers
- Leadership in chaos engineering, SLO-based operations, and system resilience

Must Have Skills

- Application & Microservice: Java, Spring Boot, API & Service Design
- Any CI/CD tools: GitLab Pipeline / Test Automation / GitHub Actions / Jenkins / CircleCI
- App Platform: Docker & Containers (Kubernetes)
- Any databases: SQL & NoSQL (Cassandra / Oracle / Snowflake / MongoDB)
- Any messaging: Kafka, RabbitMQ
- Any observability/monitoring: Splunk / Grafana / OpenTelemetry / ELK Stack / Datadog / New Relic / Prometheus
- Incident/Change/Problem Management

Nice To Have

- Compliance-aligned continuity planning (PCI, SOX)
- Error-budget pacts with product/org leadership
- Executive incident/change/problem/risk reporting
- Observability cost vs coverage trade-offs
- Org-wide reliability governance strategy



  • New Delhi, India Xebia Full time

    Performance & Reliability Engineer (Senior, Lead, Principal & Manager). Hybrid. Location: Pune, Chennai, Bangalore & Gurgaon. Need immediate joiners only. Job description: Role: Performance & Reliability Engineer. Job Location: Gurgaon, Chennai, Pune, Bangalore. Hybrid. Job Overview: We are seeking a highly skilled and motivated Performance & Reliability Engineer to...

  • Site Engineer

    1 week ago


    Delhi, Delhi, India Engineer Department Full time ₹ 6,00,000 - ₹ 12,00,000 per year

    Company Description: Engineer Department is a company. We are dedicated to providing efficient and effective engineering solutions for public infrastructure and services. Our team is committed to ensuring the highest standards in project management and execution, serving the community with integrity and professionalism. Role Description: This is a full-time...

  • Site Engineer

    3 weeks ago


    Delhi, India Engineer Department Full time

    Company Description: Engineer Department is a company. We are dedicated to providing efficient and effective engineering solutions for public infrastructure and services. Our team is committed to ensuring the highest standards in project management and execution, serving the community with integrity and professionalism. Role Description: This is a full-time...


  • New Delhi, India Tata Consultancy Services Full time

    Dear Candidates, Greetings from TCS! TCS is looking for a Senior Site Reliability Engineer – AWS. Experience: 8-12 years. Location: Chennai. Must have skills: - Design, implement, and maintain scalable, secure, and highly available infrastructure on AWS - Develop and improve CI/CD pipelines, Infrastructure as Code (IaC) using Terraform, Harness - Own and implement...


  • Delhi, India Elgebra Full time

    Hiring: Site Reliability Engineer – 7+ Years Location: Bangalore / Chennai Payroll: Elgebra Client: Qincline Joining: Immediate to 15 Days Role Overview: We are looking for an experienced Site Reliability Engineer (SRE) with over 6 years of expertise to join our team. The ideal candidate will have strong technical skills, a problem-solving mindset, and...


  • New Delhi, India ValueMomentum Full time

    About the Role: We are seeking an experienced Site Reliability / Azure DevOps Engineer with Dynatrace experience to join our engineering team and contribute to scalable CI/CD practices, infrastructure automation, and cloud operations. The ideal candidate will have deep expertise in Azure DevOps, Infrastructure as Code (IaC), Azure services, and modern DevOps...


  • Delhi, India Concord Full time

    SRE Sr. Engineers (Individual Contributors). Key Attributes: - Strong SRE (Site Reliability Engineering) experience - DevOps skills – CI/CD, monitoring, automation, infrastructure as code, etc. - Excellent troubleshooting and debugging skills (infrastructure + application level) - Perseverance – must push through complex/challenging issues without giving up -...


  • New Delhi, India ANSR Full time

    ANSR is hiring for one of its clients. About T-Mobile: T-Mobile US, Inc. (NASDAQ: TMUS), headquartered in Bellevue, Washington, is America’s supercharged Un-carrier, connecting millions through its strong nationwide network and flagship brands, T-Mobile and Metro by T-Mobile. Customers benefit from an unmatched combination of value, quality, and exceptional...


  • New Delhi, India iVedha Inc. Full time

    Senior Site Reliability Engineer (SRE) – ELK Expert | Platform Engineering Practice. Location: India (Remote) - Must be available to work in the EST (US/Canada) Time Zone. Role Summary: Are you a Senior Site Reliability Engineer (SRE) with deep ELK expertise, ready to take ownership of large-scale observability infrastructure? We're looking for an SRE with 7+...