Sr Engineer, Site Reliability

3 days ago


Hyderabad district, India TMUS Global Solutions Full time

About T-Mobile: T-Mobile US, Inc. (NASDAQ: TMUS), headquartered in Bellevue, Washington, is America’s supercharged Un-carrier, connecting millions through its strong nationwide network and flagship brands, T-Mobile and Metro by T-Mobile. Customers benefit from an unmatched combination of value, quality, and exceptional service experience.

TMUS Global Solutions: TMUS Global Solutions is a world-class technology powerhouse accelerating the company’s global digital transformation. With a culture built on growth, inclusivity, and global collaboration, the teams here drive innovation at scale, powered by bold thinking. TMUS India Private Limited operates as TMUS Global Solutions.

About the Role: As a Senior Site Reliability Engineer, you will be a key member of the CFL Platform Engineering and Operations team. You will play a pivotal role in building and scaling intelligent infrastructure to support AI/ML applications, enterprise services, and LLM-based platforms. You will contribute to the design and implementation of observability frameworks, automation-first operations, and incident response strategies to ensure reliability, performance, and scalability across production systems.

What You’ll Do:
  • Implement and maintain observability, monitoring, and alerting systems for AI platforms and backend services
  • Design and support telemetry pipelines, logging infrastructure, and dashboards (Splunk, Prometheus, Grafana, OpenTelemetry)
  • Define and monitor SLOs, SLIs, latency, availability, and throughput metrics
  • Participate in on-call rotations, incident resolution, root cause analysis, and postmortems
  • Improve CI/CD workflows and infrastructure automation using GitLab pipelines
  • Optimize and scale infrastructure including Kafka, RMQ, HAProxy, and distributed APIs
  • Collaborate with engineering teams on governance, compliance, and secure operations
  • Support capacity planning, cost analysis, and tuning for high-scale performance
  • Automate repetitive tasks and reduce toil via scripting (Python, Bash, Java)
  • Contribute to runbooks, knowledge base articles, and SRE best practice documentation
  • Mentor junior engineers and support a culture of operational excellence and reliability

What You’ll Bring:
  • Bachelor’s degree in Computer Science, Engineering, or a related technical field
  • 4-7 years in SRE, DevOps, platform, or operations engineering roles
  • Strong hands-on experience in observability, monitoring, and distributed systems troubleshooting
  • Proficiency in scripting languages such as Python, Bash, or PowerShell
  • CI/CD experience with GitLab and automation across deployment pipelines
  • Solid understanding of SQL and NoSQL systems including Oracle DB and MongoDB
  • Familiarity with Kubernetes, container orchestration, and hybrid cloud (Azure, AWS, GCP, OCI)
  • Experience working in high-stakes, incident-driven environments
  • Strong working knowledge of Splunk, Grafana, Prometheus, and other observability tools
  • Understanding of AI/ML systems, inference APIs, and LLM infrastructure is a plus
  • Experience in platform compliance, security enforcement, and regulated domains (finance preferred)

Must Have Skills:
  • Application & Microservice: Java, Spring Boot, API & Service Design
  • Any CI/CD Tools: GitLab Pipeline / Test Automation / GitHub Actions / Jenkins / CircleCI
  • App Platform: Docker & Containers (Kubernetes)
  • Any Databases: SQL & NoSQL (Cassandra / Oracle / Snowflake / MongoDB)
  • Any Messaging: Kafka, RabbitMQ
  • Any Observability/Monitoring: Splunk / Grafana / OpenTelemetry / ELK Stack / Datadog / New Relic / Prometheus
  • Incident/Change/Problem Management

Nice To Have:
  • Multi-region failover (SQL Server, MongoDB, vendors)
  • Observability platform design (sampling, retention policies)
  • Own domain SLOs and error budgets
  • Perf engineering for latency-sensitive apps
  • Toil automation (SRE bots, operators



  • Hyderabad district, India Sonata Software Full time

    Role: Site Reliability Engineer Location: Hyderabad Notice Period: Immediate to 20 Days Employment Type: Full Time Experience 7–12 years in site reliability, cloud-based data infrastructure, data pipeline observability, automation, and high-availability engineering within EdTech platforms (2U) Primary Skills (Must-Have) AWS, CI/CD, Jenkins, IAAC,...

  • DevOps Engineer

    3 weeks ago


    Hyderabad, India Axceltran digital private limited Full time

    Description: Qualifications: - Proven experience as a Site Reliability Engineer, Sr DevOps Engineer, or similar role. - 5 to 7 years of relevant experience, at least 2 years of experience in Microsoft Azure. Good to have AWS and GCP. - Experience in setting up and managing OTEL, using Loki, Tempo, Prometheus, Grafana, Alloy etc. - Experience in creating CI/CD...


  • Hyderabad district, India TMUS Global Solutions Full time

    About T-Mobile: T-Mobile US, Inc. (NASDAQ: TMUS), headquartered in Bellevue, Washington, is America’s supercharged Un-carrier, connecting millions through its strong nationwide network and flagship brands, T-Mobile and Metro by T-Mobile. Customers benefit from an unmatched combination of value, quality, and exceptional service experience. About TMUS Global...


  • Hyderabad district, India Atyeti Inc Full time

    Job Description : We are seeking a highly skilled and motivated Site Reliability Engineer (SRE) to join our growing team. Bachelor’s degree in computer science, Engineering, or equivalent practical experience. 7+ years’ experience in Site Reliability deploying and managing large-scale distributed systems successfully. Understanding of SRE concepts (error...


  • Bangalore district, India IntraEdge Full time

    Job Title: Site Reliability Engineer (SRE) – Production Support Location: Bengaluru Job Summary: We are looking for a skilled Site Reliability Engineer (SRE) with strong experience in production support, DevOps practices, and cloud infrastructure management. The ideal candidate will be responsible for maintaining the reliability, performance, and...


  • Hyderabad district, India HTC Global Services Full time

    HTC – A brief profile Established in 1990, HTC Inc., a company with headquarters in Troy, Michigan, is a leading global Information Technology solution and BPO provider. HTC assists clients across multiple industry verticals, offering turnkey project lifecycle in, e-business, data warehousing, embedded systems, ECM, SCM, CRM, and ERP solutions. HTC Inc....


  • Hyderabad, Telangana, India Oracle Financial Services Software Ltd Full time ₹ 12,00,000 - ₹ 36,00,000 per year

    Principal Site Reliability Engineer Oracle is seeking a motivated Principal Site Reliability Engineer who thrives in a fast-paced, rapidly evolving technology environment. This position requires wide and overall knowledge in Linux administration, AI technologies, software development, cloud computing, networking, cloud security, performance analysis and...


  • Hyderabad district, India GSPANN Technologies, Inc Full time

    About Company : Headquartered in California, U.S.A., GSPANN provides consulting and IT services to global clients. We help clients transform how they deliver business value by helping them optimize their IT capabilities, practices, and operations with our experience in retail, high-technology, and manufacturing. With five global delivery centers and 2000+...


  • Hyderabad, India Insight Global, LLC Full time

    Job Title: Sr. SRE About the Company: Insight Global's Client Type: Ongoing EOR, depending on experience level Location: ONSITE 4X/WEEK in HITEC City, Hyderabad, IN Priority scheduling for candidates who: - Submit resume promptly - Are available for immediate interviews - Connect via LinkedIn with resume and CTC rate Requirements: - Ability to be onsite...

