Datacenter Observability and Site Reliability Engineer

6 days ago


India | Tekgence Inc | Full time

Location: Remote, India

Contract Duration: 6+ months

Working Hours: 5:30 AM to 2:30 PM IST

Roles and Responsibilities:

Observability and Monitoring:

  • Design, implement, and maintain observability solutions for datacenter infrastructure.
  • Develop, deploy, and maintain the operational and reliability components of a large-scale Observability and Telemetry collection platform, emphasizing performance at scale, real-time monitoring, logging, and alerting.
  • Participate in and enhance the entire lifecycle of services, from inception and design to deployment, operation, and refinement.
  • Develop and optimize monitoring systems to ensure high availability and performance.
  • Create and manage dashboards, alerts, and reports to provide visibility into system health and performance.

Site Reliability Engineering (SRE):

  • Implement SRE best practices to improve the reliability, scalability, and performance of datacenter services.
  • Develop and maintain automation scripts for infrastructure provisioning, monitoring, and management.
  • Conduct root cause analysis and post-mortem reviews to prevent recurrence of incidents.

Performance Optimization:

  • Analyze and optimize the performance of datacenter systems and applications.
  • Implement best practices for resource utilization and efficiency.

Collaboration:

  • Work closely with other engineering teams to understand and meet their observability and reliability requirements.
  • Collaborate with hardware and software vendors to evaluate and integrate new technologies.

Security and Compliance:

  • Ensure that observability and reliability solutions comply with security policies and industry standards.
  • Implement and maintain security measures to protect data and infrastructure.

Troubleshooting and Support:

  • Provide support for observability and reliability-related issues, including debugging and resolving hardware and software problems.
  • Develop and maintain documentation for troubleshooting procedures and best practices.

Continuous Improvement:

  • Stay updated with the latest advancements in observability and SRE technologies and integrate them into the infrastructure.
  • Continuously improve the reliability, scalability, and performance of datacenter services.

Qualifications:

Education:

  • Bachelor's or Master's degree in Computer Science, Engineering, or a related field.

Experience:

  • 8+ years of experience in datacenter observability and site reliability engineering.
  • Proven experience in managing and optimizing large-scale datacenter environments.

Technical Skills:

  • Proficiency in observability tools and technologies (e.g., Prometheus, Grafana, ELK Stack).
  • Experience with SRE practices and tools (e.g., Kubernetes, Docker, Terraform).
  • Strong programming and scripting skills (e.g., Python, Go, Bash).
  • Familiarity with cloud platforms (AWS, Azure, GCP) and their observability and reliability services.

Soft Skills:

  • Strong problem-solving skills and attention to detail.
  • Excellent communication and collaboration skills.
  • Ability to work in a fast-paced, dynamic environment.

