Observability/AIOps

4 days ago


Hyderabad, Telangana, India | IntraEdge | Full time
L2 - Observability/AIOps (5 to 8 years of experience)

Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SRE ensures that internally critical and externally visible systems have reliability and uptime appropriate to users' needs and a fast rate of improvement, while keeping a watchful eye on capacity and performance. SRE is a mindset and a set of engineering approaches focused on optimizing existing systems, building infrastructure, and eliminating work through automation. As a Site Reliability Engineer with a focus on observability, you will build and operate next-generation observability platforms.

As an SRE with an observability focus, you will:

● Explore the complex IT estates of our clients to understand their observability/AIOps opportunities and identify areas to improve

● Collaborate to architect unified observability and AIOps strategies that employ leading AI technology

● Implement enterprise observability/AIOps technology and processes

● Amplify observability/AIOps outcomes by accelerating adoption across technology and business organizations

Responsibilities include:

● Architecting observability solutions that close gaps and reduce organizational MTTD and MTTR

● Developing API-driven micro-services that combine into large and complex platforms

● Planning and executing highly parallel distributed object storage transformations and migrations

● Maintaining automated test suites using CI/CD tools

● Participating in collaborative projects with small software engineering teams

● Developing automation, processes, and tools designed to make our services simpler and more robust

● Participating in troubleshooting, capacity planning and analysis, and performance analysis

● Advising management on service onboarding strategies and execution

Critical Hiring Criteria

What we are looking for:

● Entrepreneurs who seek challenging problems to solve

● Creativity, initiative and acute attention to detail

● Thirst for innovation and solving problems at lightning speed

● Passion for automating everything repetitive

● Obsession with software scalability and performance under high loads

● Love for using and contributing to open-source software

Please bring to the table:

● Experience in architecting complex IT solutions

● Understanding of observability dimensions (metrics, logs, traces)

● Excellent communication and stakeholder management skills

● Development experience and comfort working in multiple languages (Python, Java, Go, and Ruby a plus)

● Experience working in collaborative coding environments (peer review, continuous integration, etc.)

● 7+ years of application development

● Experience working in distributed remote teams across multiple time zones

● Experience in large scale operations environments

● 7+ years of experience with Linux/Unix development or systems administration

● 3+ years of experience with networking systems and technologies

● Deep understanding of network performance and security

● Ability to identify tasks that require automation and to implement that automation

● Experience with configuration management tools (Puppet, Chef, SaltStack)

● Hands-on operational experience in a high-volume or critical production service environment: distributed systems, capacity planning, continuous deployment

● BA/BS in Computer Science preferred, or equivalent experience (advanced degrees preferred)

We have opportunities to work with and learn:

● Object Storage - MinIO/S3/etc.

● Data Collection - OpenTelemetry/Grafana Alloy/etc.

● Message Bus - Kafka/NSQ/etc.

● Scaling Databases - Druid/ClickHouse/Cassandra/etc.

● Relational database technologies at large scale - Timescale/Vitess/Postgres/etc.

● Scheduling & Orchestration - Kubernetes/OpenShift/Docker

● Cloud Platforms - AWS/Azure