
Observability/AIOps
4 days ago
Hyderabad, Telangana, India
IntraEdge
Full time
L2 - Observability/AIOps (5 to 8 yrs exp)
Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SRE ensures that internally critical and externally visible systems have reliability and uptime appropriate to users' needs and a fast rate of improvement, while keeping an ever-watchful eye on capacity and performance. SRE is a mindset and a set of engineering approaches focused on optimizing existing systems, building infrastructure, and eliminating work through automation. As a Site Reliability Engineer with a focus on observability, you will build and operate next-generation observability platforms.
As an SRE with an observability focus, you will:
● Explore the complex IT estates of our clients to understand their observability/AIOps opportunities and identify areas to improve
● Collaborate to architect unified observability and AIOps strategies which employ leading AI technology
● Implement enterprise observability/AIOps technology and processes
● Amplify observability/AIOps outcomes by accelerating adoption across technology and business organizations
Responsibilities include:
● Architecting observability solutions that close gaps and reduce organizational MTTD and MTTR
● Developing API-driven micro-services that combine into large and complex platforms
● Planning and executing highly parallel distributed object storage transformations and migrations
● Maintaining automated test suites using CI/CD tools
● Participating in collaborative projects with small software engineering teams
● Developing automation, processes, and tools designed to make our services simpler and more robust
● Participating in troubleshooting, capacity planning and analysis, and performance analysis
● Advising management on service onboarding strategies and execution
Critical Hiring Criteria
What we are looking for:
● Entrepreneurs who seek challenging problems to solve
● Creativity, initiative and acute attention to detail
● Thirst for innovation and solving problems at lightning speed
● Passion for automating everything repetitive
● Obsession with software scalability and performance under high loads
● Love for using and contributing to open-source software
Please bring to the table:
● Experience in architecting complex IT solutions
● Understanding of observability dimensions (metrics, logs, traces)
● Excellent communication and stakeholder management skills
● Development experience, comfortable working in multiple languages (Python, Java, Go and Ruby a plus)
● Experience working in collaborative coding environments (peer review, continuous integration, etc.)
● 7+ years of application development
● Experience working in distributed remote teams across multiple time zones
● Experience in large scale operations environments
● 7+ years of experience with Linux/Unix development or systems administration
● 3+ years of experience with networking systems and technologies
● Deep understanding of network performance and security
● Ability to identify tasks which require automation and implement required automation
● Experience with configuration management tools (Puppet, Chef, SaltStack)
● Hands-on operational experience in a high-volume or critical production service environment: distributed systems, capacity planning, continuous deployment
● BA/BS in Computer Science preferred, or equivalent experience (advanced degrees preferred)
We have opportunities to work with and learn:
● Object Storage - MinIO/S3/etc
● Data Collection - OpenTelemetry/Grafana Alloy/etc
● Message Bus - Kafka/NSQ/etc
● Scaling Databases - Druid/ClickHouse/Cassandra/etc
● Relational database technologies at large scale - Timescale/Vitess/Postgres/etc
● Scheduling & Orchestration - Kubernetes/OpenShift/Docker
● Cloud Platforms - AWS/Azure