
Senior Site Reliability Engineer
4 days ago
Job Description
Join us as we pursue our ground-breaking vision to make machine data accessible, usable, and valuable to everyone. We are a company filled with people who are passionate about our product and seek to deliver the best experience for our customers. At Splunk, we are committed to our work, customers, having fun, and most significantly to each other's success.
The Splunk Observability Cloud provides full-fidelity monitoring and troubleshooting across infrastructure, applications, and user interfaces, in real time and at any scale, to help our customers keep their services reliable, innovate faster, and deliver great customer experiences. Site Reliability Engineers at Splunk are cloud-native systems engineers who use infrastructure-as-code, microservices, automation, and efficient design to build, operate, and scale our products.
Role
You will help us run one of the largest and most sophisticated cloud-scale, big data, and microservices platforms in the world. You will be responsible for enabling developers to operate highly available, scalable, and cost-efficient applications with low operational burden by managing and improving the reliability and resiliency of SRE-managed services and infrastructure. You thrive on automation, infrastructure-as-code, reliability engineering, and getting rid of tedious, manual tasks.
You will
- Develop new processes to make the team more efficient and effective.
- Collaborate with other team leaders to orchestrate large system changes.
- Design new services, tools, and monitoring to be implemented by the entire team.
- Analyze the tradeoffs of the proposed design and make recommendations based on these tradeoffs.
- Mentor new engineers to achieve more than they thought possible. You enjoy making other teams successful and are fulfilled through the success of others.
Work on reliability projects, including
- HA, Business Continuity Planning, disaster recovery, backup/restore, RTO, RPO
- Chaos engineering
- Application uptime and performance
- Capacity management & planning
- SLIs, SLOs, error budgets, and monitoring dashboards
- Deployment and operations of large-scale distributed data stores and streaming services
- Establishing design patterns for monitoring and benchmarking
- Establishing and documenting production run books and guidelines for developers
- Tooling, toil reduction, runbooks & automation to handle production environments
- Incident management and improving MTTD/MTTR for services
- Cloud cost optimization
Qualifications
Must-Have
- 8+ years of SRE experience managing large-scale cloud-native microservices platforms.
- 3+ years of strong hands-on experience deploying, managing, and monitoring large-scale Kubernetes clusters in the public cloud, specifically AWS or GCP.
- Experience with infrastructure automation and scripting using Python and/or Bash.
- Strong hands-on experience with monitoring tools such as Splunk, Prometheus, Grafana, and the ELK stack to build observability for large-scale microservices deployments.
- Experience with deployment, operations, and performance management of one or more large-scale clusters such as Cassandra, Kafka, Elasticsearch, MongoDB, ZooKeeper, or Redis.
- Experience leading large-scale technical initiatives across multiple teams.
- Excellent problem-solving, triaging, and debugging skills in large-scale distributed systems.
Preferred
- AWS Solutions Architect certification.
- Confluent Certified Administrator for Apache Kafka and/or Apache Cassandra Administrator Associate certification.
- Experience with Infrastructure-as-Code using Terraform, CloudFormation, Google Deployment Manager, Pulumi, Packer, ARM, etc.
- Experience with CI/CD frameworks and Pipeline-as-Code such as Jenkins, Spinnaker, GitLab, Argo, Artifactory, etc.
- Experience with one or more security/compliance frameworks such as SOC2, PCI, and/or FedRAMP.
- Proven skills to effectively work across teams and functions to influence the design, operations, and deployment of highly available software.
The team operates on a 7-day coverage model, and as a result, our support engineers are occasionally asked to work on weekends.
Bachelor's/Master's in Computer Science, Engineering, or related technical field, or equivalent practical experience.
Splunk, a Cisco company, is an Equal Opportunity Employer and all qualified applicants will receive consideration for employment without regard to race, color, religion, gender, sexual orientation, national origin, genetic information, age, disability, veteran status, or any other legally protected basis.