Staff Site Reliability Engineer
2 weeks ago
Visa is a world leader in payments and technology, with over 259 billion payments transactions flowing safely between consumers, merchants, financial institutions, and government entities in more than 200 countries and territories each year. Our mission is to connect the world through the most innovative, convenient, reliable, and secure payments network, enabling individuals, businesses, and economies to thrive while driven by a common purpose – to uplift everyone, everywhere by being the best way to pay and be paid.
Make an impact with a purpose-driven industry leader. Join us today and experience Life at Visa.
Job Description
- Expert-level proficiency operating large-scale, distributed, mission-critical systems: designing for high availability, multi-region resiliency, low latency, and predictable performance under extreme load.
- SRE fundamentals at Staff level: defines and drives SLOs/SLIs, error budgets, availability targets, and capacity guardrails; codifies reliability requirements into design reviews and change-management gates.
- Deep hands-on experience with Kubernetes and container platforms: multi-cluster operations, workload placement, HPA/VPA, pod disruption budgets, network policies, admission control, service mesh (Istio/Linkerd), and progressive delivery (blue/green, canary, feature flags).
- Infra as Code and GitOps: Terraform (and/or Pulumi), Helm/Kustomize, Argo CD/Flux; builds reusable modules, policy-as-code (OPA/Conftest), environment drift detection, and automated remediation.
- Observability at scale: OpenTelemetry instrumentation/tracing, metrics (Prometheus), logging (ELK/OpenSearch), distributed tracing (Jaeger/Tempo/Zipkin), dashboards and SLO burn-rate alerts (Grafana); designs actionable alerts with runbook automation.
- Proven incident leadership: serves as Incident Commander for P0/P1 events, coordinates cross-functional response, stabilizes systems, restores service quickly, and drives blameless postmortems with measurable follow-through.
- Performance engineering and capacity planning: load and resilience testing, GC/heap and thread tuning (for JVM services), profiling (CPU, memory, IO), caching strategies, queue backpressure, and cost-aware capacity models.
- Strong systems and networking: Linux internals, filesystems, TCP/UDP, TLS/mTLS, HTTP/2/3, DNS, BGP/Anycast concepts, L4–L7 load balancing (Envoy/HAProxy/NGINX), CDN/edge (Cloudflare/Fastly/Akamai), WAF, and DDoS mitigation.
- Data/store reliability: operational experience with relational (PostgreSQL/MySQL/Oracle) and NoSQL (Cassandra/DynamoDB/MongoDB) databases, streaming platforms (Kafka/Pulsar/Kinesis), and distributed caches (Redis/Hazelcast); backup/restore, consistency models, compaction/retention tuning, and multi-AZ/region failover.
- Cloud and platform engineering: AWS/Azure/GCP core services, VPC design, IAM/RBAC, KMS, secrets management (Vault), service catalog, golden images/base containers, and paved-road platforms for developers.
- Release engineering and CI/CD: Jenkins/GitHub Actions/GitLab CI, artifact/signing/SBOM, canary analysis, automated rollbacks, deployment safety checks, and change failure rate/MTTR improvements.
- Reliability-by-design partnership: participates in and leads architecture/design reviews, threat modeling, and resilience patterns (bulkheads, circuit breakers, idempotency, retry/backoff, dead-letter handling).
- Disaster recovery and business continuity: RTO/RPO objectives, runbooks, game days/chaos experiments (Litmus/Gremlin), regional evacuation, and active-active/active-passive strategies.
- Security in depth for production systems: least privilege, workload identity, image and dependency scanning, supply-chain hardening (SLSA), SBOM, network segmentation/zero trust, and PCI-DSS-aligned operational controls.
- Strong programming and automation: production-grade Go and/or Python (plus Bash), contributing SRE tooling, controllers/operators, and APIs; code reviews, testing, and docs-as-code.
- Effective communicator and influencer: aligns reliability strategy with business outcomes, mentors engineers, challenges assumptions with data, and proposes pragmatic, incremental improvements.
- Experience leveraging GenAI/LLMs as copilots: accelerating runbook authoring, alert triage, knowledge retrieval, and post-incident synthesis with appropriate guardrails and data security.
- Nice to have: JVM and runtime tuning experience; traffic engineering at Internet scale; mobile edge/network reliability considerations.
This is a hybrid position. Expectation of days in office will be confirmed by your hiring manager.
Qualifications
Basic Qualifications
5+ years of relevant work experience with a Bachelor's Degree or at least 2 years of work experience with an Advanced degree (e.g. Masters, MBA, JD, MD) or 0 years of work experience with a PhD, OR 8+ years of relevant work experience.
Preferred Qualifications
Demonstrated ownership of SLOs/error budgets and production change risk management for tier-1 services.
Production experience with Kubernetes at scale, service mesh, and at least one major cloud provider (AWS/Azure/GCP).
Proficiency with Terraform and GitOps workflows; strong coding skills in Go and/or Python.
Hands-on experience with observability stacks (OpenTelemetry + Prometheus/Grafana + ELK/OpenSearch + one commercial APM/log platform).
Track record as Incident Commander and author of high-quality postmortems that drove systemic fixes.
Experience with streaming platforms (Kafka/Pulsar), distributed datastores (Cassandra/DynamoDB), and caching (Redis).
Familiarity with PCI-DSS or similarly stringent compliance environments.
Excellent communication, stakeholder management, and mentoring abilities.
Visa is an EEO Employer. Qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability or protected veteran status. Visa will also consider for employment qualified applicants with criminal histories in a manner consistent with EEOC guidelines and applicable local law.