Senior Site Reliability Engineer
Listed on 2026-03-01
IT/Tech
Systems Engineer, SRE/Site Reliability, Cloud Computing
At Branch, we power every touchpoint with links that work and insights that prove it. From click to conversion, we make growth measurable. Our unparalleled attribution, backed by AI-enhanced linking, is trusted to deliver seamless experiences that increase ROI, decrease wasted spend, and eliminate siloed attribution.
We bring the same rigor to how we build our team, by empowering our people to move fast, own outcomes, and build something that matters. We take pride in making meaningful investments in our team’s health, wealth, and growth so individuals can thrive as we scale. Our culture values smart, humble, and collaborative teammates who take accountability and drive results in an environment where their work truly moves the business forward.
We are innovative, scaling with purpose, and led by seasoned leaders who know how to build enduring companies. Trusted by brands like Instacart, Western Union, NBCUniversal, Zocdoc, and Sephora, we’re big enough to matter, small enough for you to make a real impact. If you’re excited by the grit of building, rapid learning, and shaping the future of customer growth, you’ll find your place here.
We are seeking a highly experienced Senior Site Reliability Engineer to own the reliability, performance, and operational excellence of our large-scale, distributed infrastructure. You will lead the design and execution of systems that power mission-critical services, shaping engineering practices, influencing architectural decisions, and driving automation and resiliency across the organization.
As a Senior Site Reliability Engineer, you’ll get to:
- Architect, design, and evolve complex distributed systems to improve reliability, operational efficiency, and performance at scale.
- Partner closely with product, security, and data engineering teams to translate business needs into resilient and scalable system designs.
- Drive reliability through automation and advanced observability, ensuring proactive detection, reduced mean time to recovery, and consistent system hygiene.
- Lead and mentor in high-stakes situations, owning debugging efforts for critical issues and establishing durable prevention strategies.
- Perform deep infrastructure cost audits, identifying areas of inefficiency and implementing solutions that reduce waste without compromising performance or security.
- Own and maintain key distributed data platforms, including Aerospike and FoundationDB, ensuring durability, consistency, and performance.
- Guide teams in defining SLIs/SLOs and operational best practices, elevating system reliability and engineering rigor across the org.
- Continuously identify and eliminate bottlenecks, improving system throughput, latency, and overall efficiency.
- Champion Infrastructure as Code (IaC) to automate provisioning, configuration, and lifecycle management using modern IaC tools and principles.
- Lead our GitOps and deployment strategy using Argo CD to implement secure, repeatable, and scalable delivery workflows across Kubernetes environments.
- 6+ years in SRE, systems engineering, or software engineering roles, ideally within fast-paced, rapidly scaling environments.
- Proven track record as a senior reliability or production engineer, with ownership of large, distributed, customer-facing systems.
- Expert-level proficiency in Kubernetes, AWS, Linux internals, and distributed system fundamentals.
- Strong programming skills in Go, Python, Java, Kotlin, Bash, or similar languages, with an emphasis on building reliable automation and tooling.
- Hands‑on experience with modern observability stacks (Prometheus, Grafana, Alertmanager, Loki, PagerDuty).
- Familiarity with large-scale data and streaming ecosystems such as Kafka, Spark, Aerospike, FoundationDB, and the broader Hadoop ecosystem.
- Deep experience with Terraform, CloudFormation, or related IaC tooling, and the ability to guide teams in IaC best practices.
- Proven incident management leadership in production SaaS systems, including on‑call excellence, post‑mortem execution, and long‑term reliability improvements.
- Exceptional problem‑solving skills and the ability to lead complex investigations across multiple system…