Senior Site Reliability Engineer
Job in Los Angeles, Los Angeles County, California, 90009, USA
Listed on 2026-03-04
Listing for: BuildOps
Full Time position
Job specializations:
- IT/Tech: SRE/Site Reliability, Cloud Computing
Job Description & How to Apply Below
We're looking for someone with a strong SRE mindset, solid software engineering fundamentals, and deep observability expertise who can work effectively in a distributed team environment. Reporting to the DevOps and SRE Manager, this is a hands-on role where you will influence reliability strategy, build tooling and automation, and contribute directly to day-to-day operations in a fast-moving, industry-defining company.
What You'll Do
* Drive and refine modern SRE practices across services, including SLIs/SLOs, error budgets, and reliability reviews
* Design and maintain end-to-end observability (metrics, logs, traces, dashboards, and alerts) so teams can quickly detect, debug, and prevent issues
* Partner with product and engineering teams to design reliable services, reviewing architectures, failure modes, rollout strategies, and capacity/latency considerations
* Help evolve and operate our AWS infrastructure (networking, compute, data stores) using Infrastructure as Code (Terraform)
* Contribute code to services, tooling, and automation (for example, reliability libraries, deployment and incident tooling, health checks)
* Define, implement, and iterate on SLIs, SLOs, and error budgets with service owners, and use them to guide reliability work and release decisions
* Participate in incident response for infrastructure-related production issues, including learning-focused post-incident reviews and follow-through on action items
* Develop runbooks, safeguards, and automation that reduce manual work, improve time-to-diagnosis, and standardize responses to recurring scenarios
* Advocate for and implement security and compliance best practices in production environments
* Document standards, playbooks, and best practices so reliability improvements scale across teams
* Collaborate closely with software engineers, product managers, and other stakeholders to plan and deliver reliability-focused initiatives
What We Look For
* 5+ years of professional experience in Site Reliability Engineering, DevOps, Infrastructure Engineering, or production-focused Software Engineering, working on production systems and reliability-focused initiatives
* Proven experience leading multi-sprint, multi-engineer projects (for example, reliability, performance, or infrastructure initiatives) to successful completion with clear business impact
* Thorough understanding of, and hands-on experience with, modern SRE practices, such as:
  * Reducing toil through automation
  * Safe deployment and rollout patterns
  * Structured post-incident reviews and continuous improvement
* Software engineering experience: you've written and maintained production-quality code and can work comfortably in at least one modern language (for example, Python or Node.js/TypeScript)
* Strong interest in, and experience with, using LLMs and AI-assisted tooling in your workflow, including the ability to validate and improve what they generate
* Strong observability skills, including:
  * Designing metrics, logging, and tracing for multi-service systems
  * Building actionable dashboards and alerts with clear runbooks
  * Correlating metrics, logs, and traces to debug complex issues
* Experience with tools such as Datadog, Prometheus, Grafana, Honeycomb, or New Relic (we use Datadog, but vendor-agnostic experience is welcome)
* Experience working with AWS in production and with core platform primitives such as Terraform-based Infrastructure as Code and container/orchestration platforms (for example, Docker with ECS, EKS, or Kubernetes)…
Position Requirements
10+ Years work experience