Senior/Staff Site Reliability Engineer Job, San Francisco area, California, USA, IT/Tech

Position: Senior/Staff Site Reliability Engineer

Mochi Health’s mission is to be the discovery layer of healthcare. We are building a platform that makes it easier for patients to find the right providers, access the right medications, and take control of their health with transparency and trust.

Over the past few years, we have experienced rapid growth by combining operational excellence, clinical expertise, and innovative technology to deliver care that is more human, intuitive, and effective. From pharmacy pricing transparency and personalized medication management, to long-term medical record access and community-based chronic illness support, Mochi is creating a new model of care that empowers patients, providers, and pharmacies alike.

We believe the future of healthcare is personal, and we are building the technology to power it. At Mochi Health, you will join a team that values inclusivity, collaboration, and bold thinking, and you will have the opportunity to do the most meaningful work of your career.

$230,000 - $280,000 Full-time / Onsite (5 days/week)

About the Role

We’re looking for a Senior/Staff Site Reliability Engineer to build Mochi’s AI‑driven APM and incident management system: one that doesn’t just alert and page, but learns. This is a foundational role at the intersection of SRE, platform engineering, and applied AI: you’ll design the feedback loops (human‑in‑the‑loop / RLHF‑style), guardrails, and automation that let our reliability posture improve over time.

You’ll own the systems and workflows that turn incidents into intelligence: automated triage, root cause analysis, remediation, and bug‑fix proposals (PRs, test runs, staged rollouts) when issues are code‑level.

If you’re excited by the idea of building a self‑improving SRE “copilot”, this job is for you.

What You’ll Do

Build an AI‑driven SRE platform that ingests telemetry (logs/metrics/traces), deploy events, and incident artifacts to detect anomalies, summarize failures, and propose mitigations.

Design a human‑in‑the‑loop learning loop (RLHF‑style) so the system gets better with every incident: capturing decisions, outcomes, and postmortems into training/evaluation data.

Create safe auto‑remediation capabilities (runbook execution, automated rollbacks, feature‑flag actions) with strong guardrails, auditability, and progressive rollout controls.

Build tooling that can propose bug fixes: generate well‑scoped PRs, run tests, and support canary releases, with clear handoff and approval flows.

Define and operationalize SLOs/SLIs and error budgets for critical user journeys (patient onboarding, provider workflows, pharmacy fulfillment, billing, etc.).

Level up observability end‑to‑end: alert quality, dashboarding, tracing standards, and “unknown unknown” detection.

Lead incident response excellence: on‑call improvements, incident command, blameless postmortems, and driving systemic fixes that reduce repeat failures.

Partner with product + engineering teams to reduce toil and improve reliability via better architecture, load testing, resilience testing, and capacity planning.

Establish reliability standards and patterns across the org (golden signals, deployment safety, dependency management, fault isolation).

Who You Are

7+ years in SRE / platform / infrastructure engineering, with a track record of owning production reliability at scale.

Deep experience operating systems in the cloud (AWS preferred), including networking, autoscaling, rollout strategies, and incident mitigation.

Strong software engineering ability: you can debug production issues across services, understand failure modes, and contribute code when needed (Python, Go, and TypeScript are all great).

Expert‑level grasp of observability and incident response: metrics, logs, tracing, alerting design, and postmortem‑driven improvements.

Comfortable building automation that touches production—and obsessive about safety: least‑privilege access, audit logs, approvals, canaries, and rollback.

Excited by AI tooling and agentic workflows (or already experienced): LLM‑based triage/summarization, retrieval over runbooks/postmortems, evaluation harnesses, and feedback loops.

Strong communication and collaboration skills—you can lead during incidents, write clearly,…

