Site Reliability Engineer
Listed on 2026-02-28
IT/Tech
Systems Engineer, SRE/Site Reliability, Cloud Computing
Site Reliability Engineer (AI Infrastructure - Dev Tool Start-Up)
$150,000 - $220,000 + Equity + Benefits + PTO
San Francisco, CA
Are you passionate about keeping production AI infrastructure fast, reliable, and self‑healing? Do you thrive in environments where you directly own the systems that millions of LLM requests flow through every day?
This is an opportunity to join a fast‑growing, profitable startup at the forefront of AI infrastructure, building the reliability layer that powers how real customers deploy and use language models in production. Backed by top‑tier investors and trusted by major enterprises, the team has built a unified LLM gateway used as a critical proxy by engineering teams worldwide. Now, they're looking for a founding SRE to own the reliability, performance, and observability of that proxy in production.
As a founding member of the engineering team, you'll take ownership of the systems that keep the core proxy alive under load: debugging OOMs, resolving database connection exhaustion, fixing race conditions, and making the platform resilient when dependencies go down. You'll work directly with senior leadership, engage with a large open source community, and ensure that when customers put their entire AI stack behind this gateway, it never lets them down.
If you're looking for a role where you can combine deep systems debugging with real customer impact and directly influence the infrastructure that underpins modern AI applications, this is an outstanding opportunity.
The Role:
- Own and resolve production reliability issues including OOMs, deadlocks, connection pool exhaustion, and race conditions
- Optimize performance across hot paths including spend tracking, database writes, and health checks
- Improve Redis and in-memory cache reliability across multi‑pod deployments
- Make the proxy self‑healing with graceful degradation, retry logic, and proper health checks when DB or Redis is unavailable
- Build and maintain Prometheus metrics, alerting, and observability for production deployments
- Collaborate directly with customers and the open source community to turn real‑world issues into platform improvements
Requirements:
- 1-4 years running Python services in production at scale
- Experience debugging OOMs, memory leaks, race conditions, and deadlocks in live environments
- Strong familiarity with PostgreSQL, Redis, and Kubernetes in live environments
- Comfortable owning production systems and debugging customer-facing incidents
- Solid understanding of distributed systems, connection pooling, and caching layers
- Excited to work in an early‑stage, high‑ownership, fast‑shipping environment
We are an equal opportunities company and welcome applications from all suitable candidates.