Site Reliability Engineer
Listed on 2026-02-28
IT/Tech
Systems Engineer, SRE/Site Reliability, Cloud Computing
Site Reliability Engineer (AI Infrastructure - Dev Tool Start-Up)
$150,000 - $220,000 + Equity + Benefits + PTO
San Francisco, CA
Are you passionate about keeping production AI infrastructure fast, reliable, and self‑healing? Do you thrive in environments where you directly own the systems that millions of LLM requests flow through every day?
This is an opportunity to join a fast‑growing, profitable startup at the forefront of AI infrastructure, building the reliability layer that powers how real customers deploy and use language models in production. Backed by top‑tier investors and trusted by major enterprises, the team has built a unified LLM gateway used as a critical proxy by engineering teams worldwide. Now, they're looking for a founding SRE to own the reliability, performance, and observability of that proxy in production.
As a founding member of the engineering team, you'll take ownership of the systems that keep the core proxy alive under load: debugging OOMs, resolving database connection exhaustion, fixing race conditions, and making the platform resilient when dependencies go down. You'll work directly with senior leadership, engage with a large open source community, and ensure that when customers put their entire AI stack behind this gateway, it never lets them down.
If you're looking for a role where you can combine deep systems debugging with real customer impact and directly influence the infrastructure that underpins modern AI applications, this is an outstanding opportunity.
The Role:
- Own and resolve production reliability issues including OOMs, deadlocks, connection pool exhaustion, and race conditions
- Optimize performance across hot paths including spend tracking, database writes, and health checks
- Improve Redis and in-memory cache reliability across multi‑pod deployments
- Make the proxy self‑healing with graceful degradation, retry logic, and proper health checks when DB or Redis is unavailable
- Build and maintain Prometheus metrics, alerting, and observability for production deployments
- Collaborate directly with customers and the open source community to turn real‑world issues into platform improvements
Requirements:
- 1-4 years running Python services in production at scale
- Experience debugging OOMs, memory leaks, race conditions, and deadlocks in live environments
- Strong familiarity with PostgreSQL, Redis, and Kubernetes in live environments
- Comfortable owning production systems and debugging customer-facing incidents
- Solid understanding of distributed systems, connection pooling, and caching layers
- Excited to work in an early‑stage, high‑ownership, fast‑shipping environment
We are an equal opportunities company and welcome applications from all suitable candidates.