Senior Platform & Reliability Engineer (SRE)

Job in San Francisco, San Francisco County, California, 94199, USA
Listing for: Vizcom
Full Time position
Listed on 2026-03-01
Job specializations:
  • IT/Tech
    Systems Engineer, Cloud Computing, SRE/Site Reliability, IT Support
Salary/Wage Range or Industry Benchmark: 200,000 - 250,000 USD yearly
Job Description & How to Apply Below
Position: Senior Platform & Reliability Engineer (SRE)

About Vizcom

Vizcom is a visual creation platform that combines modern web tooling with AI-powered workflows. Our stack includes a React/TypeScript frontend, Node/Koa + PostGraphile API services, PostgreSQL, Redis, BullMQ queues, and Kubernetes-based production infrastructure.

We’re hiring a senior owner of stability and infrastructure to ensure the platform is reliable, fast, and resilient as we scale.

Role Mission

Own service reliability end-to-end: prevent incidents, reduce blast radius when failures happen, and lead fast, high-quality recovery when production degrades.

Compensation

$200,000 – $250,000 base salary + meaningful equity

What You’ll Own

  • Reliability bar: Set and enforce SLIs/SLOs/error budgets for critical user flows.

  • Production architecture resilience: Drive failure isolation across API, workers, queues, and dependencies so one subsystem cannot take down core access.

  • Kubernetes runtime reliability: Define probe contracts, rollout/rollback standards, graceful shutdown behavior, scaling/resource policies, and startup safety.

  • Queue + job safety (BullMQ/Redis): Own poison pill containment and workload isolation.

  • Incident command quality: Lead Sev1/Sev2 response end-to-end (containment, communications, technical direction, RCA, corrective action execution).

  • Reliability operating system: Own observability quality (signals over noise), on-call effectiveness, runbooks, and postmortem discipline.

  • Release safety authority: Gate risky deploys and enforce reliability guardrails when production health is at risk.

Traits We’re Looking For

  • Calm, structured incident commander under pressure.

  • Thinks in failure modes and blast radius by default.

  • Pragmatic: can stabilize quickly, then implement durable fixes.

  • High ownership and strong written communication.

First 90 Days

  • Establish baseline reliability metrics and identify top platform risks.

  • Tighten incident response mechanics (roles, comms cadence, runbooks, status updates).

  • Deliver high-impact hardening fixes across probes/startup paths/queue safety.

  • Publish a prioritized 6–12 month reliability roadmap with clear ownership and milestones.

If possible, please include one incident you personally led and send to  :

1) what failed,

2) how you contained it,

3) what permanent fixes you shipped and measured.

Position Requirements
10+ years of work experience