Site Reliability Engineer

Job in San Francisco, San Francisco County, California, 94199, USA
Listing for: Hamilton Barnes
Full Time position
Listed on 2026-01-10
Job specializations:
  • IT/Tech
    Systems Engineer, Cloud Computing, SRE/Site Reliability, Network Engineer
Job Description & How to Apply Below

Overview

Join a stealth-mode startup building out its AI and cloud platform, powered by thousands of H100s, H200s, and B200s and ready for experimentation, full-scale model training, or inference. As a Platform Engineer/Senior Site Reliability Engineer, you’ll own the reliability, performance, and automation of this GPU-powered infrastructure, ensuring seamless orchestration across environments managed by Slurm, Kubernetes, or direct SSH access. You will support the company’s new products as they come to market and help shape the operational backbone of one of the largest GPU clusters in private deployment.

If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it.

Responsibilities
  • Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
  • Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
  • Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
  • Collaborate with ML, networking, and platform teams to optimise resource scheduling, GPU utilisation, and data flow.
  • Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
  • Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.
Required Skills & Experience
  • Customer-facing experience with a proactive, versatile “Swiss army knife” attitude.
  • Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
  • Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
  • Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
  • Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
  • Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
  • Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.
Seniority level

Mid-Senior level

Employment type

Full-time

Job function

Information Technology

Industries

Staffing and Recruiting

Note:

This description reflects the job as posted and does not include boilerplate recruitment language not related to the role.
