
Staff Site Reliability Engineer – Own Reliability for High-Scale ML Platform

Job in San Francisco, San Francisco County, California, 94199, USA
Listing for: Stratitech Services LLC
Full Time position
Listed on 2026-03-01
Job specializations:
  • IT/Tech
    Cloud Computing, SRE/Site Reliability, Systems Engineer, Network Engineer
Salary/Wage Range or Industry Benchmark: 100,000 - 125,000 USD yearly
Job Description & How to Apply Below
Position: Staff Site Reliability Engineer – Own Reliability for High-Scale ML Platform

Senior/Staff Site Reliability Engineer (SRE)

Employment Type: W2 Only

Company: StratITech

StratITech is hiring a Staff Site Reliability Engineer to help our client in San Francisco scale and harden a data-intensive platform powering machine learning, neural network workloads, and real-time analytics.

This role is built for an SRE who grew up in Linux infrastructure, evolved through DevOps, and now operates at a Staff level: setting technical direction, reducing systemic risk, and raising the reliability bar across teams.

You’ll be hands-on, highly visible, and deeply embedded in how ML systems run in production.

What You’ll Do
  • Build and maintain scalable Linux-based infrastructure supporting real-time analytics and ML workloads
  • Improve system reliability and performance through automation, observability, and proactive capacity planning
  • Own CI/CD pipelines, deployment automation, rollback strategies, and configuration management for production systems
  • Implement and operate monitoring, alerting, SLOs, runbooks, and incident response processes for critical services
  • Partner with engineering and data science teams to ensure ML workloads are production-ready and reliable by design
  • Ensure security, compliance, and operational readiness across infrastructure and deployment workflows
  • Lead post-incident reviews and drive measurable, long-term reliability improvements
What We’re Looking For
  • Deep experience operating Linux infrastructure, systems, and networking in production
  • Proven impact as an SRE or DevOps Engineer supporting complex, distributed systems
  • Practical understanding of machine learning systems and neural network workloads in production
  • Hands-on experience with Docker and Kubernetes
  • Strong scripting skills (Bash and/or Python)
  • Experience with observability tools (Prometheus, Grafana, Datadog, ELK, OpenTelemetry)
  • CI/CD pipeline ownership experience (GitHub Actions, ArgoCD, or similar)
  • Ability to debug systemic failures across infrastructure, deployments, and workloads
  • Clear communicator who works effectively across engineering and data teams
Nice to Have
  • Experience supporting ML platforms at scale (training and inference)
  • AWS or cloud-managed services experience
  • Familiarity with data platforms such as Spark, Airflow, or Kafka
  • Experience operating in SOC 2 or regulated environments
Why This Role
  • Staff-level ownership of mission-critical infrastructure
  • Direct influence over how ML and analytics systems run in production
  • Engineering culture that values accountability, operational rigor, and impact

📩 Apply or message StratITech to start the conversation.
