Site Reliability Engineer – Own Reliability for High-Scale ML Platform Job, San Francisco area, California, USA, IT/Tech

Position: Staff Site Reliability Engineer – Own Reliability for High-Scale ML Platform

Senior/Staff Site Reliability Engineer (SRE)

Employment Type: W2 Only

Company: StratITech

StratITech is hiring a Staff Site Reliability Engineer to help our client in San Francisco scale and harden a data-intensive platform powering machine learning, neural network workloads, and real-time analytics.

This role is built for an SRE who grew up in Linux infrastructure, evolved through DevOps, and now operates at a Staff level: setting technical direction, reducing systemic risk, and raising the reliability bar across teams.

You’ll be hands-on, highly visible, and deeply embedded in how ML systems run in production.

What You’ll Do

Build and maintain scalable Linux-based infrastructure supporting real-time analytics and ML workloads
Improve system reliability and performance through automation, observability, and proactive capacity planning
Own CI/CD pipelines, deployment automation, rollback strategies, and configuration management for production systems
Implement and operate monitoring, alerting, SLOs, runbooks, and incident response processes for critical services
Partner with engineering and data science teams to ensure ML workloads are production-ready and reliable by design
Ensure security, compliance, and operational readiness across infrastructure and deployment workflows
Lead post-incident reviews and drive measurable, long-term reliability improvements

What We’re Looking For

Deep experience operating Linux infrastructure, systems, and networking in production
Proven impact as an SRE or DevOps Engineer supporting complex, distributed systems
Practical understanding of machine learning systems and neural network workloads in production
Hands-on experience with Docker and Kubernetes
Strong scripting skills (Bash and/or Python)
Experience with observability tools (Prometheus, Grafana, Datadog, ELK, OpenTelemetry)
CI/CD pipeline ownership experience (GitHub Actions, ArgoCD, or similar)
Ability to debug systemic failures across infrastructure, deployments, and workloads
Clear communicator who works effectively across engineering and data teams

Nice to Have

Experience supporting ML platforms at scale (training and inference)
AWS or cloud-managed services experience
Familiarity with data platforms such as Spark, Airflow, or Kafka
Experience operating in SOC 2 or regulated environments

Why This Role

Staff-level ownership of mission-critical infrastructure
Direct influence over how ML and analytics systems run in production
Engineering culture that values accountability, operational rigor, and impact

📩 Apply or message StratITech to start the conversation.
