Site Reliability Engineer – Own Reliability - Scale ML Platform
Job in San Francisco, San Francisco County, California, 94199, USA
Listed on 2026-03-01
Listing for: Stratitech Services LLC
Full Time position
Job specializations:
- IT/Tech
Cloud Computing, SRE/Site Reliability, Systems Engineer, Network Engineer
Job Description & How to Apply Below
Senior/Staff Site Reliability Engineer (SRE)
Employment Type: W2 Only
Company: StratITech
StratITech is hiring a Staff Site Reliability Engineer to help our client in San Francisco scale and harden a data-intensive platform powering machine learning, neural network workloads, and real-time analytics.
This role is built for an SRE who grew up in Linux infrastructure, evolved through DevOps, and now operates at a Staff level: setting technical direction, reducing systemic risk, and raising the reliability bar across teams.
You’ll be hands-on, highly visible, and deeply embedded in how ML systems run in production.
What You’ll Do
- Build and maintain scalable Linux-based infrastructure supporting real-time analytics and ML workloads
- Improve system reliability and performance through automation, observability, and proactive capacity planning
- Own CI/CD pipelines, deployment automation, rollback strategies, and configuration management for production systems
- Implement and operate monitoring, alerting, SLOs, runbooks, and incident response processes for critical services
- Partner with engineering and data science teams to ensure ML workloads are production-ready and reliable by design
- Ensure security, compliance, and operational readiness across infrastructure and deployment workflows
- Lead post-incident reviews and drive measurable, long-term reliability improvements
- Deep experience operating Linux infrastructure, systems, and networking in production
- Proven impact as an SRE or DevOps Engineer supporting complex, distributed systems
- Practical understanding of machine learning systems and neural network workloads in production
- Hands-on experience with Docker and Kubernetes
- Strong scripting skills (Bash and/or Python)
- Experience with observability tools (Prometheus, Grafana, Datadog, ELK, OpenTelemetry)
- CI/CD pipeline ownership experience (GitHub Actions, ArgoCD, or similar)
- Ability to debug systemic failures across infrastructure, deployments, and workloads
- Clear communicator who works effectively across engineering and data teams
- Experience supporting ML platforms at scale (training and inference)
- AWS or cloud-managed services experience
- Familiarity with data platforms such as Spark, Airflow, or Kafka
- Experience operating in SOC 2 or regulated environments
- Staff-level ownership of mission-critical infrastructure
- Direct influence over how ML and analytics systems run in production
- Engineering culture that values accountability, operational rigor, and impact
📩 Apply or message StratITech to start the conversation.