Sr. Site Reliability Engineer, Observability
Listed on 2026-02-28
IT/Tech
Cloud Computing, Systems Engineer, IT Support, SRE/Site Reliability
What To Expect
You will be responsible for designing and building an enterprise-grade observability platform with a strong focus on metrics, providing end-to-end visibility and diagnostics across Tesla's infrastructure and applications. You will be part of the Observability team, which manages Tesla’s system observability and ensures visibility across global and internal applications, including digital, manufacturing, fleet, and Autopilot platforms. This role requires deep expertise in systems engineering, Kubernetes deployments, metrics platforms (Grafana Mimir or equivalent), and logging platforms (Splunk).
You will be responsible for ensuring the availability, performance, and scalability of a large, distributed metrics infrastructure that processes over a billion active time series.
What You'll Do
- Build, deploy, scale, and maintain high-performance, multi-tenant, Prometheus-compatible monitoring systems that support billions of active time series
- Develop custom observability solutions to address Tesla's unique requirements
- Monitor cluster health using observability dashboards, optimize query performance, tune ingestion pipelines, and scale storage infrastructure to support long-term metrics retention
- Design and implement next-generation observability platforms (metrics and logs) with a focus on scalability, reliability, and high performance
- Manage large-scale distributed Splunk cluster environments handling more than 500 TB of data daily
- Collaborate with cross-functional teams, including SREs, architects, and other stakeholders, to understand complex application architectures and enable top-down monitoring strategies for comprehensive service visibility
- Troubleshoot performance and access issues while managing metrics platforms (Grafana Mimir or equivalent), including installation and upgrades across clustered environments
- Respond to and resolve support requests promptly while effectively balancing project timelines and competing priorities
- Configure and manage CI/CD pipelines using tools such as Ansible and GitHub Actions to streamline operations
- Participate in an on-call rotation to support critical systems outside regular business hours
What You'll Bring
- Strong hands-on experience with observability stacks, including Grafana Mimir, Prometheus, Cortex, Thanos, or equivalent enterprise-grade metrics platforms
- Deep expertise in Linux system internals, large-scale performance tuning, and system administration
- Solid hands-on experience with Kubernetes configuration, networking, deployment, and multi-cluster HA architectures
- Advanced proficiency in PromQL and SQL, with a strong understanding of high-cardinality metrics, label design, and the impact of series explosion on storage and query performance
- Experience with distributed systems architecture, multi-region deployments, and high-availability cluster design
- Hands-on experience with S3-compatible object storage and with distributed streaming systems such as Apache Kafka or Redpanda
- Strong knowledge of monitoring and observability practices, including OpenTelemetry (OTLP), Protobuf, and Prometheus-based metrics collection
- Experience configuring and tuning caching layers and managing authentication mechanisms (OAuth, reverse proxies, API gateways, mTLS)
- Proven troubleshooting expertise and performance optimization experience in large-scale distributed metrics platforms; Splunk administration is a plus
- Strong scripting and automation skills (Python, Ansible, GitHub Actions), excellent documentation practices, and participation in on-call and incident management processes
Along with competitive pay, as a full-time Tesla employee, you are eligible for the following benefits at day 1 of hire:
- Medical plans, including plan options with a $0 payroll deduction
- Family-building, fertility, adoption and surrogacy benefits
- Dental (including orthodontic coverage) and vision plans, both with options for a $0 paycheck contribution
- Company-paid Health Savings Account (HSA) contribution when enrolled in the high-deductible medical plan with HSA
- Healthcare and Dependent Care Flexible Spending Accounts (FSA)
- 401(k) with employer match, Employee…