Sr. Site Reliability Engineer, Observability (Fremont area, California, USA; IT/Tech)

What To Expect

You will be responsible for designing and building an enterprise-grade observability platform with a strong focus on metrics, providing end-to-end visibility and diagnostics across Tesla's infrastructure and applications. You will be part of the Observability team, which manages Tesla’s system observability and ensures visibility across global and internal applications, including digital, manufacturing, fleet, and Autopilot platforms. This role requires deep expertise in system engineering, Kubernetes deployments, metrics platforms (including Grafana Mimir or equivalent), and logging platforms (Splunk).

You will be responsible for ensuring the availability, performance, and scalability of a large, distributed metrics infrastructure that processes over a billion active time series.

What You'll Do

Build, deploy, scale, and maintain high-performance, multi-tenant, Prometheus-compatible monitoring systems that support billions of active time series
Develop custom observability solutions to address Tesla's unique requirements
Monitor cluster health using observability dashboards, optimize query performance, tune ingestion pipelines, and scale storage infrastructure to support long-term metrics retention
Design and implement next-generation observability platforms (metrics and logs) with a focus on scalability, reliability, and high performance
Manage large-scale distributed Splunk cluster environments handling over 500 TB of data daily
Collaborate with cross-functional teams, including SREs, architects, and other stakeholders, to understand complex application architectures and enable top-down monitoring strategies for comprehensive service visibility
Troubleshoot performance and access issues while managing metrics platforms (Grafana Mimir or equivalent), including installation and upgrades across clustered environments
Respond to and resolve support requests promptly while effectively balancing project timelines and competing priorities
Configure and manage CI/CD pipelines using tools such as Ansible and GitHub Actions to streamline operations
Participate in an on-call rotation to support critical systems outside regular business hours

What You'll Bring

Strong hands-on experience with observability stacks including Grafana Mimir, Prometheus, Cortex, Thanos, or equivalent enterprise-grade metrics platforms
Deep expertise in Linux system internals, large-scale performance tuning, and system administration
Solid hands-on experience with Kubernetes configuration, networking, deployment, and multi-cluster HA architectures
Advanced proficiency in PromQL and SQL, with strong understanding of high-cardinality metrics, label design, and series explosion impacts on storage and query performance
Experience with distributed systems architecture, multi-region deployments, and high-availability cluster design
Hands-on experience with S3-compatible object storage and with distributed streaming systems such as Apache Kafka or Redpanda
Strong knowledge of monitoring and observability practices including OpenTelemetry (OTLP), Protobuf, and Prometheus-based metrics collection
Experience configuring and tuning caching layers and managing authentication mechanisms (OAuth, reverse proxies, API gateways, mTLS)
Proven troubleshooting expertise and performance optimization experience in large-scale distributed metrics platforms
Splunk administration is a plus
Strong scripting and automation skills (Python, Ansible, GitHub Actions), excellent documentation practices, and participation in on-call and incident management processes

Compensation and Benefits

Along with competitive pay, as a full-time Tesla employee, you are eligible for the following benefits from day 1 of hire:

Medical plans, including plan options with a $0 payroll deduction
Family-building, fertility, adoption and surrogacy benefits
Dental (including orthodontic coverage) and vision plans, both with options for a $0 paycheck contribution
Company-paid Health Savings Account (HSA) contribution when enrolled in the high-deductible medical plan with HSA
Healthcare and Dependent Care Flexible Spending Accounts (FSA)
401(k) with employer match, Employee…

