Sr. Site Reliability Engineer, Observability

Job in Fremont, Alameda County, California, 94537, USA
Listing for: Tesla
Full Time position
Listed on 2026-02-28
Job specializations:
  • IT/Tech
    Cloud Computing, Systems Engineer, IT Support, SRE/Site Reliability
Salary/Wage Range or Industry Benchmark: 100000 - 125000 USD Yearly
Job Description & How to Apply Below

What To Expect

You will be responsible for designing and building an enterprise-grade observability platform with a strong focus on metrics, providing end-to-end visibility and diagnostics across Tesla's infrastructure and applications. You will be part of the Observability team, which manages Tesla’s system observability and ensures visibility across global and internal applications, including digital, manufacturing, fleet, and Autopilot platforms. This role requires deep expertise in systems engineering, Kubernetes deployments, metrics platforms (Grafana Mimir or equivalent), and logging platforms (Splunk).

You will be responsible for ensuring the availability, performance, and scalability of a large, distributed metrics infrastructure that processes over a billion active time series.

What You'll Do
  • Build, deploy, scale, and maintain high-performance, multi-tenant, Prometheus-compatible monitoring systems that support billions of active time series
  • Develop custom observability solutions that address Tesla's unique requirements
  • Monitor cluster health using observability dashboards, optimize query performance, tune ingestion pipelines, and scale storage infrastructure to support long-term metrics retention
  • Design and implement next-generation observability platforms (metrics and logs) with a focus on scalability, reliability, and high performance
  • Manage large-scale distributed Splunk cluster environments handling over 500 TB of data daily
  • Collaborate with cross-functional teams, including SREs, architects, and other stakeholders, to understand complex application architectures and enable top-down monitoring strategies for comprehensive service visibility
  • Troubleshoot performance and access issues while managing metrics platforms (Grafana Mimir or equivalent), including installation and upgrades across clustered environments
  • Respond to and resolve support requests promptly while effectively balancing project timelines and competing priorities
  • Configure and manage CI/CD pipelines using tools such as Ansible and GitHub Actions to streamline operations
  • Participate in an on-call rotation to support critical systems outside regular business hours
What You'll Bring
  • Strong hands-on experience with observability stacks including Grafana Mimir, Prometheus, Cortex, Thanos, or equivalent enterprise-grade metrics platforms
  • Deep expertise in Linux system internals, large-scale performance tuning, and system administration
  • Solid hands-on experience with Kubernetes configuration, networking, deployment, and multi-cluster HA architectures
  • Advanced proficiency in PromQL and SQL, with strong understanding of high-cardinality metrics, label design, and series explosion impacts on storage and query performance
  • Experience with distributed systems architecture, multi-region deployments, and high-availability cluster design
  • Hands-on experience with S3-compatible object storage and with distributed streaming systems such as Apache Kafka or Redpanda
  • Strong knowledge of monitoring and observability practices including OpenTelemetry (OTLP), Protobuf, and Prometheus-based metrics collection
  • Experience configuring and tuning caching layers and managing authentication mechanisms (OAuth, reverse proxies, API gateways, mTLS)
  • Proven troubleshooting expertise and performance optimization experience in large-scale distributed metrics platforms; Splunk administration is a plus
  • Strong scripting and automation skills (Python, Ansible, GitHub Actions), excellent documentation practices, and participation in on-call and incident management processes
Compensation and Benefits

Along with competitive pay, as a full-time Tesla employee, you are eligible for the following benefits from day 1 of hire:

  • Medical plans, including plan options with a $0 payroll deduction
  • Family-building, fertility, adoption and surrogacy benefits
  • Dental (including orthodontic coverage) and vision plans, both with options for a $0 paycheck contribution
  • Company-paid Health Savings Account (HSA) contribution when enrolled in the High-Deductible medical plan with HSA
  • Healthcare and Dependent Care Flexible Spending Accounts (FSA)
  • 401(k) with employer match, Employee…