
Principal Site Reliability Engineer - Automotive

Job in Durham, Durham County, North Carolina, 27703, USA
Listing for: Red Hat, Inc.
Full Time position
Listed on 2026-02-28
Job specializations:
  • IT/Tech
    Cloud Computing, Systems Engineer, SRE/Site Reliability, IT Project Manager
Salary/Wage Range or Industry Benchmark: 151,510 - 249,950 USD yearly
Job Description & How to Apply Below

Red Hat is seeking a Principal Site Reliability Engineer based in Raleigh with a passion for designing, maintaining and continuously improving highly reliable infrastructure. In this role, you will architect, design and lead the implementation of the Red Hat In-Vehicle OS (RHIVOS) product SRE initiative. You will partner with development, quality engineering, and release engineering colleagues to assess and then uphold the health and well-being of the infrastructure hosting software production services.

The ideal candidate will possess deep-dive infrastructure and product reliability knowledge and a proven ability to drive the adoption of their technical vision across engineering teams.

What you will do:
  • Architect, design and lead the implementation of the RHIVOS product SRE initiative.
  • Instrument metrics to support Service Level Objectives (SLOs), Service Level Indicators (SLIs) and Service Level Agreements (SLAs) for critical services.
  • Use the metrics built into the software to analyze system performance, identify performance bottlenecks and underutilized hardware, and scale the infrastructure design.
  • Review team contributions to the software, correct errors and provide constructive feedback.
  • Lead and participate in incident response and postmortems, and help identify steps to minimize Mean Time To Resolution (MTTR).
  • Regularly contribute to internal workshops and training to upskill the team as the product architecture evolves.
  • Configure and maintain software production infrastructure and tooling.
  • Serve as an internal expert on infrastructure and tooling, including software production pipelines, providing guidance to engineering teams and making high-level recommendations to improve efficiency, reliability, and stability.
  • Create/maintain service monitoring, improve automation, uphold security best practices and respond to various service situations for the software production infrastructure.
  • Resolve service incidents by use of existing operating procedures, investigate outage causes and coordinate incident resolution across various service teams.
  • Act as a leader and mentor to less experienced colleagues, drive continuous improvement ideas, and help the team benefit from evolving technology, such as AI tools.
  • Collaborate on incident retrospective reviews and the implementation of corrective items.
  • Proactively identify and eliminate toil by automating manual, repetitive, and error-prone processes.
  • Coordinate your actions with other Red Hat teams such as IT and Product Security to ensure our infrastructure meets quality expectations.
  • Implement monitoring, alerting and escalation plans in the event of an infrastructure outage or performance problem.
  • Work with service owners to co-define and implement SLIs and SLOs for the services you’ll support, ensure they are met, and execute remediation plans if they are not.
  • Help out and back up the RHIVOS Raleigh lab SRE when needed.
What you will bring:
  • 8+ years of software reliability engineering experience with deep expertise in Linux systems, infrastructure-as-code, and complex, distributed enterprise environments.
  • Linux administration expertise.
  • Advanced experience with Kubernetes/OpenShift administration and application development.
  • Advanced experience with automation services like Ansible or Terraform.
  • Advanced experience with CI/CD platforms like GitLab CI, Tekton and Pipelines as Code (optionally GitHub Actions, etc.).
  • Advanced experience with monitoring platforms and technologies.
  • Advanced experience with AWS technologies.
  • Experience with open source monitoring technologies (Grafana, Prometheus, OpenTelemetry).
  • Excellent written and verbal communication skills in English, as you'll be working in a globally distributed team.
  • Proven track record of leading, and implementing hands-on, the program- or product-wide adoption of a data-driven reliability framework by architecting complex, multi-service SLO/SLI standards and institutionalizing error budget policies that effectively balance rapid feature velocity with global system stability.
  • Previous experience with the Site Reliability Engineer (SRE) model and software development using Python…