Principal Site Reliability Engineer - Automotive Job, Durham area, North Carolina, USA, IT/Tech

Red Hat is seeking a Principal Site Reliability Engineer based in Raleigh with a passion for designing, maintaining and continuously improving highly reliable infrastructure. In this role, you will architect, design and lead the implementation of the Red Hat In-Vehicle OS (RHIVOS) product SRE initiative. You will partner with development, quality engineering, and release engineering colleagues to assess and then uphold the health and well-being of the infrastructure hosting software production services.

The ideal candidate will possess deep-dive infrastructure and product reliability knowledge and a proven ability to drive the adoption of their technical vision across engineering teams.

What you will do:

Architect, design and lead the implementation of the RHIVOS product SRE initiative.
Instrument metrics to support Service Level Objectives (SLOs), Service Level Indicators (SLIs) and Service Level Agreements (SLAs) for critical services.
Use metrics designed and built into the software to analyze system performance, identify performance bottlenecks and underutilized hardware, and scale the infrastructure design.
Review team contributions to the software, correcting errors and providing constructive feedback.
Lead and participate in incident response and postmortems, and help identify steps to minimize Mean Time To Resolution (MTTR).
Regularly contribute to internal workshops and training to upskill the team as the product architecture evolves.
Configure and maintain software production infrastructure and tooling.
Serve as an internal expert on infrastructure and tooling, including software production pipelines, providing guidance to engineering teams and making high-level recommendations to improve efficiency, reliability, and stability.
Create and maintain service monitoring, improve automation, uphold security best practices, and respond to various service situations for the software production infrastructure.
Resolve service incidents using existing operating procedures, investigate outage causes, and coordinate incident resolution across various service teams.
Act as a leader and mentor to your less experienced colleagues, bring and drive continuous improvement ideas, and help the team benefit from technology evolution, such as the use of AI tools.
Collaborate on incident retrospective reviews and the implementation of corrective items.
Proactively identify and eliminate toil by automating manual, repetitive, and error-prone processes.
Coordinate your actions with other Red Hat teams such as IT and Product Security to ensure our infrastructure meets quality expectations.
Implement monitoring, alerting and escalation plans for infrastructure outages and performance problems.
Work with service owners to co-define and implement SLIs and SLOs for the services you’ll support, ensure they are met, and execute remediation plans if they are not.
Help out and back up the RHIVOS Raleigh lab SRE when needed.

What you will bring:

8+ years of software reliability engineering experience with deep expertise in Linux systems, infrastructure-as-code, and complex, distributed enterprise environments.
Linux administration expertise.
Advanced experience with Kubernetes/OpenShift administration and application development.
Advanced experience with automation services like Ansible or Terraform.
Advanced experience with CI/CD platforms like GitLab CI, Tekton and Pipelines as Code (optionally GitHub Actions, etc.).
Advanced experience with monitoring platforms and technologies.
Advanced experience with AWS technologies.
Experience with open source monitoring technologies (Grafana, Prometheus, OpenTelemetry).
Excellent written and verbal communication skills in English, as you'll be working in a globally distributed team.
Proven track record of leading, and implementing hands-on, the program- or product-wide adoption of a data-driven reliability framework by architecting complex, multi-service SLO/SLI standards and institutionalizing error budget policies that effectively balance rapid feature velocity with global system stability.
Previous experience with the Site Reliability Engineer (SRE) model and software development using Python…

