Site Reliability Engineer Job, Raleigh area, North Carolina, USA, IT/Tech

Overview

At BuildOps, we’re building a software platform that empowers today’s commercial contractors. From service management to project execution, we’re reimagining how our customers operate. Our team thrives on ambition, innovation, and collaboration – qualities we look for in every new hire. You will join our cloud infrastructure and reliability engineering team as a Site Reliability Engineer (SRE). Your primary responsibility will be to improve and protect the reliability, performance, and operability of our production systems while helping evolve our AWS-based infrastructure.

We’re looking for someone with a strong SRE mindset, solid software engineering fundamentals, and deep observability expertise who can work effectively in a distributed team environment. Reporting to the DevOps and SRE Manager, this is a hands-on role where you will influence reliability strategy, build tooling and automation, and contribute directly to day-to-day operations in a fast-moving, industry-defining company.

What You’ll Do

Drive and refine modern SRE practices across services, including SLIs/SLOs, error budgets, and reliability reviews
Design and maintain end-to-end observability (metrics, logs, traces, dashboards, and alerts) so teams can quickly detect, debug, and prevent issues
Partner with product and engineering teams to design reliable services—reviewing architectures, failure modes, rollout strategies, and capacity/latency considerations
Help evolve and operate our AWS infrastructure (networking, compute, data stores) using Infrastructure as Code (Terraform)
Contribute code to services, tooling, and automation (for example, reliability libraries, deployment and incident tooling, health checks)
Define, implement, and iterate on SLIs, SLOs, and error budgets with service owners, and use them to guide reliability work and release decisions
Participate in incident response for infrastructure-related production issues, including learning-focused post-incident reviews and follow-through on action items
Develop runbooks, safeguards, and automation that reduce manual work, improve time-to-diagnosis, and standardize responses to recurring scenarios
Advocate for and implement security and compliance best practices in production environments
Document standards, playbooks, and best practices so reliability improvements scale across teams
Collaborate closely with software engineers, product managers, and other stakeholders to plan and deliver reliability-focused initiatives

What We Look For

3+ years of professional experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering, working on production systems and reliability-focused initiatives

Thorough understanding of and hands-on experience with modern SRE practices, such as:

Defining and implementing SLIs/SLOs and error budgets
Reducing toil through automation
Safe deployment and rollout patterns
Structured post-incident reviews and continuous improvement

Some software engineering experience required: you’ve written and maintained production-quality code and can work comfortably in at least one modern language (for example, Python or Node.js/TypeScript)
Interest in using LLMs to assist your work, with at least some experience doing so

Strong observability skills, including:

Designing metrics, logging, and tracing for multi-service systems
Building actionable dashboards and alerts with clear runbooks
Correlating metrics, logs, and traces to debug complex issues

Experience with tools such as Datadog, Prometheus, Grafana, Honeycomb, or New Relic (we use Datadog, but vendor-agnostic experience is welcome)
Experience working with AWS in production and with core platform primitives such as Terraform-based Infrastructure as Code and container/orchestration platforms (for example, Docker with ECS, EKS, or Kubernetes)

Incident management experience is a strong plus, including:

Participating in or coordinating incident response
Working within an incident management tool (for example, incident.io, PagerDuty, Opsgenie, or similar)
Helping teams implement durable, high-leverage follow-ups

Strong communication skills and the ability to explain complex technical topics to both technical and non-technical audiences
CS degree or equivalent experience running production systems; we are equally interested in people from non-traditional backgrounds who have spent time operating real-world environments
Ability to work a hybrid schedule: Monday/Friday WFH; Tuesday–Thursday in-office

Compensation

$105,000 - $130,000 base salary range + annual bonus
