
Site Reliability Engineer

Job in Raleigh, Wake County, North Carolina, 27601, USA
Listing for: BuildOps
Full Time position
Listed on 2026-02-24
Job specializations:
  • IT/Tech
    SRE/Site Reliability, Cloud Computing
Salary/Wage Range or Industry Benchmark: 105000 - 130000 USD Yearly
Job Description & How to Apply Below

Overview

At BuildOps, we’re building a software platform that empowers today’s commercial contractors. From service management to project execution, we’re reimagining how our customers operate. Our team thrives on ambition, innovation, and collaboration – qualities we look for in every new hire. You will join our cloud infrastructure and reliability engineering team as a Site Reliability Engineer (SRE). Your primary responsibility will be to improve and protect the reliability, performance, and operability of our production systems while helping evolve our AWS-based infrastructure.

We’re looking for someone with a strong SRE mindset, solid software engineering fundamentals, and deep observability expertise who can work effectively in a distributed team environment. Reporting to the DevOps and SRE Manager, this is a hands-on role where you will influence reliability strategy, build tooling and automation, and contribute directly to day-to-day operations in a fast-moving, industry-defining company.

What You’ll Do
  • Drive and refine modern SRE practices across services, including SLIs/SLOs, error budgets, and reliability reviews
  • Design and maintain end-to-end observability (metrics, logs, traces, dashboards, and alerts) so teams can quickly detect, debug, and prevent issues
  • Partner with product and engineering teams to design reliable services—reviewing architectures, failure modes, rollout strategies, and capacity/latency considerations
  • Help evolve and operate our AWS infrastructure (networking, compute, data stores) using Infrastructure as Code (Terraform)
  • Contribute code to services, tooling, and automation (for example, reliability libraries, deployment and incident tooling, health checks)
  • Define, implement, and iterate on SLIs, SLOs, and error budgets with service owners, and use them to guide reliability work and release decisions
  • Participate in incident response for infrastructure-related production issues, including learning-focused post-incident reviews and follow-through on action items
  • Develop runbooks, safeguards, and automation that reduce manual work, improve time-to-diagnosis, and standardize responses to recurring scenarios
  • Advocate for and implement security and compliance best practices in production environments
  • Document standards, playbooks, and best practices so reliability improvements scale across teams
  • Collaborate closely with software engineers, product managers, and other stakeholders to plan and deliver reliability-focused initiatives
What We Look For
  • 3+ years of professional experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering, working on production systems and reliability-focused initiatives
Thorough understanding of and hands-on experience with modern SRE practices, such as:
  • Defining and implementing SLIs/SLOs and error budgets
  • Reducing toil through automation
  • Safe deployment and rollout patterns
  • Structured post-incident reviews and continuous improvement
  • Some software engineering experience required: you’ve written and maintained production-quality code and can work comfortably in at least one modern language (for example, Python or Node.js/TypeScript)
  • Interest in using LLMs to assist your work, with at least some experience doing so
Strong observability skills, including:
  • Designing metrics, logging, and tracing for multi-service systems
  • Building actionable dashboards and alerts with clear runbooks
  • Correlating metrics, logs, and traces to debug complex issues
  • Experience with tools such as Datadog, Prometheus, Grafana, Honeycomb, or New Relic (we use Datadog, but vendor-agnostic experience is welcome)
  • Experience working with AWS in production and with core platform primitives such as Terraform-based Infrastructure as Code and container/orchestration platforms (for example, Docker with ECS, EKS, or Kubernetes)
Incident management experience is a strong plus, including:
  • Participating in or coordinating incident response
  • Working within an incident management tool (for example, incident.io, PagerDuty, Opsgenie, or similar)
  • Helping teams implement durable, high-leverage follow-ups
  • Strong communication skills and the ability to explain complex technical topics to both technical and non-technical audiences
  • CS degree or equivalent experience running production systems; we are equally interested in people from non-traditional backgrounds who have spent time operating real-world environments
  • Ability to work a hybrid schedule: Monday and Friday from home, Tuesday through Thursday in-office
Compensation
  • $105,000 - $130,000 base salary range + annual bonus