
Senior SaaS Platform Reliability Engineer

Job in Basingstoke, Hampshire County, RG21, England, UK
Listing for: Once For All Limited
Full Time position
Listed on 2026-01-20
Job specializations:
  • IT/Tech
    Systems Engineer, SRE/Site Reliability, Cloud Computing, IT Support
Salary/Wage Range or Industry Benchmark: GBP 80,000 - 100,000 per year
Job Description & How to Apply Below

Once For All is a high-growth, cloud-based SaaS subscription business helping organisations manage supply chain governance, risk management, and compliance. We support over 250,000 customers across the UK in more than 20 public and private sector industries, including construction, transport, retail, hospitality, education, facilities management, manufacturing, and central and local government.

Role Summary

Join our engineering team as a Senior SaaS Platform Reliability Engineer, taking a foundational role in establishing and maturing reliability engineering practices across our SaaS platform.

This is not an advisory or oversight-only role. You will personally contribute to reliability improvements, working in partnership with our Scrum-based product teams and our DevOps and Cloud engineering teams to ensure those improvements are delivered safely into production.

You will help define SLIs, SLOs, and error budgets for tier-1, customer-facing services, and contribute to how reliability trade-offs are made and communicated. The role focuses on improving and evolving existing systems (observability, alerting, performance tooling, automation, and testing), rebuilding components where necessary to better support SLO-driven reliability.

This role is fully remote, working within UK time zones.

What "Hands-On" Means in This Role

To be explicit:

  • This is not a strategy-only role. You will personally contribute reliability improvements, working alongside product and platform teams to see them delivered into production.

  • Expect to spend ~60–80% of your time building and operating, including:

    • Instrumentation, dashboards, and alerting

    • Automation to reduce operational toil

    • Release safety mechanisms and guardrails

  • You will be an active on-call contributor for tier-1 services and lead incident response end-to-end (triage → mitigation → permanent fix → postmortem).

  • You will write production code (not only infrastructure-as-code) to improve reliability, availability, and performance.

  • You will build on and improve existing observability, alerting, performance, automation, and test systems, rebuilding components where necessary rather than starting from scratch.

  • Success in this role is measured by outcomes, such as:

    • Clear and trusted SLO reporting

    • Reduced alert noise

    • Improved p95 / p99 latency

    • Fewer production regressions and rollbacks

Job Responsibilities

Reliability & SLO Ownership
  • Define user-centric SLIs and SLOs for critical customer journeys.

  • Define and maintain error budgets, and work with sprint teams to guide how error budgets are understood and spent.

  • Provide visibility and expertise to support reliability-focused trade-offs while sprint teams remain accountable for delivery.

  • Own reliability outcomes for tier-1, customer-facing services.
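
For readers less familiar with the terminology, an error budget is the amount of unreliability an SLO tolerates over a window. The sketch below is purely illustrative and is not Once For All's tooling: the function name, the field names, and the 99.9% target in the usage note are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    allowed_failures: float    # failures the SLO tolerates in the window
    consumed_fraction: float   # share of the budget already spent (may exceed 1.0)
    remaining_failures: float  # failures left before the SLO is breached


def error_budget(slo_target: float, total_events: int, failed_events: int) -> ErrorBudget:
    """Compute a request-based error budget for one reporting window.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events < 0 or not 0 <= failed_events <= total_events:
        raise ValueError("event counts must satisfy 0 <= failed_events <= total_events")

    allowed = (1.0 - slo_target) * total_events
    if allowed > 0:
        consumed = failed_events / allowed
    else:
        # No traffic in the window, so there is no budget to spend.
        consumed = 0.0
    return ErrorBudget(allowed, consumed, allowed - failed_events)
```

With an assumed 99.9% target and 1,000,000 requests, roughly 1,000 failures are tolerated, so 250 failed requests would mean about a quarter of the budget has been spent.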

Observability & Alerting
  • Evaluate, design, and improve end-to-end observability across metrics, logs, and traces.

  • Improve existing alerting to focus on customer impact rather than infrastructure noise.

  • Continuously refine signals to improve detection, reduce false positives, and minimise alert fatigue.

Incident Response & Learning
  • Actively participate in and lead major incident response when required.

  • Run blameless postmortems focused on systemic improvement.

  • Track and improve MTTR, MTTD, and incident recurrence over time.
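
As a rough illustration of the metrics named above, MTTD and MTTR can be computed as averages over incident timestamps. This is a minimal sketch under stated assumptions: the Incident record and its field names are hypothetical, and MTTR is measured here from incident start to resolution (some teams measure it from detection instead).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    # Hypothetical record shape, not taken from any real incident tool.
    started: datetime   # when customer impact began
    detected: datetime  # when an alert or a person noticed
    resolved: datetime  # when impact was mitigated


def mean_duration(durations):
    """Average a non-empty collection of timedelta values."""
    durations = list(durations)
    if not durations:
        raise ValueError("at least one incident is required")
    return sum(durations, timedelta()) / len(durations)


def mttd(incidents):
    """Mean time to detect: average of (detected - started)."""
    return mean_duration(i.detected - i.started for i in incidents)


def mttr(incidents):
    """Mean time to resolve: average of (resolved - started)."""
    return mean_duration(i.resolved - i.started for i in incidents)
```

Tracking both values per quarter, alongside a count of repeat incidents, is one simple way to see whether detection and recovery are improving over time.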

Automation & Toil Reduction
  • Identify operational toil and reduce it through engineering and automation.

  • Improve and extend existing automation and operational tooling.

  • Treat reliability issues as software problems, not process gaps.

Availability, Performance & Scalability
  • Contribute to zone- and region-aware architectures on Azure and AKS.

  • Perform capacity planning and load testing aligned to SLOs.

  • Improve p95 and p99 latency, throughput, and behaviour under failure conditions.
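
For context, p95 and p99 are the latencies below which 95% and 99% of requests complete. The sketch below uses the nearest-rank method on raw samples; it is an illustrative assumption rather than a description of the company's tooling, and production monitoring systems often estimate percentiles from histograms instead.

```python
import math


def percentile(samples, pct):
    """Return the nearest-rank percentile of a non-empty list of latencies.

    pct is a value in (0, 100], for example 95 for p95.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")

    ordered = sorted(samples)
    # Nearest-rank: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]
```

Averages hide slow outliers, which is why tail percentiles such as these are commonly used as latency SLIs.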

Release Safety & Regression Prevention
  • Improve safe release mechanisms including canary, blue/green, and progressive delivery.

  • Strengthen detection and rollback strategies for faulty releases.

  • Partner with teams to improve test coverage where it directly protects production reliability.

Platform & Infrastructure
  • Contribute to infrastructure as code using Terraform or Bicep.

  • Help enforce standards and guardrails via poli…

Position Requirements
10+ Years work experience