Service Reliability
Listed on 2026-02-25
IT/Tech
Systems Engineer, IT Project Manager
Overview
Job Title: Senior Recovery Lead and Global Head of Service Reliability
Location: Sheffield (Hybrid)
6-month contract
Service Management (SM)
Service Management’s purpose is to protect the availability, integrity and confidentiality of the IT Services that underpin customers' and colleagues' experience of the brand. It is a multi-functional team comprising Change Management, Incident Management, Problem Management, Service Level Management, Outage Management, Service Recovery, and Service Insights and Reporting.
About the Role
We are seeking a senior technology leader to take on the dual role of Senior Recovery Lead and Global Head of Service Reliability. This is a highly visible, high-impact position reporting to the Global Head of Service Management, with a mandate to transform how we recover from incidents and build long-term service resilience.
This individual will lead a global team of technical experts who act as technical escalation partners during major incidents—helping reduce time to recover (TTR) through deep technical engagement, coordination, and engineering-driven solutions. Beyond recovery, this leader will also own the strategic and tactical roadmap for building reliable, self-healing systems through collaboration with Problem Management, SRE, and Platform teams.
Key Responsibilities
- Lead a global, follow-the-sun team that acts as technical escalation partners during major incidents.
- Partner with Incident Managers and Service Owners to accelerate incident diagnosis and resolution, reducing TTR and restoring services quickly and safely.
- Bring calm, coordination, and engineering clarity to high-pressure recovery efforts.
- Collaborate with Problem Managers, Product SRE, and Platform Engineering teams to identify and eliminate systemic causes of major incidents.
- Own and drive long-term remediation plans, including automation, reliability engineering, and platform guardrails to reduce future risk.
- Track and govern follow-up actions to ensure completeness, accountability, and measurable reduction in incident recurrence.
- Define and implement strategies for resilience engineering, including self-healing capabilities, automation of recovery workflows, and risk mitigation patterns.
- Advocate for operational excellence by embedding reliability standards, testing practices, and continuous improvement processes into engineering workflows.
- Partner with Architecture and Engineering leaders to influence system design with reliability in mind.
- Own the global incident scenario planning framework, ensuring that Technology is prepared to recover from widespread, complex failures.
- Design and run mass recovery simulations, chaos testing, and resilience drills to expose weaknesses and improve readiness.
- Work with regional and global risk teams to align with regulatory and operational resilience requirements.
- Build, scale, and lead a high-performing global team with deep technical skills and a culture of urgency, ownership, and collaboration.
- Drive a blameless, learning-focused culture that emphasizes root cause thinking, accountability, and continuous improvement.
- Act as a trusted partner and thought leader across Engineering, Infrastructure, Risk, and Service Management functions.
- 12+ years in Technology, with proven experience in Site Reliability Engineering, Infrastructure, DevOps, or Technical Operations.
- Demonstrated experience leading global technical teams in complex, high-scale environments.
- Deep expertise in incident recovery, automation, systems design, and platform reliability.
- Strong working knowledge of problem management, root cause analysis frameworks, and resilience engineering principles.
- Experience designing and running resilience exercises, chaos engineering, or incident scenario testing at scale.
- Comfortable operating in regulated environments and partnering with Risk and Compliance functions.
- Excellent stakeholder management and communication skills, with the ability to lead through influence at senior levels.
- Technical Depth – Ability to dive deep across infrastructure, applications, and cloud-native architectures.
- Recovery Leadership – Skilled in coordinating technical resources under pressure to resolve incidents rapidly.
- Reliability Thinking – Strategic mindset focused on system robustness, automation, and prevention.
- Change Agent – Drives cultural and engineering change to improve stability and accountability.
- Cross-Functional Collaboration – Adept at aligning goals and actions across engineering, operations, and risk domains.