Capacity and Performance Reliability Manager Job, London area, Greater London, England, UK, IT/Tech

Location: Greater London

Capacity and Performance Reliability Manager

**Shift Pattern:** Standard 40 Hour Week (United Kingdom)

**Scheduled Weekly Hours:** 40

**Corporate Grade:** D - Assistant Vice President

**Reporting Line:** (UK Division) Information Technology

**Location:** UK-London

**Worker Type:** Permanent

**About the London Metal Exchange**

The London Metal Exchange (LME) is the world centre for industrial metals trading. Most of the world’s non-ferrous futures business is conducted on the LME’s three trading platforms, totalling $18 trillion, 178 million lots and 4 billion tonnes, with a market open interest high of 1.8 million lots in 2024. The metals community uses the LME, an HKEX Group company, as a venue to transfer or take on price risk, as a physical market of last resort and as the provider of transparent global reference prices.

**Overall Purpose of Role:**

Capacity Management at the LME is a key function, linked to strict regulatory compliance requirements, to actively manage multiple environments. With a large virtual estate encompassing multiple VMware clusters and OpenShift Container Platform (OCP), the Capacity and Performance Reliability Engineer is key to ensuring the stability of the platforms.

The Capacity and Performance Reliability Engineer is responsible for ensuring the reliability, availability, and performance of all infrastructure and services, proactively identifying and mitigating risks, and driving continuous improvement in operational resilience and service quality. This includes maintenance of the capacity management tool suite, capacity reporting, trend analysis and forecasting, ad hoc performance investigations, demand management, and governance of the relevant processes and policies.

The Capacity and Performance Reliability Engineer must have extensive knowledge of trading technologies and the operation of a trading venue, with the ability to incorporate business metrics and knowledge into the technical metrics from the LME core systems.

**Responsibilities**

**Capacity Planning & Performance Management**

* Use historical data and predictive analytics to forecast demand and plan capacity for all environments (virtual, containerised, and physical).
* Perform stress testing, scenario modelling, and performance tuning to ensure systems can handle peak loads.
* Automate scaling, resource allocation, and infrastructure provisioning using Infrastructure as Code (IaC) and cloud-native tools.
* Maintain and enhance the Capacity Management tool suite (e.g., Athene, Grafana), ensuring zero data loss and maximum automation.

**Collaboration & Continuous Improvement**

* Work closely with development, operations, and business teams to embed reliability and capacity considerations into system design and delivery.
* Promote best practices in automation, observability, and incident management.
* Present findings, reports, and recommendations to business heads, service managers, and technical teams.
* Build relationships with internal and external stakeholders, including architects, testing teams, service managers, project sponsors, and third-party suppliers.

**Metrics, Reporting & Governance**

* Produce regular service and infrastructure capacity plans, reliability reports, and recommendations for action.
* Own and manage the Capacity Management Recommendations tracker.
* Report on reliability metrics, incidents, and system health to senior management.
* Ensure compliance with regulatory requirements and internal governance standards.

**Reliability Engineering & System Health**

* Develop, implement, and maintain Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets for critical services.
* Design and manage monitoring, alerting, and observability solutions to detect and prevent failures.
* Lead incident response, conduct blameless post-incident reviews, and drive corrective actions to prevent recurrence.
* Champion a reliability-focused, automation-first culture across teams.

**Professional Qualifications**

Required:

* Educated to degree standard and/or 5+ years of performance and capacity experience
* ITIL Foundation Certification
*…

