Lead - Capacity & Automation; SRE
Listed on 2026-01-10
IT/Tech
Systems Engineer, Cloud Computing
Overview
- Own the Private Cloud "EC.3" Capacity Management Platform - act as the single accountable owner for capacity planning, forecasting, modelling, and optimisation across the VMware-based Enterprise Cloud v3 environment.
- Define and Deliver the Capacity Roadmap - translate business demand and programme milestones into a prioritised backlog of features and automation, using Agile delivery practices.
- Implement SRE Guardrails - establish SLIs, SLOs, and error budgets for infrastructure-related reliability; ensure proactive risk management.
- Develop Forecasting Models - build accurate short-, medium-, and long-term capacity forecasts using telemetry and scenario analysis to prevent saturation and ensure headroom.
- Automate Capacity Workflows - create scripts, policies, and integrations for rightsizing, placement, and quota enforcement using PowerCLI, APIs, and IaC.
- Maintain Real-Time Telemetry & Dashboards - provide a single source of truth for utilisation, trends, and optimisation opportunities through VMware Aria Operations (vROps) and reporting tools.
- Optimise Cost and Efficiency - align with FinOps principles to deliver showback/chargeback reporting, identify waste, and implement cost-saving measures without compromising reliability.
- Integrate with ITSM & Governance - ensure ServiceNow CMDB accuracy, automate request fulfilment, and maintain compliance with capacity policies and audit requirements.
- Collaborate Across Teams - work closely with Architecture, Programme Delivery, Finance, and Operations to align capacity decisions with strategic objectives and risk appetite.
- Continuously Improve - evolve the capacity management capability through iterative enhancements, stakeholder feedback, and adoption of emerging best practices.
- Vision & Strategy - Define and communicate the long-term vision for capacity management on EC.3, ensuring alignment with business objectives and technology strategy.
- Ownership & Accountability - Act as the single point of accountability for capacity planning, forecasting, and optimisation across the VMware platform.
- Influence & Stakeholder Engagement - Build strong relationships with senior stakeholders, programme leads, and cross-functional teams to drive decisions and secure buy-in.
- Agile Leadership - Champion Agile ways of working, ensuring backlog prioritisation, iterative delivery, and continuous improvement of the capacity capability.
- Reliability Governance - Embed SRE principles into leadership decisions, balancing innovation with risk management through SLIs, SLOs, and error budgets.
- Financial Stewardship - Lead cost optimisation initiatives aligned with FinOps principles, ensuring efficient use of resources and transparent reporting.
- Team Enablement - Mentor and guide engineers and analysts, fostering a culture of automation, data-driven decision-making, and operational excellence.
- Change Leadership - Drive adoption of new processes, tools, and automation across teams, ensuring smooth transitions and minimal disruption.
- Executive Communication - Provide clear, concise updates on capacity health, risks, and roadmap progress to senior leadership and governance boards.
- Continuous Improvement - Lead retrospectives and postmortems to identify systemic improvements and embed lessons learned into future planning.
- Capacity Headroom Policy - Define minimum thresholds for CPU, memory, and storage across clusters to ensure reliability and performance.
- Forecasting Approach - Select and implement the models and tools used for short-, medium-, and long-term capacity planning.
- Automation Priorities - Decide which manual processes to automate first (e.g., rightsizing, placement, quota enforcement) to reduce toil and improve efficiency.
- SLO & Error Budget Targets - Set reliability objectives for capacity-related metrics and determine acceptable risk levels for change management.
- Optimisation Strategy - Choose cost-saving measures (e.g., rightsizing, decommissioning, reserved capacity) while balancing performance and resilience.
- Tooling & Integration Choices - Determine which platforms (e.g., VMware Aria Operations, ServiceNow, Power BI) and scripts will form the core of the capacity management capability.
- Governance & Compliance Controls - Establish policies for capacity requests, approvals, and audit readiness.
- Reporting & Communication Cadence - Decide how often and in what format capacity health, risks, and forecasts are shared with stakeholders.
- Change Freeze & Risk Mitigation - Make calls on when to pause non-essential changes based on capacity risk or error budget breaches.
- Continuous Improvement Roadmap - Prioritise enhancements to forecasting accuracy, automation coverage, and stakeholder experience.
- Proven track record in capacity management for large-scale VMware environments (vSphere, vCenter, vSAN, NSX-T).
- Hands-on experience with VMware Aria Operations (vROps) or similar tools for capacity analytics, forecasting, and…