Sr Service Reliability Engineer
Listed on 2026-01-13
IT/Tech
Systems Engineer, Cloud Computing
Music is Universal
It’s the passionate and dedicated team at Universal Music who help make us the world’s leading music company. From A&R to finance, legal to digital, sales to marketing, Universal Music is the place to grow and develop your career within a truly commercial and innovative business that leads in everything it does.
Everyone is welcome to apply for our roles, and we are determined to ensure that no applicant or employee receives less favourable treatment because of gender, race, disability, sexual orientation, religion, belief, age, marital status, background, pregnancy, or caring responsibilities. We also recognise the importance of diversity of thought within our teams and are fully committed to embracing the talents of people with autism, dyslexia, ADHD, and other forms of neurocognitive variation.
We will always seek to make appropriate adjustments to recruitment, workplaces, and work processes to be fully inclusive to people with different needs and working styles. If you need us to make any reasonable adjustments for you from application onwards, including alternatives to the online form or to disclose a neurocognitive condition, please email Uni
Job Summary
As a key member of our Global Technical Operations team, you will be the ultimate escalation point and subject matter expert for all SRE operations. This senior technical role requires a strategic mindset, deep expertise in System Reliability Engineering, and the ability to blend a software engineering mindset with operational expertise to engineer solutions that improve system reliability, automate complex processes, and reduce manual toil.
You will drive the operational strategy for SRE implementation at UMG and ensure the services that connect artists and fans around the globe are always on.
System Reliability & Performance:
- Design, build, and maintain the availability, scalability, and performance of critical services.
- Develop and maintain robust monitoring, alerting, and observability systems (e.g., using AWS CloudWatch, Dynatrace) to ensure rapid issue detection and resolution.
- Monitor infrastructure capacity and performance, providing analysis and suggestions for service delivery improvement.
Automation & Efficiency:
- Drive the automation of repetitive operational tasks, including infrastructure provisioning, deployments, and scaling.
- Create and maintain scripts and custom code to support and enhance our operational toolset.
- Support and optimize CI/CD pipelines to improve deployment speed and reliability.
Incident Management & Collaboration:
- Participate in an on‑call rotation to troubleshoot and mitigate production incidents.
- Lead post‑incident reviews and root cause analyses to implement lasting solutions.
- Partner with engineering and IT stakeholders to embed SRE best practices (SLOs, error budgets) into the design and development lifecycle.
Advanced Escalation and Strategic Troubleshooting:
- Act as the final escalation point for SRE operations, leading resolution of complex, critical incidents and coordinating cross‑functional teams.
- Design, implement, and refine escalation management processes for the entire Global Technical Operations Center.
- Conduct deep‑dive root cause analysis for recurring, complex problems and develop long‑term solutions including automation and architectural changes.
Leadership & Mentoring:
- Serve as a technical leader and mentor to junior engineers.
- Develop and lead training sessions on advanced security concepts, threat landscapes, and internal best practices.
- Foster a culture of continuous learning and operational excellence within the team.
Architecture & Standards:
- Partner with DevOps and application architects to influence and enforce standards, ensuring new and existing systems are built on Infrastructure as Code principles.
- Identify opportunities for network automation, scripting, and tool development to streamline operational tasks and improve efficiency.
- Create and maintain comprehensive documentation for configurations, SOPs, and incident response protocols.
Communication & Stakeholder Management:
- Communicate effectively with technical and non‑technical stakeholders, including senior…