Senior Manager,Site Reliability Engineering Job Toronto area,Ontario Canada,IT/Tech

Senior Manager, Site Reliability Engineering About Tubi:

Boldly built for every fandom, Tubi is a free streaming service that entertains over 100 million monthly active users. Tubi offers the world's largest collection of Hollywood movies and TV shows, thousands of creator-led stories and hundreds of Tubi Originals made for the most passionate fans. Headquartered in San Francisco and founded in 2014, Tubi is part of Tubi Media Group, a division of Fox Corporation.

About the Role:

Site Reliability Engineering (SRE) at Tubi is not a traditional operations team. We are a software engineering organization that applies a developer's mindset and toolkit to the challenges of building and running large-scale, distributed systems. Our mission is to engineer resilience from the ground up, enabling our product teams to innovate rapidly while ensuring our users have a stellar experience. We own the availability, latency, performance, and capacity of our platform, and we achieve our goals through a culture of data-driven decision-making, blameless learning, and relentless automation.

What

You'll Do:

Lead, mentor, and grow a team of Site Reliability Engineers. Foster a culture of innovation and technical excellence where engineers feel empowered to do their best work. Provide personalized coaching, create professional development plans, and guide the careers of senior and emerging talent within the team.
Establish equitable, sustainable on-call practices (including global coverage where applicable) that protect focus time and avoid burnout.
Define team rituals - runbook reviews, game days, and incident retros - that reinforce quality and learning.
Strategic Planning & Vision: Define and drive the multi-year technical strategy and vision for Tubi’s observability, and automation platforms. Partner with infra lead to align Tubi’s infrastructure & SRE roadmap. Partner with tech leaders to align the SRE roadmap with business objectives. Champion a data-driven approach to reliability, using Service Level Objectives (SLOs) and error budgets to facilitate productive conversations about risk and feature velocity.
Operational Excellence & Incident Management:
Own the end-to-end availability, performance, and efficiency of our critical user-facing services. Evolve our incident response practice to reduce Mean Time to Resolution (MTTR) and Mean Time Between Failures (MTBF). Champion a rigorous, blameless, and data-driven post-mortem culture to ensure we learn from both successes and failures, driving eng teams for systemic fixes and automation to prevent the recurrence of incidents.
Streamline and improve our existing processes and practices, and collaborate with other teams to enhance our production release standards by improving current processes.
Define and tune a 24×7 on-call rotation for low noise and fast response; act as executive escalation partner during major incidents.
Own disaster-recovery strategy (playbooks, failover drills, recovery simulations) and track SLO gaps with time-bound remediations.
Financial & Vendor Management: Own the SRE budget, tooling, and headcount. Manage relationships with key third-party vendors for our observability and SRE related AI platforms, work with infra lead and finance team for contract negotiations and ensure we derive maximum value from our investments.
Cross-Functional Collaboration: Act as a key influencer and strategic partner to leaders in Software Engineering, Product Management, and Infra/Sec. Drive the adoption of SRE best practices and principles throughout the organization, ensuring new services are designed for reliability, scalability, and observability from day one.

Your Background:

8+ years of experience in a technical field, with at least 3+ years in an engineering leadership position managing SRE, Dev Ops, or Production Engineering teams.
A deep, principled understanding of SRE tenets, including Service Level Indicators (SLIs), SLOs, error budgets, toil reduction, and capacity planning.
Exceptional communication, negotiation, and influencing skills, with the ability to articulate complex technical concepts and strategies to both technical and non-technical stakeholders…


Increase/decrease your Search Radius (miles)



Job Posting Language