Technical Program Manager – GenAI Ops & Capacity Planning
Listed on 2026-02-23
IT/Tech
AI Engineer, Data Science Manager, Data Analyst, Cloud Computing
At Databricks, we are passionate about enabling data teams to solve the world’s toughest problems — from making the next mode of transportation a reality to accelerating medical breakthroughs. We do this by building and operating the world’s best data and AI infrastructure platform so our customers can turn deep data insights into real business impact. Founded by engineers and deeply customer-obsessed, we thrive on solving hard technical challenges, from next-generation data experiences to operating infrastructure at massive global scale.
And we’re only getting started.
Databricks is looking for a Staff Technical Program Manager to drive GenAI Operations and Capacity Planning for our large-scale LLM and GPU-backed platform. This role is designed for a senior, hands-on TPM who thrives in technically deep, data-driven environments and enjoys owning complex operational programs end to end.
As a Staff TPM, you will own execution for critical GenAI operational initiatives, operate with significant autonomy, and partner closely with AI/ML engineering, infrastructure, finance, partner ops, and cloud/LLM providers. You will use strong analytical skills to guide decisions, surface risks, and continuously improve how Databricks launches, scales, and governs GenAI workloads.
You will report to a Technical Program Leader and operate across multiple time zones in a fast-moving, highly ambiguous environment.
What You’ll Do
GenAI & LLM Operations
- Plan and execute day-0 launches of new LLM models on Databricks, ensuring production readiness across engineering, commercialization, go-to-market, legal, and cloud service partners.
- Partner with AI/ML and platform engineering teams to operationalize LLM onboarding, rollout, and lifecycle management.
- Define and maintain launch checklists, operational runbooks, and success metrics for GenAI workloads.
- Own GPU and LLM capacity planning, forecasting, and allocation for GenAI workloads.
- Build and maintain SQL-driven analytical models and dashboards to forecast demand, track utilization, and surface capacity risks.
- Balance customer demand, growth trajectories, and contractual commitments to inform short- and medium-term capacity decisions.
- Track and drive efficient consumption of GPU and LLM capacity, identifying underutilization, contention, and inefficiencies.
- Define and monitor KPIs for utilization, efficiency, and reliability of GenAI platforms.
- Use data to recommend improvements to engineering roadmaps, operational processes, and cost optimization efforts.
- Execute governance mechanisms to ensure GenAI capacity usage aligns with contractual, financial, and compliance requirements.
- Produce clear, data-backed reporting for senior leaders on capacity health, utilization trends, and operational risks.
- Generate consumption reports, usage metrics reports, and share-of-wallet attestations.
- Ensure documentation, controls, and processes are audit-ready and consistently followed.
Minimum Qualifications
- 10+ years of overall industry experience, including 7+ years in Technical Program Management.
- Experience leading cross-functional GenAI, AI/ML, or infrastructure programs from planning through launch and steady-state operations.
- Strong background in capacity planning, forecasting, and infrastructure analytics.
- Advanced SQL skills and hands-on experience building analytics, dashboards, and operational reporting.
- Ability to translate complex data into clear insights and recommendations for engineering and leadership stakeholders.
- Hands-on experience with at least one major cloud provider: AWS, Azure, or GCP.
- Familiarity with agile methodologies and program management tools such as Jira.
- Comfortable managing ambiguity, driving execution, and handling escalations when needed.
- Master’s degree or advanced technical degree.
- Experience operating LLM, GPU, or GenAI platforms in production environments.
- Background in cloud infrastructure, distributed systems, or platform engineering.
- Previous software or hardware development experience.
Databricks is the data and…