Director of Engineering, Inference Services

Job in Sunnyvale, Santa Clara County, California, 94087, USA
Listing for: CoreWeave
Full Time position
Listed on 2026-01-12
Job specializations:
  • IT/Tech
    Systems Engineer, AI Engineer, Data Engineer
Salary/Wage Range or Industry Benchmark: 150,000 – 200,000 USD yearly
Job Description

Director of Engineering, Inference Services

CoreWeave is The Essential Cloud for AI™. Built for pioneers by pioneers, CoreWeave delivers a platform of technology, tools, and teams that enables innovators to build and scale AI with confidence. Trusted by leading AI labs, startups, and global enterprises, CoreWeave combines superior infrastructure performance with deep technical expertise to accelerate breakthroughs and turn compute into capability. Founded in 2017, CoreWeave became a publicly traded company (Nasdaq: CRWV) in March 2025.

About This Role

CoreWeave is looking for a Director of Engineering to own and scale our next-generation Inference Platform. In this highly technical, strategic role you will lead a world‑class engineering organization to design, build, and operate the fastest, most cost‑efficient, and most reliable GPU inference services in the industry. Your charter spans everything from model‑serving runtimes (e.g., Triton, vLLM, TensorRT‑LLM) and autoscaling micro‑batch schedulers to developer‑friendly SDKs and airtight, multi‑tenant security – all delivered on CoreWeave's unique accelerated‑compute infrastructure.
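To give a concrete flavor of one of the runtimes named above, here is a minimal offline-generation sketch against vLLM's public Python API; the model checkpoint and prompt are illustrative placeholders, not CoreWeave specifics.

    # Minimal vLLM generation sketch; model and prompt are placeholders.
    from vllm import LLM, SamplingParams

    llm = LLM(model="facebook/opt-125m")  # any HF-compatible checkpoint
    params = SamplingParams(temperature=0.0, max_tokens=64)

    for out in llm.generate(["Explain GPU inference in one sentence."], params):
        print(out.outputs[0].text)

A production platform wraps this kind of engine behind request routing, batching, autoscaling, and observability, which the responsibilities below make explicit.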

What You’ll Do
  • Vision & Roadmap – Define and continuously refine the end‑to‑end Inference Platform roadmap, prioritizing low‑latency, high‑throughput model serving and world‑class developer UX. Set technical standards for runtime selection, GPU/CPU heterogeneity, quantization, and model‑optimization techniques.
  • Platform Architecture – Design and implement a global, Kubernetes‑native inference control plane that delivers < 50 ms P99 latencies. Build adaptive micro‑batching, request‑routing, and autoscaling mechanisms that maximize GPU utilization while meeting strict SLAs (see the toy batching sketch after this list). Integrate model‑optimization pipelines (TensorRT, ONNX Runtime, BetterTransformer, AWQ, etc.) for frictionless deployment.
  • Runtime Optimizations – Implement state‑of‑the‑art runtime optimizations – including speculative decoding, KV‑cache reuse across batches, early‑exit heuristics, and tensor‑parallel streaming – to squeeze every microsecond out of LLM inference while retaining accuracy (a toy speculative‑decoding sketch follows this list).
  • Operational Excellence – Establish SLO/SLA dashboards, real‑time observability, and self‑healing mechanisms for thousands of models across multiple regions. Drive cost‑performance trade‑off tooling that makes it trivial for customers to choose the best hardware tier for each workload (an instrumentation sketch follows this list).
  • Leadership – Hire, mentor, and grow a diverse team of engineers and managers passionate about large‑scale AI inference. Foster a customer‑obsessed, metrics‑driven engineering culture with crisp design reviews and blameless post‑mortems.
  • Collaboration – Partner closely with Product, Orchestration, Networking, and Security teams to deliver a unified CoreWeave experience. Engage directly with flagship customers (internal and external) to gather feedback and shape the roadmap.
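To ground the micro‑batching responsibility above, here is a toy asyncio sketch (an illustration, not CoreWeave's implementation): requests that arrive within a short window are coalesced into one fused call, trading a few milliseconds of queueing for much higher GPU utilization. The batch_fn callable is a hypothetical stand-in for a batched GPU forward pass.

    import asyncio

    class MicroBatcher:
        """Coalesce concurrent requests into one batched call.

        batch_fn: hypothetical stand-in for a fused GPU forward pass
        (list of inputs -> list of outputs, same order)."""

        def __init__(self, batch_fn, max_batch_size=8, max_wait_ms=5.0):
            self.batch_fn = batch_fn
            self.max_batch_size = max_batch_size
            self.max_wait = max_wait_ms / 1000.0
            self.queue = asyncio.Queue()

        async def submit(self, item):
            fut = asyncio.get_running_loop().create_future()
            await self.queue.put((item, fut))
            return await fut

        async def run(self):
            while True:
                batch = [await self.queue.get()]         # block for the first request
                deadline = asyncio.get_running_loop().time() + self.max_wait
                # Fill the batch until it is full or the wait window closes.
                while len(batch) < self.max_batch_size:
                    timeout = deadline - asyncio.get_running_loop().time()
                    if timeout <= 0:
                        break
                    try:
                        batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                    except asyncio.TimeoutError:
                        break
                outputs = self.batch_fn([item for item, _ in batch])  # one fused call
                for (_, fut), out in zip(batch, outputs):
                    fut.set_result(out)

    async def demo():
        batcher = MicroBatcher(lambda xs: [x * 2 for x in xs])
        worker = asyncio.create_task(batcher.run())
        print(await asyncio.gather(*(batcher.submit(i) for i in range(20))))
        worker.cancel()

    asyncio.run(demo())

One tuning note: max_wait_ms bounds how long any request waits for batch-mates, so it is precisely the knob that trades P99 latency against GPU utilization.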
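The speculative decoding named in the Runtime Optimizations bullet can be sketched in its simplest greedy form: a cheap draft model proposes k tokens, and the target model keeps the longest agreeing prefix plus its own token at the first mismatch, so the output matches plain greedy decoding with the target model. draft_next and target_next are hypothetical callables over token lists, not a real model API.

    def speculative_decode(target_next, draft_next, prompt, max_new=32, k=4):
        """Greedy speculative decoding (toy sketch).

        draft_next(seq)  -> next token under a small, cheap draft model
        target_next(seq) -> next token under the large target model
        May overshoot max_new by up to k-1 tokens; fine for a sketch."""
        seq = list(prompt)
        while len(seq) - len(prompt) < max_new:
            # 1. Draft k tokens cheaply with the small model.
            draft = []
            for _ in range(k):
                draft.append(draft_next(seq + draft))
            # 2. Verify: in production the target scores all k positions in
            #    one batched forward pass; here we loop for clarity.
            accepted = []
            for tok in draft:
                expected = target_next(seq + accepted)
                accepted.append(expected)      # the target's token always stands
                if tok != expected:            # first mismatch ends the step
                    break
            seq.extend(accepted)
        return seq

    # Toy demo: the "models" just count; the draft agrees on some positions.
    target = lambda s: len(s)
    draft = lambda s: len(s) if len(s) % 4 == 0 else len(s) + (len(s) % 2)
    print(speculative_decode(target, draft, [0], max_new=8))

Because verifying all k draft positions can run as one batched target forward pass, each expensive step can commit several tokens instead of one.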
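For the SLO dashboards in the Operational Excellence bullet, here is a minimal instrumentation sketch with the prometheus_client library; the metric name and bucket edges are illustrative choices, not a prescribed schema.

    import random, time
    from prometheus_client import Histogram, start_http_server

    # Latency histogram with buckets placed around a ~50 ms P99 target
    # (bucket edges here are illustrative assumptions).
    LATENCY = Histogram(
        "inference_request_latency_seconds",
        "End-to-end inference request latency",
        buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0),
    )

    @LATENCY.time()                    # records each call's duration
    def handle_request():
        time.sleep(random.uniform(0.005, 0.04))   # stand-in for model serving

    start_http_server(8000)            # exposes /metrics for Prometheus to scrape
    while True:
        handle_request()

A dashboard would then derive P99 from these buckets with PromQL's histogram_quantile, which is what makes a < 50 ms P99 SLO directly observable.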
Who You Are
  • 10+ years building large-scale distributed systems or cloud services, with 5+ years leading multiple engineering teams.
  • Proven success delivering mission‑critical model‑serving or real‑time data‑plane services (e.g., Triton, TorchServe, vLLM, Ray Serve, SageMaker Inference, GCP Vertex Prediction).
  • Deep understanding of GPU/CPU resource isolation, NUMA‑aware scheduling, micro‑batching, and low‑latency networking (gRPC, QUIC, RDMA).
  • Track record of optimizing cost‑per‑token / cost‑per‑request and hitting sub‑100 ms global P99 latencies.
  • Expertise in Kubernetes, service meshes, and CI/CD for ML workloads; familiarity with Slurm, Kueue, or other schedulers a plus.
  • Hands‑on experience with LLM optimization (quantization, compilation, tensor parallelism, speculative decoding) and hardware‑aware model compression.
  • Excellent communicator who can translate deep technical concepts into clear business value for C‑suite and engineering audiences.
  • Bachelor’s or Master’s in CS, EE, or related field (or equivalent practical experience).
Nice-to-Have
  • Experience operating multi‑region inference fleets at a cloud provider or hyperscaler.
  • Contributions to open-source inference or MLOps projects.
  • Familiarity with observability stacks (Prometheus, Grafana, OpenTelemetry) for…