Director of Engineering, Inference Services

Job in Sunnyvale, Santa Clara County, California, 94087, USA
Listing for: CoreWeave
Full Time position
Listed on 2026-01-12
Job specializations:
  • IT/Tech
    Systems Engineer, AI Engineer, Data Engineer
Salary/Wage Range or Industry Benchmark: 150,000 – 200,000 USD yearly
Job Description

Director of Engineering, Inference Services

CoreWeave is The Essential Cloud for AI™. Built for pioneers by pioneers, CoreWeave delivers a platform of technology, tools, and teams that enables innovators to build and scale AI with confidence. Trusted by leading AI labs, startups, and global enterprises, CoreWeave combines superior infrastructure performance with deep technical expertise to accelerate breakthroughs and turn compute into capability. Founded in 2017, CoreWeave became a publicly traded company (Nasdaq: CRWV) in March 2025.

About This Role

CoreWeave is looking for a Director of Engineering to own and scale our next-generation Inference Platform. In this highly technical, strategic role you will lead a world‑class engineering organization to design, build, and operate the fastest, most cost‑efficient, and most reliable GPU inference services in the industry. Your charter spans everything from model‑serving runtimes (e.g., Triton, vLLM, TensorRT‑LLM) and autoscaling micro‑batch schedulers to developer‑friendly SDKs and airtight, multi‑tenant security – all delivered on CoreWeave's unique accelerated‑compute infrastructure.
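To give a concrete flavor of one of the runtimes named above, here is a minimal offline-generation sketch against vLLM's public Python API; the model checkpoint and prompt are illustrative placeholders, not CoreWeave specifics.

    # Minimal vLLM generation sketch; model and prompt are placeholders.
    from vllm import LLM, SamplingParams

    llm = LLM(model="facebook/opt-125m")  # any HF-compatible checkpoint
    params = SamplingParams(temperature=0.0, max_tokens=64)

    for out in llm.generate(["Explain GPU inference in one sentence."], params):
        print(out.outputs[0].text)

A production platform wraps this kind of engine behind request routing, batching, autoscaling, and observability, which the responsibilities below make explicit.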

What You’ll Do
  • Vision & Roadmap – Define and continuously refine the end‑to‑end Inference Platform roadmap, prioritizing low‑latency, high‑throughput model serving and world‑class developer UX. Set technical standards for runtime selection, GPU/CPU heterogeneity, quantization, and model‑optimization techniques.
  • Platform Architecture – Design and implement a global, Kubernetes‑native inference control plane that delivers < 50 ms P99 latencies. Build adaptive micro‑batching, request‑routing, and autoscaling mechanisms that maximize GPU utilization while meeting strict SLAs (see the toy batching sketch after this list). Integrate model‑optimization pipelines (TensorRT, ONNX Runtime, BetterTransformer, AWQ, etc.) for frictionless deployment.
  • Runtime Optimizations – Implement state‑of‑the‑art runtime optimizations – including speculative decoding, KV‑cache reuse across batches, early‑exit heuristics, and tensor‑parallel streaming – to squeeze every microsecond out of LLM inference while retaining accuracy (a toy speculative‑decoding sketch follows this list).
  • Operational Excellence – Establish SLO/SLA dashboards, real‑time observability, and self‑healing mechanisms for thousands of models across multiple regions. Drive cost‑performance trade‑off tooling that makes it trivial for customers to choose the best hardware tier for each workload (an instrumentation sketch follows this list).
  • Leadership – Hire, mentor, and grow a diverse team of engineers and managers passionate about large‑scale AI inference. Foster a customer‑obsessed, metrics‑driven engineering culture with crisp design reviews and blameless post‑mortems.
  • Collaboration – Partner closely with Product, Orchestration, Networking, and Security teams to deliver a unified CoreWeave experience. Engage directly with flagship customers (internal and external) to gather feedback and shape the roadmap.
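To ground the micro‑batching responsibility above, here is a toy asyncio sketch (an illustration, not CoreWeave's implementation): requests that arrive within a short window are coalesced into one fused call, trading a few milliseconds of queueing for much higher GPU utilization. The batch_fn callable is a hypothetical stand-in for a batched GPU forward pass.

    import asyncio

    class MicroBatcher:
        """Coalesce concurrent requests into one batched call.

        batch_fn: hypothetical stand-in for a fused GPU forward pass
        (list of inputs -> list of outputs, same order)."""

        def __init__(self, batch_fn, max_batch_size=8, max_wait_ms=5.0):
            self.batch_fn = batch_fn
            self.max_batch_size = max_batch_size
            self.max_wait = max_wait_ms / 1000.0
            self.queue = asyncio.Queue()

        async def submit(self, item):
            fut = asyncio.get_running_loop().create_future()
            await self.queue.put((item, fut))
            return await fut

        async def run(self):
            while True:
                batch = [await self.queue.get()]         # block for the first request
                deadline = asyncio.get_running_loop().time() + self.max_wait
                # Fill the batch until it is full or the wait window closes.
                while len(batch) < self.max_batch_size:
                    timeout = deadline - asyncio.get_running_loop().time()
                    if timeout <= 0:
                        break
                    try:
                        batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                    except asyncio.TimeoutError:
                        break
                outputs = self.batch_fn([item for item, _ in batch])  # one fused call
                for (_, fut), out in zip(batch, outputs):
                    fut.set_result(out)

    async def demo():
        batcher = MicroBatcher(lambda xs: [x * 2 for x in xs])
        worker = asyncio.create_task(batcher.run())
        print(await asyncio.gather(*(batcher.submit(i) for i in range(20))))
        worker.cancel()

    asyncio.run(demo())

One tuning note: max_wait_ms bounds how long any request waits for batch-mates, so it is precisely the knob that trades P99 latency against GPU utilization.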
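The speculative decoding named in the Runtime Optimizations bullet can be sketched in its simplest greedy form: a cheap draft model proposes k tokens, and the target model keeps the longest agreeing prefix plus its own token at the first mismatch, so the output matches plain greedy decoding with the target model. draft_next and target_next are hypothetical callables over token lists, not a real model API.

    def speculative_decode(target_next, draft_next, prompt, max_new=32, k=4):
        """Greedy speculative decoding (toy sketch).

        draft_next(seq)  -> next token under a small, cheap draft model
        target_next(seq) -> next token under the large target model
        May overshoot max_new by up to k-1 tokens; fine for a sketch."""
        seq = list(prompt)
        while len(seq) - len(prompt) < max_new:
            # 1. Draft k tokens cheaply with the small model.
            draft = []
            for _ in range(k):
                draft.append(draft_next(seq + draft))
            # 2. Verify: in production the target scores all k positions in
            #    one batched forward pass; here we loop for clarity.
            accepted = []
            for tok in draft:
                expected = target_next(seq + accepted)
                accepted.append(expected)      # the target's token always stands
                if tok != expected:            # first mismatch ends the step
                    break
            seq.extend(accepted)
        return seq

    # Toy demo: the "models" just count; the draft agrees on some positions.
    target = lambda s: len(s)
    draft = lambda s: len(s) if len(s) % 4 == 0 else len(s) + (len(s) % 2)
    print(speculative_decode(target, draft, [0], max_new=8))

Because verifying all k draft positions can run as one batched target forward pass, each expensive step can commit several tokens instead of one.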
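For the SLO dashboards in the Operational Excellence bullet, here is a minimal instrumentation sketch with the prometheus_client library; the metric name and bucket edges are illustrative choices, not a prescribed schema.

    import random, time
    from prometheus_client import Histogram, start_http_server

    # Latency histogram with buckets placed around a ~50 ms P99 target
    # (bucket edges here are illustrative assumptions).
    LATENCY = Histogram(
        "inference_request_latency_seconds",
        "End-to-end inference request latency",
        buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0),
    )

    @LATENCY.time()                    # records each call's duration
    def handle_request():
        time.sleep(random.uniform(0.005, 0.04))   # stand-in for model serving

    start_http_server(8000)            # exposes /metrics for Prometheus to scrape
    while True:
        handle_request()

A dashboard would then derive P99 from these buckets with PromQL's histogram_quantile, which is what makes a < 50 ms P99 SLO directly observable.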
Who You Are
  • 10+ years building large-scale distributed systems or cloud services, with 5+ years leading multiple engineering teams.
  • Proven success delivering mission‑critical model‑serving or real‑time data‑plane services (e.g., Triton, TorchServe, vLLM, Ray Serve, SageMaker Inference, GCP Vertex Prediction).
  • Deep understanding of GPU/CPU resource isolation, NUMA‑aware scheduling, micro‑batching, and low‑latency networking (gRPC, QUIC, RDMA).
  • Track record of optimizing cost‑per‑token / cost‑per‑request and hitting sub‑100 ms global P99 latencies.
  • Expertise in Kubernetes, service meshes, and CI/CD for ML workloads; familiarity with Slurm, Kueue, or other schedulers a plus.
  • Hands‑on experience with LLM optimization (quantization, compilation, tensor parallelism, speculative decoding) and hardware‑aware model compression.
  • Excellent communicator who can translate deep technical concepts into clear business value for C‑suite and engineering audiences.
  • Bachelor’s or Master’s in CS, EE, or related field (or equivalent practical experience).
Nice-to-Have
  • Experience operating multi‑region inference fleets at a cloud provider or hyperscaler.
  • Contributions to open-source inference or MLOps projects.
  • Familiarity with observability stacks (Prometheus, Grafana, OpenTelemetry) for…