
Lead GenAI Engineer

Job in Cupertino, Santa Clara County, California, 95014, USA
Listing for: Apple Inc.
Full Time position
Listed on 2026-01-12
Job specializations:
  • Software Development
    AI Engineer, Machine Learning/ML Engineer
Salary/Wage Range or Industry Benchmark: 125000 - 150000 USD Yearly
Job Description & How to Apply Below

Cupertino, California, United States | Software and Services

Imagine what you could do here. At Apple, innovative ideas have a way of becoming extraordinary products, services, and customer experiences very quickly. Bring passion and dedication to your job and there's no telling what you could accomplish! As part of Apple Cloud AI, we are building the next generation of ML infrastructure that powers AI capabilities across Apple's products and services.

Our team tackles some of the most challenging problems in the industry – optimizing LLM inference at massive scale, building distributed training systems that push the boundaries of GPU and TPU utilization, and architecting model serving platforms that deliver sub‑millisecond latency for real‑time AI experiences.

You'll work with cutting‑edge technologies including vLLM, Ray, TensorRT‑LLM, TPU Infrastructure, and custom inference engines, while shaping how foundation models are trained, fine‑tuned, and deployed across Apple's ecosystem. As a Lead GenAI/ML Engineer, you will architect high‑performance ML systems from the ground up – designing efficient KV‑cache strategies, implementing speculative decoding, optimizing tensor parallelism across GPU and TPU clusters, and building the infrastructure that brings Apple's most ambitious AI capabilities to life.

Description

This role requires translating cutting‑edge ML research into production‑ready systems that meet the demanding requirements of Apple's ML workloads. You will work closely with research teams to productionize new model architectures and optimization techniques. We are looking for candidates who thrive at the intersection of ML research and systems engineering – people who can read a paper on FlashAttention or PagedAttention and implement a production‑grade version, or who can profile a training job and identify opportunities to improve GPU utilization from 40% to 80%.

Responsibilities
  • In this role, you will have significant responsibilities in advancing the technical capabilities of Apple Cloud AI by building robust, scalable ML infrastructure. You will influence the technical direction of our ML platform by driving innovation in distributed training, inference optimization, and model serving systems.
  • LLM Inference Optimization:
    Design and implement high‑performance inference pipelines, including KV‑cache optimization, continuous batching, speculative decoding, and quantization strategies (INT8, FP8, AWQ, GPTQ).
  • Distributed Training Systems:
    Build and optimize large‑scale training infrastructure across GPU and TPU clusters, implementing efficient data/tensor/pipeline parallelism strategies.
  • Model Serving at Scale:
    Architect low‑latency serving systems capable of handling Apple‑scale traffic with strict SLA requirements.
  • Hardware‑Aware Optimization:
    Deep optimization for NVIDIA GPUs (H100, B200) and Google TPUs, including custom CUDA kernels and XLA optimizations.
Minimum Qualifications
  • 8+ years of experience in ML systems engineering, with at least 3 years focused on LLM/GenAI infrastructure.
  • Deep expertise in LLM inference optimization: KV‑cache management, batching strategies, quantization, speculative decoding.
  • Strong proficiency in Python and C++/CUDA for performance‑critical code.
  • Hands‑on experience with inference frameworks: vLLM, TensorRT‑LLM, Triton Inference Server, or equivalent.
  • Experience with distributed training at scale using frameworks like DeepSpeed, Megatron‑LM, FSDP, or Ray Train.
  • Solid understanding of transformer architectures and attention mechanisms at the implementation level.
  • Experience optimizing ML workloads on NVIDIA GPUs (profiling, memory optimization, kernel tuning).
  • Track record of taking ML systems from research/prototype to production at scale.
  • MS or PhD in Computer Science, Machine Learning, or equivalent practical experience.
Preferred Qualifications
  • Experience with TPU infrastructure (JAX/XLA, TPU training/serving optimization).
  • Contributions to open‑source ML infrastructure projects (vLLM, Ray, TensorRT‑LLM, etc.).
  • Experience with custom CUDA kernel development or Triton (OpenAI).
  • Deep knowledge of model compression techniques: pruning, distillation,…