AI Performance Engineer
Listed on 2026-02-28
IT/Tech
AI Engineer, Systems Engineer, Machine Learning/ML Engineer
Cornelis Networks delivers the world’s highest performance scale-out networking solutions for AI and HPC datacenters. Our differentiated architecture seamlessly integrates hardware, software and system level technologies to maximize the efficiency of GPU, CPU and accelerator-based compute clusters at any scale. Our solutions drive breakthroughs in AI & HPC workloads, empowering our customers to push the boundaries of innovation. Backed by top-tier venture capital and strategic investors, we are committed to innovation, performance and scalability - solving the world’s most demanding computational challenges with our next-generation networking solutions.
We are a fast-growing, forward-thinking team of architects, engineers, and business professionals with a proven track record of building successful products and companies. As a global organization, our team spans multiple U.S. states and six countries, and we continue to expand with exceptional talent in onsite, hybrid, and fully remote roles.
We’re seeking an AI Performance Engineer who will optimize training and multi-node inference across next‑gen networking silicon and systems—adapters, switches, and the software stack that ties it all together. You’ll partner with architecture, firmware, software, and lighthouse customers to turn lab results into field‑proven wins, with an emphasis on distributed serving architectures and P99‑aware optimizations.
Key Responsibilities
- Own end-to-end performance for distributed AI workloads (training and multi-node inference) across multi-node clusters and diverse fabrics (Omni-Path, Ethernet, InfiniBand).
- Benchmark, characterize, and tune open-source and industry workloads (e.g., Llama, Mixtral, diffusion models, BERT/T5, MLPerf) on current and future compute, storage, and network hardware, including vLLM/TensorRT-LLM/Triton serving paths.
- Design and optimize distributed serving topologies (sharded/replicated, tensor/pipeline parallel, MoE expert placement), continuous/adaptive batching, KV-cache sharding/offload (CPU/NVMe) and prefix caching, and token streaming under tight p99/p999 SLOs.
- Optimize inference at the fabric level: validate RDMA/GPUDirect RDMA, congestion control, and collective vs. point-to-point tradeoffs during inference.
- Design experiment plans to isolate scaling bottlenecks (collectives, kernel hot spots, I/O, memory, topology) and deliver clear, actionable deltas with latency‑SLO dashboards and queuing analysis.
- Build crisp proof points that compare Cornelis Omni-Path to competing interconnects; translate data into narratives for sales/marketing and lighthouse customers, including cost-per-token and tokens/sec-per-watt for serving.
- Instrument and visualize performance (Nsight Systems, ROCm/Omnitrace, VTune, perf, eBPF, RCCL/NCCL tracing, app timers) plus serving telemetry (Prometheus/Grafana, OpenTelemetry traces, concurrency/queue depth).
- Evangelize best practices through briefs, READMEs, and conference-level presentations on distributed inference patterns and anti-patterns.
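The responsibilities above center on a few simple serving metrics—p99/p999 latency and tokens/sec-per-watt. As a minimal illustrative sketch (not Cornelis tooling; function names and inputs are hypothetical), here is how those numbers fall out of raw per-request telemetry:

```python
# Illustrative sketch, not production tooling: computing the latency-SLO and
# efficiency metrics described above from raw serving telemetry.

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, int(round(p / 100 * len(s))) - 1))
    return s[k]

def serving_report(latencies_ms, tokens_generated, wall_clock_s, avg_power_w):
    """Summarize tail latency and tokens/sec-per-watt for one serving run."""
    tps = tokens_generated / wall_clock_s
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p99_ms": percentile(latencies_ms, 99),
        "p999_ms": percentile(latencies_ms, 99.9),
        "tokens_per_s": tps,
        "tokens_per_s_per_watt": tps / avg_power_w,
    }
```

Nearest-rank is the simplest percentile definition; production dashboards (e.g., Prometheus histograms) typically interpolate, but the p99-vs-median gap this exposes is the same signal the queuing analysis above is after.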
Qualifications
- B.S. in CS/EE/CE/Math or a related field
- 5–7+ years running AI/ML at cluster scale.
- Proven ability to set up, run, and analyze AI benchmarks; deep intuition for message passing, collectives, scaling efficiency, and bottleneck hunting for both training and low‑latency serving.
- Hands‑on with distributed training beyond single‑GPU (DP/TP/PP, ZeRO, FSDP, sharded optimizers) and distributed inference architectures (replicated vs sharded, tensor/KV parallel, MoE).
- Practical experience across AI stacks and comms: PyTorch, DeepSpeed, Megatron‑LM, PyTorch Lightning; RCCL/NCCL, MPI/Horovod; Triton Inference Server, vLLM, TensorRT-LLM, Ray Serve, KServe.
- Comfortable with compilers (GCC, LLVM, Intel oneAPI) and MPI stacks; Python and shell power user.
- Familiarity with network architectures (Omni-Path/OPA, InfiniBand, Ethernet/RDMA/RoCE) and Linux systems at the performance‑tuning level, including NIC offloads, CQ moderation, pacing, and ECN/RED.
- Excellent written and verbal communication: turn measurements into persuasion with SLO-driven narratives for inference.
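The bottleneck-hunting intuition asked for above usually starts from two back-of-envelope numbers: collective bus bandwidth and scaling efficiency. A hedged sketch (the ring-allreduce bus-bandwidth convention follows the nccl-tests definition; all numbers are hypothetical):

```python
# Illustrative sketch with hypothetical numbers, not a Cornelis benchmark.

def allreduce_busbw(size_bytes, time_s, n_ranks):
    """Allreduce bus bandwidth in GB/s.

    Algorithm bandwidth (bytes moved per second) scaled by 2*(n-1)/n,
    the nccl-tests convention reflecting ring-allreduce traffic per link.
    """
    algbw = size_bytes / time_s
    return algbw * 2 * (n_ranks - 1) / n_ranks / 1e9

def scaling_efficiency(tput_1node, tput_n_nodes, n_nodes):
    """Measured n-node throughput as a fraction of ideal linear scaling."""
    return tput_n_nodes / (tput_1node * n_nodes)
```

Comparing bus bandwidth against the fabric's line rate, and scaling efficiency against 1.0, quickly separates "the collective is slow" from "the workload stopped scaling"—the first cut before reaching for the profilers listed above.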
Preferred Qualifications
- M.S. in CS/EE/CE/Math or a related field
- Scheduler expertise (SLURM, PBS) and multi‑tenant…