
HPC Solutions Architect

Job in San Francisco, San Francisco County, California, 94199, USA
Listing for: Lavendo
Full Time position
Listed on 2026-03-01
Job specializations:
  • IT/Tech
    Systems Engineer, Cloud Computing
Salary/Wage Range or Industry Benchmark: USD 125,000 – 150,000 per year
Job Description & How to Apply Below

About the Company

Our client is building the kind of infrastructure most engineers only read about. They run an AI‑centric cloud that combines huge GPU clusters, high‑speed networks, and cloud‑native tooling into a platform used by enterprises, fast‑growing startups, and advanced research teams. The focus is simple: make it possible to train and run serious AI and simulation workloads without every customer having to build their own supercomputer.

They’re publicly traded and growing quickly with R&D hubs across North America, Europe, and the Middle East. The culture is very engineering‑driven: low on bureaucracy, high on ownership, and built around people who like hard infrastructure problems and seeing their work show up in real customer workloads. You’ll be working with colleagues who care about doing things properly at scale, not just shipping another dashboard.

The Opportunity – HPC Specialist Solutions Architect (Remote from the US)

You’ll be the person customers turn to when they want to stand up or scale out serious GPU and HPC environments in the cloud: multi‑rack clusters, fast interconnects, complex scheduling, and demanding SLAs around throughput and latency.

As an HPC Specialist Solutions Architect, you’ll design and tune next‑generation platforms for AI training, large simulations, and data‑heavy workloads. You’ll work directly with NVIDIA’s latest hardware (Hopper, Blackwell, and successors), NVLink/NVSwitch topologies, and InfiniBand/RoCE fabrics, and you’ll have a real say in how the platform and reference architectures evolve. If you enjoy going from “here’s the workload” to “here’s the cluster and how we squeeze the last 20–30% out of it,” this will feel like home.

What You’ll Work On
  • Design real clusters: Architect and implement HPC clusters for AI, simulation, and distributed training using Kubernetes and schedulers like Slurm. You’ll think about everything from node types and GPU topology to queues, partitions, and failure modes.
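To make the scheduling side of this concrete, here is a minimal sketch of the kind of Slurm batch script such a cluster would run. The partition name and training script are hypothetical placeholders, not anything specific to the client’s environment:

```shell
#!/bin/bash
# Hypothetical multi-node distributed-training job on a GPU partition.
# Partition name and script path are placeholders for illustration only.
#SBATCH --job-name=train-model
#SBATCH --partition=gpu-h100        # hypothetical partition of 8-GPU Hopper nodes
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8         # one task per GPU
#SBATCH --gres=gpu:8                # request all 8 GPUs on each node
#SBATCH --time=24:00:00

# srun launches 32 processes (4 nodes x 8 GPUs); the collective-communication
# layer (e.g. NCCL) then rides the fast fabric between nodes.
srun python train.py
```

Designing the partitions, QOS limits, and preemption rules behind a script like this is exactly the “queues, partitions, and failure modes” work the bullet describes.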

  • Shape GPU‑accelerated infrastructure: Integrate NVIDIA Hopper and Blackwell‑class GPUs with NVLink/NVSwitch and InfiniBand/RoCE, making sure the hardware layout actually matches the communication patterns of the workloads you run.

  • Automate GPU and network lifecycle: Deploy and manage GPU Operator and Network Operator so that drivers, CUDA, firmware, and high‑speed networking are consistent and automated across large fleets, not managed box by box.
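As a rough sketch of what “fleet-wide, not box by box” looks like in practice, the NVIDIA GPU Operator and Network Operator are typically installed into Kubernetes via Helm; release and namespace names below are illustrative choices, not a prescribed layout:

```shell
# Add NVIDIA's Helm repository (standard source for both operators).
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# GPU Operator: reconciles drivers, container toolkit, and device plugins
# across every GPU node in the cluster.
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --wait

# Network Operator: manages the RDMA/InfiniBand driver stack the same way.
helm install network-operator nvidia/network-operator \
  --namespace nvidia-network-operator --create-namespace \
  --wait
```

Once installed, upgrading a driver or firmware component becomes a Helm values change rolled out by the operator, rather than a per-node SSH session.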

  • Make the cloud behave like a supercomputer: Design and validate cloud‑native HPC environments that still deliver low latency, high bandwidth, and predictable scheduling. You’ll look at utilization, preemption, fragmentation, and squeeze out performance.

  • Set the standard for AI/HPC architectures: Define and document reference architectures for AI model training, data pipelines, and MLOps, including observability and CI/CD. When customers ask “how should we do this?”, your work will be what “good” looks like.

  • Work directly with vendors and partners: Collaborate with NVIDIA and other partners to evaluate new GPU generations, interconnects, and software stacks. You’ll help decide what is ready for prime time and under which conditions.

  • Debug the hard problems: Benchmark performance, track down bottlenecks across compute, network, and storage, and recommend concrete changes that move the needle—not just check a box.

  • Be a trusted voice to customers: Lead design sessions, architecture reviews, and operational excellence check‑ins with customers who care a lot about performance and reliability. You’ll translate between “this job keeps timing out” and “here’s what we’ll change in the topology and scheduler.”

What You Bring
  • A Bachelor’s or Master’s in Computer Science, Engineering, or a related field (PhD is a plus).

  • 3+ years actually building or running HPC or large GPU clusters—on‑prem, cloud, or hybrid. You’ve owned outcomes, not just submitted jobs.

  • Strong Linux background, plus Kubernetes and container runtimes (containerd, CRI‑O, Docker) in real environments, with CI/CD in the loop.

  • A solid handle on HPC networking and RDMA: InfiniBand, RoCE, NVLink/NVSwitch. You understand why topology and fabric design matter, and…
