Principal Architect - AI
Washington, District of Columbia, 20022, USA
Listed on 2026-03-07
IT/Tech
Systems Engineer, Cloud Computing, AI Engineer, Data Engineer
Overview
The Principal Architect leads HPC/AI-focused Professional Services delivery engagements and cross-functional technical teams on customer programs and projects. They are responsible for technical communications with Engineers, Architects, and the customer on AI-driven projects. The Principal Architect may participate in several customer projects concurrently, integrating AI solutions with enterprise IT systems.
Role Summary
The Principal Architect will be at the epicenter of the AI revolution, working with the most advanced hardware on the planet. Whether you’re helping a research facility unlock new scientific breakthroughs or an enterprise build its first private AI cloud, your fingerprints will be on the infrastructure that defines the next decade of technology.
The right person for the job is a senior individual contributor responsible for designing, implementing, and optimizing large‑scale High‑Performance Computing and AI platforms centered on the NVIDIA data center ecosystem. This role operates in a hybrid capacity, combining hands‑on technical architecture with selective customer‑facing advisory responsibilities.
The architect serves as a technical authority across GPU‑accelerated compute, high‑performance networking, and modern parallel storage platforms, influencing architectural standards and delivery outcomes while ensuring successful, on‑time, and on‑budget customer deployments without escalations.
This is a remote, work-from-home position with an average travel expectation of approximately 10%, and a willingness to travel more during peak project phases or critical customer engagements.
Key Responsibilities
- Lead the end‑to‑end architecture of GPU‑accelerated HPC and AI platforms, including greenfield AI factory designs and optimization of existing HPC environments.
- Architect integrated solutions spanning Compute, Networking, and Storage using NVIDIA HGX and DGX platforms, Grace CPU architectures, Spectrum‑X networking, and high‑performance parallel storage systems.
- Design storage architectures optimized for AI training, inference, and HPC workloads, balancing performance, scalability, resiliency, and cost.
- Define reference architectures, design patterns, and best practices for repeatable and supportable customer deployments.
- Provide hands‑on technical leadership during implementation phases, including cluster bring‑up, performance tuning, and workload optimization.
- Architect and integrate workload orchestration and scheduling platforms using NVIDIA Base Command Manager, Slurm, Kubernetes, and Run:AI.
- Optimize end‑to‑end data pipelines, including GPU utilization, storage throughput, metadata performance, and job scheduling efficiency.
- Troubleshoot performance bottlenecks across Compute, Networking, and Storage.
- Design and validate high‑performance storage solutions using modern parallel and scale‑out storage platforms.
- Demonstrate hands‑on experience with at least one of the following storage technologies: VAST Data, WEKA, DDN, Lustre, NetApp.
- Architect storage solutions that support demanding AI and HPC workloads, including high‑throughput training pipelines, checkpointing, and large‑scale shared datasets.
- Collaborate with compute and networking design to ensure balanced, bottleneck‑free architectures.
- Act as a senior technical authority for HPC and AI architecture across internal teams and customer engagements.
- Participate selectively in customer‑facing discussions to validate architecture and delivery plans, with a primary focus on design integrity and execution rather than pre‑sales.
- Influence platform standards, architectural direction, and technical decision‑making through expertise and demonstrated execution.
- Identify technical risks early across Compute, Networking, Storage, and orchestration layers, and drive mitigation strategies.
- Partner with the PMO counterpart to resolve risks and issues upon identification and to ensure production‑ready, supportable platforms.
- Ensure staff, contractors, and partners adhere to best practices and templates for AI solution delivery.
- Review deployment documents, technical assessments, and other outputs to ensure consistency and accuracy,…