Senior/Principal - Artificial Intelligence Infrastructure, NM/CA- Hybrid
Topeka, Shawnee County, Kansas, 66652, USA
Listed on 2026-02-12
IT/Tech
Systems Engineer, Cybersecurity, AI Engineer
About Sandia:
Sandia National Laboratories is the nation’s premier science and engineering lab for national security and technology innovation, with teams of specialists focused on cutting-edge work in a broad array of areas. Some of the main reasons we love our jobs:
- Challenging work with amazing impact that contributes to security, peace, and freedom worldwide
- Extraordinary co-workers
- Some of the best tools, equipment, and research facilities in the world
- Career advancement and enrichment opportunities
- Flexible work arrangements for many positions include 9/80 (work 80 hours every two weeks, with every other Friday off) and 4/10 (work 4 ten-hour days each week) compressed workweeks, part-time work, and telecommuting (a mix of onsite work and working from home)
- Generous vacation, strong medical and other benefits, competitive 401k, learning opportunities, relocation assistance and amenities aimed at creating a solid work/life balance*
World-changing technologies. Life-changing careers.
* These benefits vary by job classification.
What Your Job Will Be Like:
Sandia's artificial intelligence (AI) team is building the U.S. Department of Energy's (DOE) next-generation AI Platform, an integrated scientific AI capability that delivers rapid, high-impact solutions for national security, science, and applied energy missions. The Platform is based on three pillars: Models, Infrastructure, and Data. You will join the Infrastructure Pillar team to design, deploy, and operate the unified compute-and-data fabric that underpins all mission workflows from AI model training and simulation steering to real-time inference at experimental and production facilities.
We anticipate multiple hires for the Infrastructure Pillar that collectively span the set of responsibilities and skills described below. Likewise, new hires will be expected to work in conjunction with existing Sandia staff and teams from other DOE laboratories to deliver on this ambitious, fast-paced project. Importantly, we anticipate that while AI Platform development will leverage existing AI and data science tools extensively, success will also require considerable innovation and problem solving to address the unique needs of DOE applications.
If this sounds like an exciting challenge to you, we look forward to reading your application!
- Architect and implement the hybrid compute fabric
  - Integrate exascale HPC systems with elastic cloud resources and specialized AI accelerator clusters (on-prem and in-cloud)
  - Deploy ruggedized edge servers and digital-twin infrastructure for sub-millisecond inference and real-time physics simulations
- Develop infrastructure services and orchestration
  - Build federated Kubernetes clusters, container registry services, resource registry, and job scheduling abstractions
  - Implement self-configuring distributed clusters with intelligent network overlays, AI-driven traffic steering, and sensor-driven control loops
- Design secure networking and enclaves
  - Configure ESnet-backed, multi-tier WAN overlays with low-latency, geo-diverse routing, failover, and encryption protocols
  - Provide software-defined, dynamic security enclaves for CUI/Restricted Data with attested runtime and curated egress
- Enable observability, provenance, and monitoring
  - Deploy unified logging, metrics, dashboards, and trace analysis across cloud and on-prem environments using OpenTelemetry, Prometheus, ELK, or equivalent
  - Automate provenance capture for compute jobs, data movements, and AI workflows
- Support federated identity and access control
  - Integrate multiple identity providers, attribute-based access controls, and allocation models for risk-shared governance
- Manage enterprise licensing, token agreements, and software audits for AI and HPC frameworks
- Manage the full lifecycle of the AI platform's infrastructure, including capacity planning, upgrades, documentation, and performance monitoring
- Implement and enforce security best practices within container environments, including Role-Based Access Control (RBAC), secrets management, network policies, and vulnerability scanning
On any given day, you may be called upon to:
- Stand up a new GPU-accelerated cluster, configure Slurm/Kubernetes, and validate performance benchmarks
- Troubleshoot cross-site data transfers over ESnet and optimize WAN throughput for a petabyte-scale lakehouse
- Deploy a hardened enclave for a classified ML training job with differential-privacy egress controls
- Script an IaC workflow (Terraform/Ansible) to provision edge compute nodes
- Collaborate with the Models team to tune network and storage parameters for distributed training jobs
- Present real-time infrastructure status and forecasts to stakeholders
- Present prototype demos and research results to stakeholders across DOE, DoD, IC, and industry
The selected applicant may work a combination of onsite and offsite. The selected applicant must live within a reasonable distance for commuting to the assigned work…