AI Infrastructure Engineer
Listed on 2026-02-28
IT/Tech
AI Engineer, Systems Engineer
Overview
Are you an experienced GPU systems engineer passionate about building and maintaining high‑performance AI infrastructure? Join our Global Fortune 500 Tech Client, a world‑renowned technology leader, at their Morrisville, NC campus and play a critical role in supporting advanced machine learning and large‑scale AI workloads.
We’re seeking a skilled AI Infrastructure Engineer who thrives at the intersection of hardware, Linux systems, and ML operations. In this role, you’ll ensure GPU servers and AI clusters run at peak performance—supporting cutting‑edge AI research and enterprise‑scale LLM training environments.
Responsibilities
- Monitor, maintain, and optimize high‑performance GPU servers and workstations.
- Diagnose and resolve hardware issues (GPU faults, cooling/power issues, component failures).
- Coordinate hardware repairs, upgrades, and replacements to maintain system uptime.
- Software & Driver Administration: Install, configure, and update Linux OS (Ubuntu/CentOS), NVIDIA CUDA drivers, and supporting software; ensure compatibility between hardware, drivers, and ML frameworks.
- Performance Benchmarking & Optimization: Execute and analyze MLPerf or similar benchmarking suites; identify bottlenecks and tune configurations for ML training performance.
- Monitoring & Reliability: Continuously monitor system performance and stability; investigate kernel errors, networking issues, and resource contention within training clusters; implement corrective actions to prevent downtime and performance degradation.
- Cluster Administration: Manage logging, backups, firmware updates, system patching, and cluster health.
Qualifications
- 3+ years managing GPU‑accelerated servers or HPC environments.
- Hands‑on experience with NVIDIA GPU hardware (A100, H100, etc.) and CUDA toolkit/drivers.
- Familiarity with ML frameworks: TensorFlow, PyTorch, or Hugging Face.
- Skilled in diagnostic tools: nvidia-smi, dmesg, top/htop, Prometheus/Grafana.
- Solid understanding of AI infrastructure, containerization, and distributed training concepts.
- Excellent problem‑solving abilities and proactive ownership of system reliability.
- Experience with cluster orchestration: Slurm, Kubernetes, Ray.
- Knowledge of server hardware diagnostics (IPMI, BIOS configs, RAID arrays).
- Background in MLOps or DevOps for AI environments.
- Certifications such as RHCE, NVIDIA credentials, or similar.
- Ability to work independently in a fast‑paced, highly technical environment.
- Direct contribution to large‑scale AI and LLM infrastructure.
- Work on a state‑of‑the‑art enterprise campus with cutting‑edge GPU hardware.
- Join an engineering team driving next‑generation machine learning innovation.
We’re very excited about this incredible opportunity to lead in the AI space with a global leader in computing.
APPLY NOW!