
HPC Engineer - Storage

Job in Indiana Borough, Indiana County, Pennsylvania, 15705, USA
Listing for: World Wide Technology
Full Time position
Listed on 2026-03-12
Job specializations:
  • IT/Tech
    Systems Engineer, Data Engineer
Salary/Wage Range or Industry Benchmark: 100,000 – 125,000 USD yearly
Job Description & How to Apply Below

Role Title: HPC Engineer – Storage

Location: India (must align with client time zone)

Employment Type: Full-Time

About the Role

The HPC Engineer - Storage acts as the primary execution engine for the data persistence layer that feeds the AI Factory. While the Domain Architect designs the tiering strategy and namespace layout, you are the "Builder" responsible for mounting file systems, tuning client‑side I/O, and ensuring data reaches the GPUs at line rate. You are a "doer" who is as comfortable compiling a custom kernel module for a parallel file system as you are debugging a stale NFS handle.

As a System Integrator, we design and deliver bespoke, high‑scale AI factories. In this role, you will move beyond standard enterprise NAS management to execute the deployment of ultra‑high‑performance parallel file systems for NVIDIA SuperPOD, NVIDIA BasePOD, and Cisco AI Factory environments.

You will operate with a 100% focus on Delivery, executing the Low‑Level Designs (LLD) assigned by your Squad Lead. You will own the "Storage" leg of the critical "Compute-Network-Storage" triad, ensuring that "I/O Wait" never becomes the bottleneck for training jobs.

CRITICAL REQUIREMENT: This role typically operates on Shift Hours to align with the onshore client's time zone (e.g., early shifts for Australian clients, or split shifts for European clients).

Key Responsibilities
  • Storage Integration & Client Configuration
    • Client Provisioning: Execute the deployment of high-performance storage clients (VAST, Weka, GPFS/Spectrum Scale, Lustre) on bare‑metal DGX/HGX nodes using Ansible.

    • Protocol Configuration: Configure and tune RDMA-based protocols (NVMe-oF, NFS over RDMA, GPUDirect Storage) to bypass the CPU and deliver data directly to GPU memory.

    • Kubernetes Integration: Install and troubleshoot CSI (Container Storage Interface) drivers to ensure dynamic provisioning of Persistent Volumes (PVs) for AI workloads running in K8s.

    • Mount Management: Manage complex mount maps and automounter configurations to ensure consistent namespace views across thousands of compute nodes.
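The provisioning and mount-management duties above are the kind of thing typically driven from an Ansible playbook. A minimal sketch, assuming an NFS-over-RDMA export; the host group, server address, paths, and mount options are illustrative assumptions, not details from this posting:

```yaml
# Hypothetical playbook: mount an NFS-over-RDMA export on DGX compute nodes.
# Host group, server IP, export path, and mount options are illustrative only.
- hosts: dgx_nodes
  become: true
  tasks:
    - name: Ensure the mount point exists
      ansible.builtin.file:
        path: /mnt/scratch
        state: directory
        mode: "0755"

    - name: Mount the scratch filesystem over RDMA
      ansible.posix.mount:
        src: "198.51.100.10:/scratch"
        path: /mnt/scratch
        fstype: nfs
        opts: "rdma,port=20049,vers=4.2,nconnect=16"
        state: mounted
```

Because `state: mounted` also writes the entry to /etc/fstab, the same play keeps the namespace view consistent across reboots on every node in the group.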

  • Validation & Performance Benchmarking
    • Throughput Testing: Execute standard I/O benchmarks to validate that the storage subsystem meets the "Gold Standard" read/write targets (e.g., 400GB/s read throughput).

    • Latency Tuning: Tune client‑side kernel parameters (read‑ahead buffers, queue depths, sysctl settings) to minimize latency for small‑file random I/O patterns common in checkpointing.

    • Acceptance Reporting: Generate "As‑Built" storage validation reports, documenting effective throughput and IOPS for client sign‑off.
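Throughput validation of the kind described above is commonly driven with fio. A sketch of a large-block sequential read test; the mount path, file size, and worker counts are assumptions to be sized against the actual cluster:

```shell
# Hypothetical fio job: large-block sequential reads against the mounted
# parallel filesystem, aggregated across 16 workers with direct I/O so the
# page cache does not inflate the result. Paths and sizes are illustrative.
fio --name=seqread --directory=/mnt/scratch \
    --rw=read --bs=1M --size=32G --numjobs=16 \
    --ioengine=libaio --direct=1 --iodepth=32 \
    --group_reporting
```

The `--group_reporting` aggregate bandwidth line is what gets compared against the read target (e.g., 400 GB/s across the full node count) in the As‑Built report.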

  • Operations & Support
    • Capacity & Quotas: Implement project‑level quotas and monitor usage trends to prevent "Disk Full" outages on critical scratch file systems.

    • Ticket Resolution: Handle L2 support tickets for storage issues, such as "Stale file handles," "Slow dataset loading," or "CSI Driver crashes."

    • Lifecycle Management: Execute non‑disruptive client‑side driver upgrades and firmware patches during maintenance windows.
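On a Lustre scratch filesystem, the quota work above might look like the following; the project ID and limits are illustrative assumptions:

```shell
# Hypothetical: set a project quota on a Lustre scratch filesystem, leaving
# headroom between the soft and hard block limits. Values are illustrative.
lfs setquota -p 1001 -b 90T -B 100T /mnt/scratch

# Verify current usage against the new limits for that project.
lfs quota -p 1001 /mnt/scratch
```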

    Technical Competencies

    Essential Skills

    High‑Performance Storage
    • Parallel file systems: Hands‑on operational experience with at least one major AI storage platform: VAST Data, Weka.io, DDN Lustre (EXAScaler), or IBM GPFS (Spectrum Scale).

    • Linux I/O Stack: Deep understanding of the Linux VFS (Virtual File System), block devices, and how to debug I/O performance using tools like iostat, iotop, and strace.

    • RDMA Storage: Experience configuring NVMe-over-Fabrics (NVMe-oF) or NFS-over-RDMA, including an understanding of the dependency on the underlying InfiniBand/RoCE network.
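The I/O debugging workflow referenced above typically starts with the standard Linux tools; a sketch of first-pass usage (the PID is a placeholder):

```shell
# Extended per-device stats every 2 seconds: watch await and aqu-sz for
# queuing, and %util for saturation on the devices behind the mount.
iostat -x 2

# Trace only the I/O-related syscalls of a slow data-loader process
# (PID is illustrative), timing each call to spot stalls in read()/openat().
strace -f -T -e trace=read,openat,fstat -p 12345
```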

    Automation & Containerisation
    • Ansible Storage: Proficiency in writing Ansible playbooks to automate the installation of storage clients and configuration of mount points.

    • Kubernetes Storage: Understanding of StorageClasses, PVCs, and how to debug CSI driver pods (checking logs for mount failures).

    • GPUDirect: Conceptual understanding of NVIDIA GPUDirect Storage (GDS) and the ability to verify if GDS is active.
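Verifying that GDS is active, as the last bullet asks, usually comes down to the gdscheck utility shipped with the GDS package plus a kernel-module check; the install path varies by CUDA version, so treat the one below as an assumption:

```shell
# Hypothetical GDS verification: gdscheck reports platform support and the
# nvidia_fs driver version. Path varies by CUDA/GDS install.
/usr/local/cuda/gds/tools/gdscheck -p | grep -i "nvidia_fs"

# The nvidia-fs kernel module must be loaded for GDS I/O to engage.
lsmod | grep nvidia_fs
```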

    Desirable Experience
    • Vendor Specifics: Deep certification or experience with Pure Storage (FlashBlade) or NetApp ONTAP AI configurations.

    • Object Storage: Experience…
