HPC Engineer - Network
Listed on 2026-03-12
IT/Tech
Systems Engineer, Network Engineer
Role Title: HPC Engineer – Network
Location: India (Must align with Client Time Zone)
Employment Type: Full-Time
About the Role
The HPC Engineer - Network acts as the primary execution engine for the connectivity that binds the AI Factory together. While the Domain Architect designs the topology and the Senior Engineer directs the squad, you are the "Builder" responsible for configuring, stabilising, and tuning the high-speed interconnects. You are a "doer" who is as comfortable debugging a flapping InfiniBand link via CLI as you are pushing a configuration update to fifty switches using Ansible.
As a System Integrator, we design and deliver bespoke, high-scale AI factories. In this role, you will move beyond standard enterprise networking (campus/Wi-Fi) to execute the deployment of lossless, high-bandwidth fabrics for NVIDIA SuperPOD, NVIDIA BasePOD, and Cisco AI Factory environments.
You will operate with a 100% focus on Delivery, executing the Low-Level Designs (LLD) assigned by your Squad Lead. You will own the "Network" in the critical "Compute-Network-Storage" triad, ensuring that the GPU compute nodes can communicate at line rate without congestion.
CRITICAL REQUIREMENT: This role typically operates on Shift Hours to align with the onshore client's time zone (e.g., early shifts for Australian clients, or split shifts for European clients).
Key Responsibilities
- Switch Configuration: Execute the configuration of high-performance switches (NVIDIA Quantum InfiniBand, NVIDIA Spectrum-X Ethernet, Cisco Nexus) using defined templates and automation.
- NetDevOps Execution: Run and maintain Ansible playbooks to push configurations, update firmware (Cumulus/NX-OS), and enforce compliance across the fabric.
- Subnet Management: Configure and tune NVIDIA Unified Fabric Manager (UFM) to ensure optimal routing and fault tolerance.
- Host Networking: Assist the Compute team in configuring host-side adapters (ConnectX SuperNICs, BlueField DPUs) to ensure correct IP addressing, MTU, and driver parameters.
- Link Validation: Verify physical connectivity and link health using tools like ibstat, ibdiagnet, and ethtool to identify faulty cables or transceivers immediately after installation by Field Engineers.
- Performance Testing: Execute network-specific benchmarks (e.g., ib_write_bw, ib_send_bw, nccl-tests) to validate that the fabric is delivering full bisection bandwidth and low latency.
- Congestion Control: Implement and tune Quality of Service (QoS) settings, including Priority Flow Control (PFC), Explicit Congestion Notification (ECN), and Data Center Quantized Congestion Notification (DCQCN) to prevent packet loss and optimize throughput in RoCEv2 environments.
- Fabric Telemetry: Configure monitoring agents (e.g., Prometheus node exporters, UFM Telemetry) to visualize traffic flows and detect "tail latency" issues.
- Ticket Resolution: Handle L2 support tickets for network issues, such as "Node isolation," "Slow All-Reduce operations," or "Fabric flapping."
- Lifecycle Management: Execute firmware upgrades on switches and DPUs during maintenance windows, ensuring strict compatibility with the NVIDIA OFED stack.
- InfiniBand Mastery: Deep operational knowledge of NVIDIA Quantum InfiniBand switches, cable types (NDR/HDR), and troubleshooting commands.
- AI Ethernet (RoCEv2): Solid understanding of RDMA over Converged Ethernet (RoCEv2), including the configuration of PFC and ECN on switches (Spectrum/Arista/Cisco/Juniper).
- Fabric Management: Experience with NVIDIA UFM (Unified Fabric Manager) for managing large-scale fabrics.
- NetDevOps: Proficiency in Ansible for network automation (e.g., ansible-networking collections).
- Linux Networking: Comfortable navigating Linux CLI to troubleshoot host-side networking (ip link, tcpdump, sysctl tuning).
- Protocol Knowledge: Practical implementation skills in BGP, EVPN, and VXLAN for multi-tenant AI clouds.
- Cisco AI Integration: Experience with Cisco Nexus Dashboard or Cisco 8000 series in…