AI Infrastructure Deployment Lead
Listed on 2025-12-02
IT/Tech
Systems Engineer, Network Engineer, IT Project Manager, Cloud Computing
Lambda, The Superintelligence Cloud, builds Gigawatt-scale AI Factories for Training and Inference. Lambda’s mission is to make compute as ubiquitous as electricity and give every person access to artificial intelligence. One person, one GPU.
If you'd like to build the world's best deep learning cloud, join us.
As the AI Infrastructure Deployment Lead, you’ll be responsible for planning, coordinating, and executing the deployment of large-scale AI infrastructure across Lambda’s data centers and customer sites. You’ll lead cross-functional technical teams to design resilient network topologies, oversee rack-level integration, and ensure smooth delivery of compute environments optimized for large-scale training workloads.
This role combines hands-on technical expertise with strategic project leadership — ideal for engineers who thrive at the intersection of hardware, networking, and systems design.
What You’ll Do
- Infrastructure Deployment
- Lead end-to-end deployment of GPU clusters, storage systems, and networking fabric across Lambda’s data centers.
- Design and implement data center network topologies optimized for AI and HPC workloads, including high-speed Ethernet and InfiniBand environments.
- Oversee rack implementation, cabling, and power/cooling validation for optimal efficiency and scalability.
- Collaborate with supply chain, logistics, and operations teams to ensure smooth delivery and installation timelines.
- Network Engineering
- Implement Layer 2/Layer 3 networks, including VLANs, spine-leaf architectures, and InfiniBand interconnects.
- Partner with network architects to ensure redundancy, scalability, and low-latency interconnects for distributed AI workloads.
- Monitor network health, identify bottlenecks, and implement optimizations to maintain peak performance.
- Hardware & Systems Management
- Oversee server hardware troubleshooting, including GPUs, NICs, CPUs, and storage components.
- Lead root-cause analysis for system issues and drive corrective actions in collaboration with vendors and internal hardware teams.
- Develop standard operating procedures (SOPs) for hardware validation, deployment, and maintenance.
- Technical Project Leadership
- Serve as technical project lead for infrastructure rollouts and cluster expansion projects.
- Coordinate cross-functional teams — networking, facilities, cloud operations, and hardware engineering — to execute deployments on schedule.
- Manage project scope, budgets, risk assessments, and post-deployment reviews.
- Communicate status, challenges, and milestones to leadership with clarity and precision.
- Documentation & Continuous Improvement
- Maintain detailed network topology diagrams, deployment runbooks, and hardware inventories.
- Identify opportunities for process automation and infrastructure standardization across deployments.
- Contribute to Lambda’s internal knowledge base and mentor junior engineers on data center best practices.
What You’ll Bring
- Bachelor’s degree in Computer Engineering, Information Technology, or related field.
- CCNA (Cisco Certified Network Associate) certification (CCNP or equivalent a plus).
- PMP (Project Management Professional) certification (PMP or equivalent a plus).
- 5+ years of experience in data center infrastructure deployment or network operations, preferably in AI, HPC, or cloud environments.
- Proven ability to lead complex technical projects and manage multidisciplinary teams.
- Strong understanding of data center network design (Layer 2/3, VLANs, rack elevations, port mapping, InfiniBand technologies).
- Hands‑on expertise in server hardware troubleshooting and rack‑level integration.
- Ability and willingness to travel 50-70% to our data center sites.
- Experience deploying or managing GPU clusters and distributed training environments.
- Familiarity with automation and orchestration tools (Ansible, Terraform) and monitoring systems (Prometheus, Grafana).
- Knowledge of structured cabling, power distribution, and…