Member of Technical Staff - Training Infrastructure Engineer
Listed on 2026-01-18
IT/Tech
Systems Engineer, Data Engineer
About Liquid AI
Spun out of MIT CSAIL, we build AI systems that run where others stall: on CPUs, with low latency, minimal memory, and maximum reliability. We partner with enterprises across consumer electronics, automotive, life sciences, and financial services. We are scaling rapidly and need exceptional people to help us get there.
The Opportunity
Our Training Infrastructure team is building the distributed systems that power our next-generation Liquid Foundation Models. As we scale, we need to design, implement, and optimize the infrastructure that enables large-scale training. This is a high-ownership role on a small team with fast feedback loops. We're looking for someone who wants to build critical systems from the ground up rather than inherit mature infrastructure.
While San Francisco and Boston are preferred, we are open to other locations.
What We're Looking For
We need someone who:
Loves distributed systems complexity: Our team debugs training failures across GPU clusters, optimizes communication patterns, and builds data pipelines that handle multimodal workloads.
Wants to build: We have strong researchers. We need builders who find satisfaction in robust, fast, reliable infrastructure.
Thrives in ambiguity: Our systems support model architectures that are still evolving. We make decisions with incomplete information and iterate fast.
Takes direction and delivers: Our best engineers align with team priorities while pushing back when they see problems.
What You'll Do
Design and implement scalable training infrastructure for our GPU clusters
Build data loading systems that eliminate I/O bottlenecks for multimodal datasets
Develop checkpointing mechanisms balancing memory constraints with recovery needs
Optimize communication patterns to minimize distributed training overhead
Create monitoring and debugging tools for training stability
Must-have:
Hands‑on experience building distributed training infrastructure (PyTorch Distributed, DeepSpeed, or Megatron‑LM)
Understanding of hardware accelerators and networking topologies
Experience optimizing data pipelines for ML workloads
Nice-to-have:
MoE (Mixture of Experts) training experience
Large‑scale distributed training (100+ GPUs)
Open‑source contributions to training infrastructure projects
What Success Looks Like
Training run stability has improved (fewer failures, faster recovery)
Data loading bottlenecks are eliminated for multimodal workloads
Time‑to‑recovery from training failures has decreased
What We Offer
Greenfield challenges: Build systems from scratch for novel architectures. High ownership from day one.
Compensation: Competitive base salary with equity in a unicorn‑stage company
Health: We pay 100% of medical, dental, and vision premiums for employees and dependents
Financial: 401(k) matching up to 4% of base pay
Time Off: Unlimited PTO plus company‑wide Refill Days throughout the year