Member of Technical Staff - Training Infrastructure Engineer
Listed on 2026-01-18
IT/Tech
Systems Engineer, Data Engineer
About Liquid AI
Spun out of MIT CSAIL, we build AI systems that run where others stall: on CPUs, with low latency, minimal memory, and maximum reliability. We partner with enterprises across consumer electronics, automotive, life sciences, and financial services. We are scaling rapidly and need exceptional people to help us get there.
The Opportunity
Our Training Infrastructure team is building the distributed systems that power our next-generation Liquid Foundation Models. As we scale, we need to design, implement, and optimize the infrastructure that enables large-scale training. This is a high-ownership role on a small team with fast feedback loops. We're looking for someone who wants to build critical systems from the ground up rather than inherit mature infrastructure.
While San Francisco and Boston are preferred, we are open to other locations.
What We're Looking For
We need someone who:
Loves distributed systems complexity: Our team debugs training failures across GPU clusters, optimizes communication patterns, and builds data pipelines that handle multimodal workloads.
Wants to build: We have strong researchers. We need builders who find satisfaction in robust, fast, reliable infrastructure.
Thrives in ambiguity: Our systems support model architectures that are still evolving. We make decisions with incomplete information and iterate fast.
Takes direction and delivers: Our best engineers align with team priorities while pushing back when they see problems.
What You'll Do
Design and implement scalable training infrastructure for our GPU clusters
Build data loading systems that eliminate I/O bottlenecks for multimodal datasets
Develop checkpointing mechanisms balancing memory constraints with recovery needs
Optimize communication patterns to minimize distributed training overhead
Create monitoring and debugging tools for training stability
Must-have:
Hands‑on experience building distributed training infrastructure (PyTorch Distributed, DeepSpeed, or Megatron‑LM)
Understanding of hardware accelerators and networking topologies
Experience optimizing data pipelines for ML workloads
Nice-to-have:
MoE (Mixture of Experts) training experience
Large‑scale distributed training (100+ GPUs)
Open‑source contributions to training infrastructure projects
What Success Looks Like
Training run stability has improved (fewer failures, faster recovery)
Data loading bottlenecks are eliminated for multimodal workloads
Time‑to‑recovery from training failures has decreased
What We Offer
Greenfield challenges: Build systems from scratch for novel architectures. High ownership from day one.
Compensation: Competitive base salary with equity in a unicorn‑stage company
Health: We pay 100% of medical, dental, and vision premiums for employees and dependents
Financial: 401(k) matching up to 4% of base pay
Time Off: Unlimited PTO plus company‑wide Refill Days throughout the year