
Member of Technical Staff - Training Infrastructure Engineer

Job in Boston, MA, USA
Listing for: Liquid AI
Full-time position
Listed on 2026-01-18
Job specializations:
  • IT/Tech
    Systems Engineer, Data Engineer
Job Description & How to Apply Below

About Liquid AI

Spun out of MIT CSAIL, we build AI systems that run where others stall: on CPUs, with low latency, minimal memory, and maximum reliability. We partner with enterprises across consumer electronics, automotive, life sciences, and financial services. We are scaling rapidly and need exceptional people to help us get there.

The Opportunity

Our Training Infrastructure team is building the distributed systems that power our next-generation Liquid Foundation Models. As we scale, we need to design, implement, and optimize the infrastructure that enables large-scale training. This is a high-ownership role on a small team with fast feedback loops. We're looking for someone who wants to build critical systems from the ground up rather than inherit mature infrastructure.

While San Francisco and Boston are preferred, we are open to other locations.

What We're Looking For

We need someone who:

  • Loves distributed systems complexity: Our team debugs training failures across GPU clusters, optimizes communication patterns, and builds data pipelines that handle multimodal workloads.

  • Wants to build: We have strong researchers. We need builders who find satisfaction in robust, fast, reliable infrastructure.

  • Thrives in ambiguity: Our systems support model architectures that are still evolving. We make decisions with incomplete information and iterate fast.

  • Takes direction and delivers: Our best engineers align with team priorities while pushing back when they see problems.

The Work
  • Design and implement scalable training infrastructure for our GPU clusters

  • Build data loading systems that eliminate I/O bottlenecks for multimodal datasets

  • Develop checkpointing mechanisms that balance memory constraints with recovery needs (a minimal sketch follows this list)

  • Optimize communication patterns to minimize distributed training overhead

  • Create monitoring and debugging tools for training stability
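
To make the scope concrete, here is a minimal sketch of the pattern behind the first and third bullets above: a PyTorch Distributed (DDP) training step with periodic rank‑0 checkpointing. Everything here is illustrative; the toy model, checkpoint interval, and filenames are invented stand‑ins, not a description of our actual stack.

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # Toy model as a stand-in for a real foundation model.
        model = DDP(torch.nn.Linear(1024, 1024).cuda(local_rank),
                    device_ids=[local_rank])
        opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

        for step in range(1_000):
            x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
            loss = model(x).pow(2).mean()
            opt.zero_grad(set_to_none=True)
            loss.backward()      # DDP overlaps gradient all-reduce with compute
            opt.step()

            # Periodic checkpoint: rank 0 persists model + optimizer state so a
            # failed run resumes from the last checkpoint instead of step 0.
            if step % 100 == 0:
                if dist.get_rank() == 0:
                    torch.save({"step": step,
                                "model": model.module.state_dict(),
                                "optim": opt.state_dict()},
                               f"ckpt_{step:06d}.pt")
                dist.barrier()   # keep all ranks in sync around the save

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

A launch like torchrun --nproc_per_node=8 train.py (filename hypothetical) starts one process per GPU; production systems layer sharded or asynchronous checkpointing and automatic resume on top of this skeleton.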

Desired Experience

Must-have:

  • Hands‑on experience building distributed training infrastructure (PyTorch Distributed, DeepSpeed, or Megatron‑LM)

  • Understanding of hardware accelerators and networking topologies

  • Experience optimizing data pipelines for ML workloads (see the illustrative snippet at the end of this section)

Nice-to-have:

  • MoE (Mixture of Experts) training experience

  • Large‑scale distributed training (100+ GPUs)

  • Open‑source contributions to training infrastructure projects
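
For a sense of the data‑pipeline work referenced in the must‑haves, the snippet below shows the standard PyTorch levers for keeping GPUs fed: parallel decode workers, pinned memory, and prefetching. The dataset and parameter values are made up for illustration.

    import torch
    from torch.utils.data import DataLoader, Dataset

    class ToyMultimodalDataset(Dataset):
        """Placeholder; a real pipeline decodes images/audio/text from storage."""
        def __len__(self):
            return 10_000
        def __getitem__(self, idx):
            return torch.randn(3, 224, 224), idx % 10

    loader = DataLoader(
        ToyMultimodalDataset(),
        batch_size=64,
        num_workers=8,            # decode in parallel worker processes
        pin_memory=True,          # page-locked host memory speeds up H2D copies
        prefetch_factor=4,        # each worker keeps 4 batches in flight
        persistent_workers=True,  # skip worker respawn between epochs
    )

    for images, labels in loader:
        # non_blocking=True overlaps the host-to-device copy with GPU compute
        # (assumes a CUDA device is available)
        images = images.cuda(non_blocking=True)
        break  # forward/backward step would go here

Which settings actually help depends on storage bandwidth and per‑sample decode cost, so profiling the input pipeline usually comes first.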

What Success Looks Like (Year One)
  • Training run stability has improved (fewer failures per run)

  • Data loading bottlenecks are eliminated for multimodal workloads

  • Time‑to‑recovery from training failures has decreased

What We Offer
  • Greenfield challenges: Build systems from scratch for novel architectures. High ownership from day one.

  • Compensation: Competitive base salary with equity in a unicorn‑stage company

  • Health: We pay 100% of medical, dental, and vision premiums for employees and dependents

  • Financial: 401(k) matching up to 4% of base pay

  • Time Off: Unlimited PTO plus company‑wide Refill Days throughout the year
