ML Systems Engineer, Infrastructure & Cloud Job, Cambridge area, Massachusetts, USA, IT/Tech

ML Systems Engineer, Infrastructure & Cloud

Basis Research Institute


About Basis

Basis is a nonprofit applied AI research organization with two mutually reinforcing goals. The first is to understand and build intelligence, establishing the mathematical principles of reasoning, learning, decision-making, understanding, and explanation, and constructing software that implements these principles. The second is to advance society’s ability to solve intractable problems, expanding the scale, complexity, and breadth of problems we can solve today and accelerating our ability to solve problems in the future.

To achieve these goals, we’re building a new technological foundation inspired by human reasoning and a new collaborative organization that puts human values first.

About The Role

ML Systems Engineers at Basis ensure training and evaluation infrastructure is fast, reliable, and scalable. You will own the full stack from distributed training frameworks through cloud administration, enabling researchers to iterate quickly on complex models while managing computational resources efficiently.

We are looking for engineers who combine deep understanding of ML systems with operational excellence. The ideal ML Systems Engineer has experience with distributed training at scale, understands the intricacies of debugging numerical instabilities, and can manage cloud infrastructure that scales from experiments to production.

We Expect You To

Have demonstrated expertise in ML systems engineering (e.g., managing distributed training jobs across hundreds of GPUs, debugging large-scale numerical instabilities, building reproducible ML experiment infrastructure, optimizing training throughput).
Possess deep knowledge of distributed training frameworks including PyTorch/JAX distributed strategies (DDP, FSDP, ZeRO), gradient accumulation, mixed precision training, and checkpoint/recovery systems.
Have strong cloud administration skills including AWS/GCP/Azure services, infrastructure as code (Terraform), Kubernetes orchestration, cost optimization, security best practices, and compliance requirements.
Understand the full ML stack from hardware (GPUs, interconnects, storage) through frameworks (PyTorch, JAX) to higher-level training loops and evaluation pipelines.
Be skilled at debugging complex stack failures (GPU/NCCL issues, data loading bottlenecks, memory leaks, gradient explosions, convergence problems).
Value documentation and knowledge sharing, maintaining comprehensive logs of issues, solutions, and lessons learned.
Progress with autonomy while coordinating closely with researchers, anticipating infrastructure needs, preventing problems before they occur, and responding quickly when issues arise.

Additional Advantage

Experience at organizations training large models (OpenAI, Anthropic, Google, Meta).
Background in both ML research and production systems.
Contributions to ML frameworks or distributed training libraries.
Experience with on-premises GPU cluster management.
Knowledge of optimization theory and numerical methods.
Understanding of robotics-specific infrastructure requirements.

Responsibilities

Own distributed training infrastructure including job launchers, checkpointing systems, recovery mechanisms, and monitoring to ensure experiments run reliably at scale.
Debug and resolve training failures across GPUs, networking, numerics, and data pipelines, maintaining detailed logs of problems and solutions.
Profile and optimize training performance by identifying bottlenecks in data loading, gradient computation, and communication overhead, and by implementing solutions that improve step time.
Manage cloud infrastructure and costs, including capacity planning, spot instance strategies, and storage optimization, and build tools that give researchers visibility into resource usage.
Implement security and compliance measures, including access controls, data encryption, and audit logging, and ensure infrastructure meets requirements for handling sensitive data.
Build evaluation and benchmarking infrastructure that enables consistent,…

