
Foundation Model DevOps Engineer

Job in Sunnyvale, Santa Clara County, California, 94087, USA
Listing for: Institute of Foundation Models
Full Time position
Listed on 2026-01-20
Job specializations:
  • IT/Tech
    Data Engineer, Cloud Computing
Job Description

About the Institute of Foundation Models

We are a research lab dedicated to building, understanding, using, and managing the risks of foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.

As part of our team, you’ll have the opportunity to work at the core of cutting‑edge foundation model training, alongside world‑class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You will participate in the development of groundbreaking AI solutions that have the potential to reshape entire industries. Your strategic and innovative problem‑solving skills will be instrumental in establishing MBZUAI as a global hub for high‑performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers.

The Role

We are seeking a Foundation Model DevOps Engineer focused on operational stability to serve as the backbone of our AI research infrastructure.

You will be designing the friction‑free environment that allows our models to be built. Your mandate is to build the tooling, release pipelines, and storage policies that remove drag on our research team. You will own the “foundational layer”, ensuring that our researchers have immediate, secure, and reliable access to the tools, data, and compute they need.

Key Responsibilities
  • Model Release Engineering
    • High‑Fidelity Release Management:
      You own the standard of our public presence. You ensure that every release (weights, code, training logs, data) is reproducible, meticulously documented, and packaged with the polish of a top‑tier open‑source product.
    • CI/CD for Research:
      Design and implement pipelines that automate the testing and packaging of complex model releases, moving us from manual handovers to automated verification (see the release‑verification sketch after this list).
    • Repo Administration:
      Administer the organization’s GitHub Enterprise account, ensuring branch protection and clean versioning practices are enforced across the lab.
  • Resource Management & Infrastructure Efficiency
    • Compute Governance:
      Manage the efficiency of our large‑scale GPU resources. You track utilization to identify idle nodes, “zombie” jobs, or inefficient scheduling, ensuring we extract maximum value from our compute clusters (see the utilization sketch after this list).
    • Storage Strategy & Hygiene:
      Manage the lifecycle of petabyte‑scale datasets and checkpoint storage. You implement intelligent aging policies to solve the “disk full” bottleneck without risking critical data loss (a retention sketch also follows this list).
    • Quota & Access Logic:
      Proactively manage storage and compute quotas across research teams to prevent resource contention before it blocks a training run.
  • Research Tooling & Orchestration
    • Experiment Management Systems:
      Build and maintain the internal CLI tools and dashboards that allow researchers to launch, track, and organize jobs across thousands of GPUs.
    • Resource Telemetry:
      Set up real‑time monitoring for interconnect throughput, GPU memory, and file system latency to catch performance degradation instantly.
    • Job Orchestration:
      Work closely with infrastructure teams to optimize how we run synthetic data pipelines and large‑scale evaluations, ensuring our tooling scales with our compute.
  • Research Environment Provisioning
    • Automated Workspace Setup:
      Build the scripts and tooling that instantly provision compute environments, permissions, and storage namespaces for researchers, automating away the manual work (a provisioning sketch follows this list).
    • Cluster Access Architecture:
      Streamline SSH and node access protocols to ensure friction‑free entry to our massive‑scale compute clusters while maintaining security boundaries.
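
To make the CI/CD responsibility concrete, here is a minimal sketch of the kind of automated release verification such a pipeline might run. The required file list and the manifest layout are illustrative assumptions, not the lab’s actual release standard.

```python
"""Illustrative release-bundle check; file names and manifest layout are assumptions."""
import hashlib
import json
import sys
from pathlib import Path

# Hypothetical set of artifacts every public release must ship with.
REQUIRED = ["weights.safetensors", "config.json", "MODEL_CARD.md", "training_log.jsonl"]

def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_release(bundle: Path) -> list[str]:
    """Return a list of problems; an empty list means the bundle passes."""
    problems = [f"missing: {name}" for name in REQUIRED if not (bundle / name).exists()]
    manifest_path = bundle / "manifest.json"  # hypothetical file -> sha256 map
    if not manifest_path.exists():
        return problems + ["missing: manifest.json"]
    for name, expected in json.loads(manifest_path.read_text()).items():
        f = bundle / name
        if f.exists() and sha256(f) != expected:
            problems.append(f"checksum mismatch: {name}")
    return problems

if __name__ == "__main__":
    issues = verify_release(Path(sys.argv[1]))
    for issue in issues:
        print("FAIL:", issue)
    sys.exit(1 if issues else 0)
```

A check like this would typically run as a CI gate, so a release branch cannot be tagged until the bundle verifies cleanly.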
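For the compute‑governance bullet, a single‑node utilization probe might look like the following. It assumes nvidia-smi is on PATH, and the idle thresholds are arbitrary; a production system would feed this data into the scheduler’s accounting rather than print it.

```python
"""Flag idle GPUs on one node; thresholds and the notion of "idle" are assumptions."""
import subprocess

IDLE_UTIL_PCT = 5    # below this, treat the GPU as idle (arbitrary threshold)
IDLE_MEM_MIB = 1024  # memory still held while idle suggests a "zombie" job

def probe_gpus() -> list[dict]:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    gpus = []
    for line in out.strip().splitlines():
        idx, util, mem = (field.strip() for field in line.split(","))
        gpus.append({"index": int(idx), "util_pct": int(util), "mem_mib": int(mem)})
    return gpus

if __name__ == "__main__":
    for gpu in probe_gpus():
        if gpu["util_pct"] < IDLE_UTIL_PCT:
            kind = "possible zombie (memory held)" if gpu["mem_mib"] > IDLE_MEM_MIB else "idle"
            print(f"GPU {gpu['index']}: {kind} ({gpu['util_pct']}% util, {gpu['mem_mib']} MiB)")
```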
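The storage‑hygiene bullet implies a retention policy over checkpoint directories. The sketch below shows one common pattern (protect recent checkpoints, keep every Nth older one as a recovery point) under an assumed step‑numbered naming convention; it only prints candidates rather than deleting anything.

```python
"""Dry-run checkpoint retention; directory layout and step names are assumptions."""
import re
import time
from pathlib import Path

KEEP_RECENT_DAYS = 7  # never touch checkpoints younger than this (arbitrary)
KEEP_EVERY_NTH = 10   # of older checkpoints, keep every 10th by step (arbitrary)
STEP_RE = re.compile(r"step_(\d+)$")  # hypothetical naming: .../step_12000

def archive_candidates(root: Path) -> list[Path]:
    cutoff = time.time() - KEEP_RECENT_DAYS * 86400
    old = []
    for ckpt in root.iterdir():
        m = STEP_RE.search(ckpt.name)
        if m and ckpt.stat().st_mtime < cutoff:
            old.append((int(m.group(1)), ckpt))
    # Keep every Nth old checkpoint as a recovery point; the rest may be archived.
    return [path for i, (_, path) in enumerate(sorted(old)) if i % KEEP_EVERY_NTH != 0]

if __name__ == "__main__":
    for path in archive_candidates(Path("/data/checkpoints/run_42")):  # hypothetical path
        print("candidate for cold storage:", path)
```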
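Finally, the workspace‑setup bullet might be automated along these lines. The directory root, group name, and quota value are placeholders, and the quota call is shown as a Lustre‑style lfs setquota invocation, which would differ on other filesystems; the function defaults to a dry run that only prints the commands.

```python
"""Provision a research workspace; paths, group, and quota values are illustrative."""
import subprocess
from pathlib import Path

SCRATCH_ROOT = Path("/scratch/users")  # hypothetical storage namespace root
DEFAULT_QUOTA = "2t"                   # hypothetical per-user scratch hard limit

def provision(user: str, dry_run: bool = True) -> None:
    workspace = SCRATCH_ROOT / user
    cmds = [
        ["mkdir", "-p", str(workspace)],
        ["chown", f"{user}:research", str(workspace)],  # 'research' group is assumed
        ["chmod", "750", str(workspace)],               # owner + group access only
        # Lustre-style quota call; the right command depends on the filesystem.
        ["lfs", "setquota", "-u", user, "-B", DEFAULT_QUOTA, str(SCRATCH_ROOT)],
    ]
    for cmd in cmds:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)

if __name__ == "__main__":
    provision("new_researcher")  # dry run: prints the commands it would execute
```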
Academic Qualifications

A bachelor’s degree in Computer Science, Information Technology, or a related field, or equivalent practical experience.

Professional Experience - Minimum (The Bar)
  • 3+ years of experience in DevOps, Release Engineering, or MLE, specifically within AI/ML or HPC environments.
  • Foundation Model Fluency:
    You understand the lifecycle of training large models (LLMs or Diffusion). You know what a checkpoint is, understand the difference between pre‑training and inference, and are familiar with the artifacts…