Senior Site Reliability Engineer Job, Singapore, IT/Tech

Senior Site Reliability Engineer – AI & GPU Infrastructure (APAC)

Overview

Our client is a stealth-mode hyperscale data center company building a next-generation AI and cloud platform powered by thousands of NVIDIA GPUs. The platform is designed to support frontier AI workloads, including large-scale model training, experimentation, and high-throughput inference.

This role represents the first Site Reliability Engineering hire in APAC, with significant responsibility for reliability, performance, and operational excellence across a large-scale GPU environment. The successful candidate will play a critical role in ensuring the stability and scalability of one of the most advanced private AI infrastructure platforms in production.

Key Responsibilities

Design, deploy, and operate hyperscale GPU clusters optimized for AI training and inference workloads.
Own Kubernetes and Slurm-based orchestration for GPU workloads, including scheduling efficiency, capacity planning, and fault tolerance.
Build automation-driven systems for provisioning, scaling, and managing GPU infrastructure across thousands of nodes.
Develop and maintain observability, alerting, and auto-remediation frameworks to support high availability and performance.
Collaborate closely with ML, platform, and networking teams to optimize GPU utilization, throughput, and data movement.
Implement and enforce Infrastructure as Code, CI/CD pipelines, and operational reliability standards.
Diagnose complex performance and reliability issues across compute, networking, and storage layers.
Act as a regional point of ownership, providing clear communication and operational leadership during incidents and reviews.

This is a senior, first-in-region role with a high bar for ownership, reliability, and execution.

Demonstrated ability to operate independently in high-impact environments.
Clear, concise communicator, particularly during incidents or critical operational events.
Strong sense of accountability and pride in system reliability and operational quality.
Proactive in identifying risks and driving continuous improvement.

Required Experience

7+ years of experience in SRE, infrastructure, or platform engineering roles supporting large-scale compute environments.
Deep hands-on expertise with Kubernetes in production, particularly for GPU-backed or high-performance workloads.
Strong experience with Slurm or comparable workload schedulers.
Proven experience designing or operating GPU infrastructure at scale.
Strong proficiency with Infrastructure as Code tools such as Terraform or Pulumi.
Programming experience in Python, Go, or Bash for automation and tooling.
Experience with observability platforms and incident response (Prometheus, Grafana, Loki, etc.).
Demonstrated interest or passion for AI, ML systems, or GPU-centric infrastructure.

Benefits

Competitive compensation with equity participation.
Remote working options.
Opportunity to operate and scale cutting-edge AI infrastructure in a high-impact role.

If interested, please apply or reach out to mitc
