Senior Site Reliability Engineer
Singapore
Listed on 2026-01-16
IT/Tech
Systems Engineer, Cloud Computing
Senior Site Reliability Engineer – AI & GPU Infrastructure (APAC)
Overview
Our client is a stealth-mode hyperscale data center company building a next-generation AI and cloud platform powered by thousands of NVIDIA GPUs. The platform is designed to support frontier AI workloads, including large-scale model training, experimentation, and high-throughput inference.
This role represents the first Site Reliability Engineering hire in APAC, with significant responsibility for reliability, performance, and operational excellence across a large-scale GPU environment. The successful candidate will play a critical role in ensuring the stability and scalability of one of the most advanced private AI infrastructure platforms in production.
Key Responsibilities
- Design, deploy, and operate hyperscale GPU clusters optimized for AI training and inference workloads.
- Own Kubernetes and Slurm-based orchestration for GPU workloads, including scheduling efficiency, capacity planning, and fault tolerance.
- Build automation-driven systems for provisioning, scaling, and managing GPU infrastructure across thousands of nodes.
- Develop and maintain observability, alerting, and auto-remediation frameworks to support high availability and performance.
- Collaborate closely with ML, platform, and networking teams to optimize GPU utilization, throughput, and data movement.
- Implement and enforce Infrastructure as Code, CI/CD pipelines, and operational reliability standards.
- Diagnose complex performance and reliability issues across compute, networking, and storage layers.
- Act as a regional point of ownership, providing clear communication and operational leadership during incidents and reviews.
This is a senior, first-in-region role with a high bar for ownership, reliability, and execution.
- Demonstrated ability to operate independently in high-impact environments.
- Clear, concise communicator, particularly during incidents or critical operational events.
- Strong sense of accountability and pride in system reliability and operational quality.
- Proactive in identifying risks and driving continuous improvement.
Required Experience
- 7+ years of experience in SRE, infrastructure, or platform engineering roles supporting large-scale compute environments.
- Deep hands-on expertise with Kubernetes in production, particularly for GPU-backed or high-performance workloads.
- Strong experience with Slurm or comparable workload schedulers.
- Proven experience designing or operating GPU infrastructure at scale.
- Strong proficiency with Infrastructure as Code tools such as Terraform or Pulumi.
- Programming experience in Python, Go, or Bash for automation and tooling.
- Experience with observability platforms and incident response (Prometheus, Grafana, Loki, etc.).
- Demonstrated interest in or passion for AI, ML systems, or GPU-centric infrastructure.
- Competitive compensation with equity participation.
- Remote working options.
- Opportunity to operate and scale cutting-edge AI infrastructure in a high-impact role.
If interested, please apply or reach out to mitc