
Senior Site Reliability Engineer

Remote / Online - Candidates ideally in Singapore
Listing for: Hamilton Barnes
Remote/Work from Home position
Listed on 2026-01-16
Job specializations:
  • IT/Tech
    Systems Engineer, Cloud Computing
Salary/Wage Range or Industry Benchmark: 80,000 - 100,000 SGD yearly
Job Description & How to Apply Below

Senior Site Reliability Engineer – AI & GPU Infrastructure (APAC)

Overview

Our client is a stealth-mode hyperscale data center company building a next-generation AI and cloud platform powered by thousands of NVIDIA GPUs. The platform is designed to support frontier AI workloads, including large-scale model training, experimentation, and high-throughput inference.

This role represents the first Site Reliability Engineering hire in APAC, with significant responsibility for reliability, performance, and operational excellence across a large-scale GPU environment. The successful candidate will play a critical role in ensuring the stability and scalability of one of the most advanced private AI infrastructure platforms in production.

Key Responsibilities

  • Design, deploy, and operate hyperscale GPU clusters optimized for AI training and inference workloads.
  • Own Kubernetes and Slurm-based orchestration for GPU workloads, including scheduling efficiency, capacity planning, and fault tolerance.
  • Build automation-driven systems for provisioning, scaling, and managing GPU infrastructure across thousands of nodes.
  • Develop and maintain observability, alerting, and auto-remediation frameworks to support high availability and performance.
  • Collaborate closely with ML, platform, and networking teams to optimize GPU utilization, throughput, and data movement.
  • Implement and enforce Infrastructure as Code, CI/CD pipelines, and operational reliability standards.
  • Diagnose complex performance and reliability issues across compute, networking, and storage layers.
  • Act as a regional point of ownership, providing clear communication and operational leadership during incidents and reviews.

This is a senior, first-in-region role with a high bar for ownership, reliability, and execution.

  • Demonstrated ability to operate independently in high-impact environments.
  • Clear, concise communicator, particularly during incidents or critical operational events.
  • Strong sense of accountability and pride in system reliability and operational quality.
  • Proactive in identifying risks and driving continuous improvement.

Required Experience

  • 7+ years of experience in SRE, infrastructure, or platform engineering roles supporting large-scale compute environments.
  • Deep hands-on expertise with Kubernetes in production, particularly for GPU-backed or high-performance workloads.
  • Strong experience with Slurm or comparable workload schedulers.
  • Proven experience designing or operating GPU infrastructure at scale.
  • Strong proficiency with Infrastructure as Code tools such as Terraform or Pulumi.
  • Programming experience in Python, Go, or Bash for automation and tooling.
  • Experience with observability platforms and incident response (Prometheus, Grafana, Loki, etc.).
  • Demonstrated interest or passion for AI, ML systems, or GPU-centric infrastructure.

Compensation and Benefits

  • Competitive compensation with equity participation.
  • Remote working options.
  • Opportunity to operate and scale cutting-edge AI infrastructure in a high-impact role.

If interested, please apply or reach out to mitc

Position Requirements
10+ Years work experience