
Sr. Site Reliability Engineer (SRE)

Job in Chicago, Cook County, Illinois, 60290, USA
Listing for: Moonlite
Full Time position
Listed on 2026-01-12
Job specializations:
  • IT/Tech
    Systems Engineer, Network Engineer, Cloud Computing, SRE/Site Reliability
Job Description & How to Apply Below
Position: Sr. Site Reliability Engineer (SRE)

Moonlite delivers high-performance AI infrastructure for organizations running intensive computational research, large-scale model training, and demanding data processing workloads. We provide infrastructure deployed in our facilities or co-located in yours, delivering flexible on‑demand or reserved compute that feels like an extension of your existing data center. Our team of AI infrastructure specialists blends bare‑metal performance with cloud‑native operational simplicity, enabling research teams and enterprises to deploy demanding AI workloads with enterprise‑grade reliability and compliance.

Your Role

You will be instrumental in building and operating production‑grade AI infrastructure with deep Kubernetes expertise at its core. Working closely with our systems engineers, network engineers, and platform engineering team, you’ll architect and operate the Kubernetes infrastructure that powers our control plane and orchestrates compute, storage, and networking. This role requires a deep understanding of Kubernetes internals, custom resource definitions (CRDs), and storage and network integrations, as well as experience building production‑grade clusters from the ground up (not just deploying in managed environments).

You'll ensure enterprise‑grade reliability while establishing automation, observability, and operational practices.

Job Responsibilities
  • Kubernetes Infrastructure Engineering: Design, build, and operate production Kubernetes clusters on bare‑metal infrastructure – including cluster bootstrapping, control plane architecture, etcd management, and scaling strategies for high‑performance compute workloads.
  • Kubernetes Networking & CNIs: Implement and operate custom Kubernetes networking solutions with SR‑IOV for high‑performance GPU interconnects, multi‑tenancy isolation and advanced networking policies. Configure CNI plugins and network segmentation for research workloads.
  • Custom Operators & Controllers: Develop and maintain custom Kubernetes operators and controllers for bare‑metal provisioning, infrastructure lifecycle management, and resource orchestration across compute, storage, and networking domains.
  • GPU Infrastructure Integration: Deploy and optimize NVIDIA GPU operators, device plugins, and other custom scheduling logic for GPU workload placement and utilization optimization.
  • Platform Integration & Storage: Build deep integrations between Kubernetes and underlying infrastructure including CSI drivers for storage, custom admission controllers for policy enforcement, and scheduling extensions for specialized hardware placement.
  • Infrastructure Automation: Design and implement automation using Terraform, Ansible, Helm, and custom operators to orchestrate infrastructure workflows and enable deployments across multiple regions.
  • Production Operations & Reliability: Manage production bare‑metal infrastructure across multiple regions. Build systems ensuring high availability, fault tolerance, and graceful degradation – establishing SLIs, SLOs, and monitoring to meet enterprise reliability commitments.
  • Observability & Incident Response: Build comprehensive monitoring, logging, and alerting using Prometheus, Grafana, and ELK stack. Lead incident response, conduct postmortems, and implement preventative measures to improve reliability and reduce MTTR.
  • Performance & Capacity Planning: Identify and resolve performance bottlenecks across infrastructure domains. Monitor utilization trends, forecast capacity needs, and optimize resource allocation for various workloads.
Requirements
  • Experience: 5+ years in SRE, DevOps, or infrastructure engineering roles with proven experience operating production infrastructure at scale.
  • Kubernetes Infrastructure Expertise: Deep hands‑on experience building and operating production Kubernetes clusters on bare‑metal infrastructure – not just deploying workloads in managed clusters. Must understand cluster bootstrapping, control plane architecture, etcd operations, and scaling strategies.
  • Kubernetes Internals & Integration: Strong understanding of Kubernetes internals including custom resource definitions (CRDs), operators, controllers, admission webhooks, and scheduling. Experience…