Sr Kubernetes Support Engineer
Listed on 2026-02-28
IT/Tech
Systems Engineer, Cloud Computing, Data Engineer
At Applied Digital, we are the epicenter of AI innovation, crafting cutting-edge data center solutions tailored for the demands of high-performance computing. Designed from the ground up to support AI and machine learning workloads, our infrastructure is the backbone of tomorrow’s technological advancements, including AI-driven video and generative platforms.
We are:
- Forward-Thinkers: With a keen eye on current market trends and future innovations, we adapt swiftly and lead technological evolution.
- Resilient: We navigate complex challenges and emerge stronger, delivering robust and reliable solutions for industry pioneers.
- Innovative Designers: Leveraging the latest technologies, we create visionary solutions that redefine industry standards.
At Applied Digital, we are committed to solving intricate problems, advancing business initiatives, maximizing operational efficiency, and reducing our carbon footprint. We are a team of resilient, forward-thinking innovators driving the AI revolution.
Position Summary:
Applied Digital is seeking an experienced Sr. Kubernetes Support Engineer to help manage our deployed Kubernetes (K8s) systems, both internal and external. This role will help us support, design, and maintain the complex systems that live on our cloud platforms. It sits at the center of our product, helping to develop our entire resource provisioning lifecycle, from a single API request to the scheduling and spin-up of multiple resources.
You will be the primary point of contact for our customers using Kubernetes and will also take on an architect role for our core provisioning logic, creating a robust system that intelligently orchestrates Kubernetes clusters, MicroVMs, and Slurm-managed HPC resources. You will work closely with our front-end team to build the resources that expose this power to our users, and with the infrastructure team to ensure the backend is scalable, resilient, and efficient.
The ideal candidate is a strong systems-level thinker who is passionate about automation, distributed systems, and building powerful HPC clusters that are easily adaptable to a customer’s design requirements.
Key Responsibilities:
- Design & Develop Provisioning Services: Architect and write high-quality, scalable backend services (e.g., in Go, Python, or Rust) that handle the logic for provisioning and managing compute and storage resources.
- Kubernetes Design and Integration: Develop controllers and operators to automate the deployment and lifecycle management of containerized workloads and services on multiple Kubernetes clusters.
- Slurm Orchestration: Build the "bridge" between our cloud-native API and our HPC backend, writing the logic to dynamically generate Slurm batch scripts, submit jobs, and monitor their state.
- MicroVM Management: Implement provisioning workflows for lightweight MicroVMs (using technologies like Firecracker, KubeVirt, or Kata Containers) to ensure fast boot times and secure workload isolation.
- Storage Provisioning: Write the automation to dynamically provision, attach, and manage various storage solutions (e.g., block storage, shared file systems) for provisioned workloads.
- Observability & Monitoring: Implement comprehensive monitoring, logging, and tracing (using tools like Prometheus, Grafana, Loki) to ensure the health and performance of all systems.
- Infrastructure as Code (IaC): Use tools like Terraform, Ansible, and Git to track and manage code versions for the Kubernetes cluster and related infrastructure.
Basic Qualifications:
- 10+ years of professional Kubernetes development experience, with a strong focus on building scalable distributed systems.
- Deep, hands-on experience with Kubernetes in a production environment, including cluster management and writing operators, controllers, and custom resource definitions (CRDs).
- Proficiency in modern programming and scripting languages and data formats (e.g., Go, Python, Bash, JSON).
- Solid understanding of container technologies (Docker, containerd) and the container ecosystem.
- Experience with Infrastructure as Code (IaC) tools like Terraform or Ansible.
- Experience collaborating with front-end teams and defining API contracts.
Preferred Qualifications:
- Direct experience with Slurm or other HPC schedulers (e.g., LSF, PBS).
- E…