Software Engineer Lead - Cloud Engineering
Listed on 2026-01-24
IT/Tech
Cloud Computing, Systems Engineer, SRE/Site Reliability, Data Engineer
The Cloud Infrastructure team at Kumo is responsible for managing and scaling our Kubernetes-based, cloud-native AI platform across multiple cloud providers. They set service level objectives, optimize resource allocation, enforce security compliance, and drive cost efficiency for the Multi-Cloud Platform.
As a key team member, you will architect and operate a highly scalable, resilient Kubernetes infrastructure to support massive Big Data and AI workloads. You’ll design and implement advanced cluster management strategies, scale fleet capacity, optimize workload scheduling, and enhance observability. Your expertise in Kubernetes internals, networking, and performance tuning will be critical in ensuring high availability and seamless scaling.
Joining early, you'll play a pivotal role in shaping platform reliability, automating infrastructure, and giving ML engineers efficient commit-to-production automation, continuous provisioning, CI/CD, MLOps, and deployment orchestration workflows. You'll collaborate with ML scientists, product engineers, and leadership to influence scaling strategies, develop self-service tooling, and drive multi-cloud resilience. Engineers at Kumo take ownership of core system design, building infrastructure that powers the next generation of AI applications.
Key Responsibilities
- Design, build, and scale Kubernetes-based infrastructure to support Kumo’s multi-cloud AI platform, ensuring high availability, resilience, and performance.
- Architect and optimize large-scale Kubernetes clusters, improving scheduling, networking (CNI), and workload orchestration for production environments.
- Develop and extend Kubernetes controllers and operators to automate cluster management, lifecycle operations, and scaling strategies.
- Enhance observability, diagnostics, and monitoring by building tools for real-time cluster health tracking, alerting, and performance tuning.
- Lead efforts to automate fleet management, optimizing node pools, autoscaling, and multi-cluster deployments across AWS, GCP, and Azure.
- Define and implement Kubernetes security policies, RBAC models, and best practices to ensure compliance and platform integrity.
- Collaborate with ML engineers and platform teams to optimize Kubernetes for machine learning workloads, ensuring seamless resource allocation for AI/ML models.
- Drive commit-to-production automation, cloud connectivity, and deployment orchestration, ensuring seamless application rollouts, zero-downtime upgrades, and global infrastructure reliability.

- Kubernetes Mastery: 8-10+ years of experience managing large-scale Kubernetes clusters (EKS, GKE, AKS, or open source) in production. Deep expertise in Kubernetes internals, including controllers, operators, scheduling, networking (CNI), and security policies.
- Cloud-Native Infrastructure: 8-10+ years of experience building cloud-native Kubernetes-based infrastructure across AWS, Azure, and GCP.
- Platform Engineering: 8-10+ years of experience building Kubernetes service meshes (Istio/Envoy, Traefik), networking policies (Calico/Tigera), and distributed ingress/egress control.
- Fleet Management & Scaling: Proven experience in optimizing, scaling, and maintaining Kubernetes clusters across multi-cloud environments, ensuring high availability and performance.
- Software Development: 8-10+ years of experience writing production-grade controllers and operators in Python, Go, or Rust to extend Kubernetes functionality.
- Infrastructure-as-Code & Automation: Hands-on experience with Terraform, CloudFormation, Ansible, Bash, and Make scripting to automate Kubernetes cluster provisioning and management.
- Distributed Systems & SaaS: Expertise in building and operating large-scale distributed systems for cloud-native B2B SaaS applications running on Kubernetes.
- Cloud Application Deployment: Deep expertise in building container orchestration, workload scheduling, and runtime optimizations using Kubernetes, Argo, or Flux.
- Education: BS/MS in Computer Science or a related field (PhD preferred)
- Proficiency with cloud platforms such as AWS, GCP, or Azure.
- Familiarity with chaos engineering tools and…