Software Engineer Lead - Cloud Engineering

Job in Palo Alto, Santa Clara County, California, 94306, USA
Listing for: The Rundown AI, Inc.
Full Time position
Listed on 2026-01-24
Job specializations:
  • IT/Tech
    Cloud Computing, Systems Engineer, SRE/Site Reliability, Data Engineer
Salary/Wage Range or Industry Benchmark: 200000 - 250000 USD Yearly
Job Description & How to Apply Below

The Cloud Infrastructure team at Kumo is responsible for managing and scaling our Kubernetes-based, cloud-native AI platform across multiple cloud providers. They set service level objectives, optimize resource allocation, enforce security compliance, and drive cost efficiency for the Multi-Cloud Platform.

As a key team member, you will architect and operate a highly scalable, resilient Kubernetes infrastructure to support massive Big Data and AI workloads. You’ll design and implement advanced cluster management strategies and fleet capacity scaling, optimize workload scheduling, and enhance observability. Your expertise in Kubernetes internals, networking, and performance tuning will be critical in ensuring high availability and seamless scaling.

Joining early, you'll play a pivotal role in shaping platform reliability, automating infrastructure, and enabling ML engineers with efficient commit-to-production automation, Continuous Provisioning, CI/CD, ML Ops, and deployment orchestration and workflows. You'll collaborate with ML scientists, product engineers, and leadership to influence scaling strategies, develop self-service tooling, and drive multi-cloud resilience. Engineers at Kumo take ownership of core system design, building infrastructure that powers the next generation of AI applications.

Key Responsibilities
  • Design, build, and scale Kubernetes-based infrastructure to support Kumo’s multi-cloud AI platform, ensuring high availability, resilience, and performance.
  • Architect and optimize large-scale Kubernetes clusters, improving scheduling, networking (CNI), and workload orchestration for production environments.
  • Develop and extend Kubernetes controllers and operators to automate cluster management, lifecycle operations, and scaling strategies.
  • Enhance observability, diagnostics, and monitoring by building tools for real-time cluster health tracking, alerting, and performance tuning.
  • Lead efforts to automate fleet management, optimizing node pools, autoscaling, and multi-cluster deployments across AWS, GCP, and Azure.
  • Define and implement Kubernetes security policies, RBAC models, and best practices to ensure compliance and platform integrity.
  • Collaborate with ML engineers and platform teams to optimize Kubernetes for machine learning workloads, ensuring seamless resource allocation for AI/ML models.
  • Drive commit-to-production automation, cloud connectivity, and deployment orchestration, ensuring seamless application rollouts, zero-downtime upgrades, and global infrastructure reliability.
Required Skills and Experience
  • Kubernetes Mastery: 8-10+ years of experience managing large-scale Kubernetes clusters (EKS, GKE, AKS, or Open Source) in production. Deep expertise in Kubernetes internals, including controllers, operators, scheduling, networking (CNI), and security policies.
  • Cloud-Native Infrastructure: 8-10+ years of experience building cloud-native Kubernetes-based infrastructure across AWS, Azure, and GCP.
  • Platform Engineering: 8-10+ years of experience building Kubernetes service meshes (Istio/Envoy, Traefik), networking policies (Calico/Tigera), and distributed ingress/egress control.
  • Fleet Management & Scaling: Proven experience in optimizing, scaling, and maintaining Kubernetes clusters across multi-cloud environments, ensuring high availability and performance.
  • Software Development: 8-10+ years of experience writing production-grade controllers and operators in Python, Go, or Rust to extend Kubernetes functionality.
  • Infrastructure-as-Code & Automation: Hands-on experience with Terraform, CloudFormation, Ansible, Bash, and Make scripting to automate Kubernetes cluster provisioning and management.
  • Distributed Systems & SaaS: Expertise in building and operating large-scale distributed systems for cloud-native B2B SaaS applications running on Kubernetes.
  • Cloud Application Deployment: Deep expertise in building container orchestration, workload scheduling, and runtime optimizations using Kubernetes, Argo, or Flux.
  • Education: BS/MS in Computer Science or a related field (PhD preferred)
Nice to Have
  • Proficiency with cloud platforms such as AWS, GCP, or Azure.
  • Familiarity with chaos engineering tools and…