Senior Software Engineer, Server Fleet Infrastructure
Listed on 2026-01-12
CoreWeave is The Essential Cloud for AI™. Built for pioneers by pioneers, CoreWeave delivers a platform of technology, tools, and teams that enables innovators to build and scale AI with confidence. Trusted by leading AI labs, startups, and global enterprises, CoreWeave combines superior infrastructure performance with deep technical expertise to accelerate breakthroughs and turn compute into capability. Founded in 2017, CoreWeave became a publicly traded company (Nasdaq: CRWV) in March 2025.
At CoreWeave, we don’t box people into rigid job titles. We look for exceptional engineers and match them to the work that excites them most. Instead of measuring you against a narrow checklist of qualifications, we hire based on broad technical domains and use our interview process to determine where you’ll have the biggest impact. Tell us what interests you most, and throughout the hiring process, we’ll get a sense of your strengths, expertise, and aspirations.
If you join CoreWeave, you’ll land on the team where you can do your best work: driving innovation, solving complex problems, and shaping the future of cloud computing.
At CoreWeave, infrastructure isn’t just a foundation; it’s a product. We build scalable, high-performance computing systems that power the largest AI workloads in the world. We’re looking for engineers who thrive at the intersection of software and systems, deploying and managing large-scale bare metal compute. Within this domain, you’ll design and build software that manages complex infrastructure across globally distributed datacenters.
You’ll work in Go and Python/Ansible, deep in Linux environments and observability/monitoring stacks, leveraging technologies like gRPC and Kubernetes CRDs/Controllers/Operators. Whether you’re automating bare metal, building fleet lifecycle management services, solving multi-layer integration challenges, or observing our globally distributed fleet, your work will be critical to the company's delivery of reliable and efficient infrastructure.
- Design and implement solutions to problems of scale for multi-site deployment and management of CoreWeave’s global server hardware fleet.
- Build and maintain backend services and APIs (gRPC/REST) in Go or Python to interact with Kubernetes and other infrastructure systems.
- Develop provisioning services, automation workflows, and fleet management tools that span from bare metal to container orchestration.
- Write and maintain Kubernetes custom controllers and operators to automate infrastructure behavior.
- Design and implement observability solutions for large-scale server monitoring to improve system stability and insight.
- Adapt and extend open source tooling to enhance visibility into system metrics, performance, and health.
- Create test plans, deployment automation, dashboards, alerts, and insights into our fleet operations.
- Resolve integration challenges across the entire infrastructure stack, from data center hardware to orchestration platforms.
- Participate in an on-call rotation.
Minimum Qualifications
- 5+ years of experience in software or infrastructure engineering.
- Proficiency in Go and/or Python software development.
- Familiarity with CI/CD tools like Argo, Flux, and GitHub Actions.
- Strong understanding of Linux internals.
- Experience designing, implementing, and monitoring Kubernetes operators for custom resource definitions.
- Experience with infrastructure automation and configuration management tools such as Ansible, Puppet, Chef, or Salt.
- Experience with distributed cloud computing principles, including testing strategies, observability, error budgets, and fault-tolerant design.
- Experience implementing metrics pipelines, custom alerts, and monitoring strategies.
- Ability to break down complex problems into achievable tasks and collaborate with teammates to execute them.
- Willingness and ability to thrive in a fast‑paced startup environment.
Automation is key to delivering reliable GPU compute to clients. On this team…