Lead Engineer, ML Network Stack - Annapurna Labs
Listed on 2026-03-01
IT/Tech
Cloud Computing, AI Engineer, Systems Engineer, Data Engineer
We are seeking an experienced engineer and technical leader to join our team that owns the network stack for EC2 distributed AI/ML systems. The team develops support for a variety of frameworks and communication libraries including NCCL, NVSHMEM, NIXL, NCCL GIN, and Perplexity kernels. Solid knowledge of Linux, networking, and performant coding is important. Experience with embedded systems and high‑speed networking or HPC/RDMA interconnects is highly valued.
If you like solving hard problems, want to work with HPC and ML customers, iterate fast and deliver meaningful solutions at scale, then come join us! This truly is a role at the forefront of AI/ML—you'll be working on features for the largest clusters, with the largest customers, for the largest AI models.
This is a technical lead role, with the expectation of growing into a technical manager role. We are specifically seeking candidates who want to develop their career as a technical manager.
About the team:
Annapurna Labs is an integral part of AWS that develops hardware and software components that are critical building blocks for EC2 infrastructure.
Be the lead engineer on a team that builds and maintains the infrastructure that monitors and reports on the functionality and performance of massive testing workloads. Use internal Amazon CI/CD tools, Linux, and public AWS products to automate the delivery of our software to customers, saving developer time. Write Python code that effortlessly spins up large clusters and runs benchmarks and applications for ML and HPC workloads.
Use AWS Managed Grafana and Athena to digest the massive amount of performance data generated by these workloads and create dashboards for developers and stakeholders. Invent automatic mechanisms to alert developers to functional and performance regressions so they never reach customers. Manage the complexity of infrastructure that covers many instance types, software stacks, Linux operating systems, and cutting‑edge releases, and make it easy to evolve.
Qualifications
- 5+ years of non‑internship professional software development experience
- 5+ years of experience leading design or architecture (design patterns, reliability and scaling) of new and existing systems
- 5+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
- 3+ years of experience as a mentor or tech lead, or leading engineering teams
- 3+ years of experience in SW/HW co‑design
- Bachelor’s degree in computer science or equivalent
- Experience creating automated dashboards and visualization (such as Grafana)
Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status.
Salary:
USA, CA, Cupertino – $ – $ USD annually
USA, WA, Seattle – $ – $ USD annually
Benefits include: health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and optional Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage), 401(k) matching, paid time off, and parental leave.