Listed on 2026-02-28
Senior AI Infrastructure Engineer
Company: WEX
Location: Boston, MA
Remote: Yes (must reside within 30 miles of one of the following locations: Portland, ME; Boston, MA; Chicago, IL; Dallas, TX; San Francisco Bay Area, CA; Seattle, WA)
Salary: $121.50K - $145.50K/yr
Type: Full-time
Benefits: Medical, Dental, Vision, Life, Retirement, PTO
Posted: 15 hours ago
Job Description
This is a remote position; however, the candidate must reside within 30 miles of one of the following locations: Portland, ME; Boston, MA; Chicago, IL; Dallas, TX; San Francisco Bay Area, CA; or Seattle, WA.
We are the backbone of the AI organization, building the high-performance compute foundation that powers our generative AI and machine learning initiatives. Our team bridges the gap between hardware and software, ensuring that our researchers and data scientists have a reliable, scalable, and efficient platform to train and deploy models. We focus on maximizing GPU utilization, minimizing inference latency, and creating a seamless "paved road" for AI development.
How You'll Make An Impact
You are a systems thinker who loves solving hard infrastructure challenges. You will architect the underlying platform that serves our production AI workloads, ensuring they are resilient, secure, and cost-effective. By optimizing our compute layer and deployment pipelines, you will directly accelerate the velocity of the entire AI product team, transforming how we deliver AI at scale.
Responsibilities
- Platform Architecture: Design and maintain a robust, Kubernetes-based AI platform that supports distributed training and high-throughput inference serving.
- Inference Optimization: Engineer low-latency serving solutions for LLMs and other models, optimizing engines (e.g., vLLM, TGI, Triton) to maximize throughput and minimize cost per token.
- Compute Orchestration: Manage and scale GPU clusters in cloud (AWS) or on-prem environments, implementing efficient scheduling, auto-scaling, and spot instance management to optimize costs.
- Operational Excellence (MLOps): Build and maintain Infrastructure as Code (Terraform/Ansible) and CI/CD pipelines to automate the lifecycle of model deployments and infrastructure provisioning.
- Reliability & Observability: Implement comprehensive monitoring (Prometheus, Grafana) for GPU health, model latency, and system resource usage; lead incident response for critical AI infrastructure.
- Developer Experience: Create tools and abstraction layers (SDKs, CLI tools) that allow data scientists to self-serve compute resources without managing underlying infrastructure.
- Security & Compliance: Ensure all AI infrastructure meets strict security standards, handling sensitive data encryption and access controls (IAM, VPCs) effectively.
Qualifications
- Experience: 5+ years of experience in DevOps, Site Reliability Engineering (SRE), or Platform Engineering, with at least 2 years focused on machine learning infrastructure.
- Production Expertise: Proven track record of managing large-scale production clusters (Kubernetes) and distributed systems.
- Hardware Fluency: Deep understanding of GPU architectures (NVIDIA A100/H100), CUDA drivers, and networking requirements for distributed workloads.
- Serving Proficiency: Experience deploying and scaling open-source LLMs and embedding models using containerized solutions.
- Automation First: Strong belief in "Everything as Code"; you automate toil wherever possible using Python, Go, or Bash.
Technical Skills
- Core Engineering: Expert proficiency in Python and Go; comfortable digging into lower-level system performance.
- Orchestration & Containers: Mastery of Kubernetes (EKS/GKE), Helm, Docker, and container runtimes. Experience with Ray or Slurm is a huge plus.
- Infrastructure as Code: Advanced skills with Terraform, CloudFormation, or Pulumi.
- Model Serving: Hands-on experience with serving frameworks like Triton Inference Server, vLLM, Text Generation Inference (TGI), or TorchServe.
- Cloud Platforms: Deep expertise in AWS (EC2, EKS, SageMaker) or GCP, specifically regarding GPU instance types and networking.
- Observability: Proficiency with Prometheus, Grafana, Datadog, and tracing tools (OpenTelemetry).
- Networking: Understanding of service mesh (Istio), load balancing, and high-performance networking (RPC, gRPC).