Listed on 2026-02-28
Senior AI Infrastructure Engineer
Company: WEX
Location: Boston, MA
Remote: Yes (must reside within 30 miles of one of the following locations: Portland, ME; Boston, MA; Chicago, IL; Dallas, TX; San Francisco Bay Area, CA; Seattle, WA)
Salary: $121.50K - $145.50K/yr
Type: Full-time
Benefits: Medical, Dental, Vision, Life, Retirement, PTO
Posted: 15 hours ago
Job Description
This is a remote position; however, the candidate must reside within 30 miles of one of the following locations: Portland, ME; Boston, MA; Chicago, IL; Dallas, TX; San Francisco Bay Area, CA; or Seattle, WA.
We are the backbone of the AI organization, building the high-performance compute foundation that powers our generative AI and machine learning initiatives. Our team bridges the gap between hardware and software, ensuring that our researchers and data scientists have a reliable, scalable, and efficient platform to train and deploy models. We focus on maximizing GPU utilization, minimizing inference latency, and creating a seamless "paved road" for AI development.
How You'll Make An Impact
You are a systems thinker who loves solving hard infrastructure challenges. You will architect the underlying platform that serves our production AI workloads, ensuring they are resilient, secure, and cost-effective. By optimizing our compute layer and deployment pipelines, you will directly accelerate the velocity of the entire AI product team, transforming how we deliver AI at scale.
Responsibilities
- Platform Architecture: Design and maintain a robust, Kubernetes-based AI platform that supports distributed training and high-throughput inference serving.
- Inference Optimization: Engineer low-latency serving solutions for LLMs and other models, optimizing engines (e.g., vLLM, TGI, Triton) to maximize throughput and minimize cost per token.
- Compute Orchestration: Manage and scale GPU clusters in cloud (AWS) or on-prem environments, implementing efficient scheduling, auto-scaling, and spot instance management to optimize costs.
- Operational Excellence (MLOps): Build and maintain Infrastructure as Code (Terraform/Ansible) and CI/CD pipelines to automate the lifecycle of model deployments and infrastructure provisioning.
- Reliability & Observability: Implement comprehensive monitoring (Prometheus, Grafana) for GPU health, model latency, and system resource usage; lead incident response for critical AI infrastructure.
- Developer Experience: Create tools and abstraction layers (SDKs, CLI tools) that allow data scientists to self-serve compute resources without managing underlying infrastructure.
- Security & Compliance: Ensure all AI infrastructure meets strict security standards, handling sensitive data encryption and access controls (IAM, VPCs) effectively.
Qualifications
- Experience: 5+ years of experience in DevOps, Site Reliability Engineering (SRE), or Platform Engineering, with at least 2 years focused on machine learning infrastructure.
- Production Expertise: Proven track record of managing large-scale production clusters (Kubernetes) and distributed systems.
- Hardware Fluency: Deep understanding of GPU architectures (NVIDIA A100/H100), CUDA drivers, and networking requirements for distributed workloads.
- Serving Proficiency: Experience deploying and scaling open-source LLMs and embedding models using containerized solutions.
- Automation First: Strong belief in "Everything as Code"; you automate toil wherever possible using Python, Go, or Bash.
Technical Skills
- Core Engineering: Expert proficiency in Python and Go; comfortable digging into lower-level system performance.
- Orchestration & Containers: Mastery of Kubernetes (EKS/GKE), Helm, Docker, and container runtimes. Experience with Ray or Slurm is a huge plus.
- Infrastructure as Code: Advanced skills with Terraform, CloudFormation, or Pulumi.
- Model Serving: Hands-on experience with serving frameworks like Triton Inference Server, vLLM, Text Generation Inference (TGI), or TorchServe.
- Cloud Platforms: Deep expertise in AWS (EC2, EKS, SageMaker) or GCP, specifically regarding GPU instance types and networking.
- Observability: Proficiency with Prometheus, Grafana, Datadog, and tracing tools (OpenTelemetry).
- Networking: Understanding of service mesh (Istio), load balancing, and high-performance networking (RPC, gRPC).