
Senior AI Infrastructure Engineer

Full Time, Remote/Work from Home position
Listed on 2026-02-28
Job specializations:
  • IT/Tech
    Systems Engineer, AI Engineer, Cloud Computing, Data Engineer
Salary/Wage Range or Industry Benchmark: 121,000 - 145,000 USD yearly
Job Description & How to Apply Below

Senior AI Infrastructure Engineer

Company: WEX

Location: Boston, MA

Remote: Yes (must reside within 30 miles of one of the following locations: Portland, ME; Boston, MA; Chicago, IL; Dallas, TX; San Francisco Bay Area, CA; Seattle, WA)

Salary: $121.50K - $145.50K/yr

Type: Full-time

Benefits: Medical, Dental, Vision, Life, Retirement, PTO

Posted: 15 hours ago

Job Description


About The Team

We are the backbone of the AI organization, building the high-performance compute foundation that powers our generative AI and machine learning initiatives. Our team bridges the gap between hardware and software, ensuring that our researchers and data scientists have a reliable, scalable, and efficient platform to train and deploy models. We focus on maximizing GPU utilization, minimizing inference latency, and creating a seamless "paved road" for AI development.

How You'll Make An Impact

You are a systems thinker who loves solving hard infrastructure challenges. You will architect the underlying platform that serves our production AI workloads, ensuring they are resilient, secure, and cost-effective. By optimizing our compute layer and deployment pipelines, you will directly accelerate the velocity of the entire AI product team, transforming how we deliver AI at scale.

Responsibilities
  • Platform Architecture:
    Design and maintain a robust, Kubernetes-based AI platform that supports distributed training and high-throughput inference serving.
  • Inference Optimization:
    Engineer low-latency serving solutions for LLMs and other models, optimizing engines (e.g., vLLM, TGI, Triton) to maximize throughput and minimize cost per token.
  • Compute Orchestration:
    Manage and scale GPU clusters in cloud (AWS) or on-prem environments, implementing efficient scheduling, auto-scaling, and spot instance management to optimize costs.
  • Operational Excellence (MLOps):
    Build and maintain "Infrastructure as Code" (Terraform/Ansible) and CI/CD pipelines to automate the lifecycle of model deployments and infrastructure provisioning.
  • Reliability & Observability:
    Implement comprehensive monitoring (Prometheus, Grafana) for GPU health, model latency, and system resource usage; lead incident response for critical AI infrastructure.
  • Developer Experience:
    Create tools and abstraction layers (SDKs, CLI tools) that allow data scientists to self-serve compute resources without managing underlying infrastructure.
  • Security & Compliance:
    Ensure all AI infrastructure meets strict security standards, handling sensitive data encryption and access controls (IAM, VPCs) effectively.
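
As a flavor of the "cost per token" optimization work described above, here is a minimal, stdlib-only sketch of the kind of back-of-the-envelope math involved. All instance prices and throughput figures below are hypothetical, not real quotes or WEX numbers:

```python
# Hypothetical sketch: estimating serving cost per million generated tokens,
# as referenced under "Inference Optimization" and "Compute Orchestration".
# Prices and throughput numbers are illustrative only.

def cost_per_million_tokens(hourly_instance_cost: float,
                            tokens_per_second: float) -> float:
    """Dollars per 1M generated tokens at a given sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_instance_cost / tokens_per_hour * 1_000_000

# Example: an on-demand GPU node vs. a spot node at a discount, both
# sustaining the same aggregate throughput of 2,500 tokens/s.
on_demand = cost_per_million_tokens(32.0, 2500)   # $32/hr (hypothetical)
spot = cost_per_million_tokens(12.0, 2500)        # $12/hr (hypothetical)

print(f"on-demand: ${on_demand:.2f} per 1M tokens")
print(f"spot:      ${spot:.2f} per 1M tokens")
```

The same arithmetic, run across instance types and serving engines, is what drives decisions like spot-instance scheduling or switching inference backends.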
Experience You'll Bring
  • 5+ years of experience in DevOps, Site Reliability Engineering (SRE), or Platform Engineering, with at least 2 years focused on Machine Learning infrastructure.
  • Production Expertise:
    Proven track record of managing large-scale production clusters (Kubernetes) and distributed systems.
  • Hardware Fluency:
    Deep understanding of GPU architectures (NVIDIA A100/H100), CUDA drivers, and networking requirements for distributed workloads.
  • Serving Proficiency:
    Experience deploying and scaling open-source LLMs and embedding models using containerized solutions.
  • Automation First:
    Strong belief in "Everything as Code"—you automate toil wherever possible using Python, Go, or Bash.
Technical Skills
  • Core Engineering:
    Expert proficiency in Python and Go; comfortable digging into lower-level system performance.
  • Orchestration & Containers:
    Mastery of Kubernetes (EKS/GKE), Helm, Docker, and container runtimes. Experience with Ray or Slurm is a huge plus.
  • Infrastructure as Code:
    Advanced skills with Terraform, CloudFormation, or Pulumi.
  • Model Serving:
    Hands-on experience with serving frameworks like Triton Inference Server, vLLM, Text Generation Inference (TGI), or TorchServe.
  • Cloud Platforms:
    Deep expertise in AWS (EC2, EKS, SageMaker) or GCP, specifically regarding GPU instance types and networking.
  • Observability:
    Proficiency with Prometheus, Grafana, Datadog, and tracing tools (OpenTelemetry).
  • Networking:
    Understanding of service mesh (Istio), load balancing, and high-performance networking (RPC, gRPC).
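
To illustrate the latency-observability side of this role, a minimal stdlib-only sketch of computing request-latency percentiles (p50/p99), the kind of signal a Prometheus/Grafana dashboard would surface for model serving. The `percentile` helper and sample data are illustrative, not part of any named tool:

```python
# Illustrative sketch: nearest-rank latency percentiles over raw request
# timings, the raw form of a p99 panel on a serving dashboard.
from typing import Sequence

def percentile(latencies_ms: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a list of latencies (pct in 0-100)."""
    if not latencies_ms:
        raise ValueError("no samples")
    ordered = sorted(latencies_ms)
    # Nearest-rank method: ceil(pct/100 * n), as a 1-based rank.
    rank = max(1, -(-len(ordered) * pct // 100))  # ceiling division
    return ordered[int(rank) - 1]

# Made-up timings: mostly fast requests plus one slow outlier (ms).
samples = [12.0, 15.0, 14.0, 220.0, 13.0, 16.0, 12.5, 14.5, 13.5, 15.5]
print(f"p50 = {percentile(samples, 50):.1f} ms")
print(f"p99 = {percentile(samples, 99):.1f} ms")
```

Note how the p99 catches the outlier that the median hides; that gap between median and tail latency is exactly what SLO-driven incident response for AI infrastructure watches.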