IT Engineer V
Listed on 2026-03-01
IT/Tech
Cloud Computing, Data Engineer, AI Engineer, Systems Engineer
MLOps Platform Engineer
The Data Modeling Analytics & AI Engineering team is seeking an experienced MLOps Platform Engineer to design, build, and support enterprise-grade machine learning operations capabilities. This role will play a key part in enabling scalable, reliable, and secure ML model development and deployment across our cloud and container platforms.
This is a hands‑on engineering role requiring strong expertise in AWS, Kubernetes (EKS), CI/CD automation, containerization, and ML platform operations. The ideal candidate will have solid engineering fundamentals combined with practical knowledge of ML workflows, deployment patterns, and platform reliability.
Key Responsibilities
Platform Engineering & Operations
- Engineer, manage, and support MLOps platform components across AWS and EKS-based environments.
- Oversee deployment, configuration, and operation of infrastructure used for ML training, batch inference, and real‑time model serving.
- Ensure platform availability, resilience, and performance across dev, test, and production environments.
- Implement role‑based access controls (RBAC), network policies, and scalable namespace designs within EKS.
- Build and support CI/CD pipelines (GitLab) for model packaging, container image builds, vulnerability scanning, and automated deployment flows.
- Enable standardized model release processes including environment promotion, versioning, and rollback workflows.
- Integrate CI/CD with ML frameworks, model repositories, artifacts, and runtime environments.
- Design and manage EKS workloads supporting containerized ML jobs and microservices.
- Implement auto‑scaling, resource quotas, cluster optimization, and multi‑tenant workload isolation.
- Support GPU and CPU‑based training/inference workloads.
- Implement logging, monitoring, and alerting for ML pipelines, model endpoints, batch jobs, and platform components.
- Analyze compute, storage, and data transfer usage to optimize cost efficiency across ML workloads.
- Perform incident response, root cause analysis, and long‑term remediation planning.
- Partner with Data Scientists, ML Engineers, and application teams to operationalize end‑to‑end machine learning solutions.
- Provide technical guidance on best practices for ML model lifecycle management, deployment patterns, and scalable architectures.
- Contribute to documentation, runbooks, onboarding materials, and internal knowledge bases.
Qualifications
- 3+ years of hands‑on experience with AWS services, including EKS, EC2, S3, IAM, CloudWatch, and ECR.
- Strong experience operating and troubleshooting Kubernetes (preferably AWS EKS).
- Proficiency in containerization (Docker) and orchestration concepts.
- Strong programming/scripting experience in Python and Bash.
- Experience building and managing CI/CD pipelines (GitLab or equivalent).
- Familiarity with machine learning workflows, including training, inference, and model monitoring.
- Experience with infrastructure‑as‑code (Terraform or CloudFormation).
- Experience supporting production platforms, including incident management and root cause analysis.
- Experience managing data analytics platforms and tools (e.g., Domino, SageMaker).
- Experience with ML lifecycle tools such as MLflow or similar.
- Experience supporting GPU‑based workloads or distributed training environments.
- Familiarity with enterprise MLOps architectures and patterns (batch, real‑time, microservices).
- Understanding of data processing frameworks and feature pipelines.
- Strong analytical, troubleshooting, and problem‑solving skills.
- Effective communication and documentation abilities.
- Ability to collaborate across engineering, analytics, and product teams.
- Self‑motivated with the ability to drive initiatives independently.
- Ability to work in a complex, regulated enterprise environment.
EEO:
Mindlance is an Equal Opportunity Employer and does not discriminate in employment on the basis of minority status, gender, disability, religion, LGBTQI status, age, or veteran status.