Software Development Engineer III - ML Ops
Listed on 2026-03-01
IT/Tech
Cloud Computing, AI Engineer, Machine Learning/ ML Engineer, Data Engineer
Overview
Everseen: A leader in vision AI solutions for the world’s leading retailers.
The Role
We are seeking a Machine Learning Platform/Backend Engineer to design, build, and maintain scalable infrastructure that empowers our data scientists and machine learning engineers to develop, train, benchmark, and monitor machine learning models efficiently. You will be instrumental in shaping our internal Machine Learning Platform and driving automation, reproducibility, and performance across the machine learning lifecycle. As part of this role, you will own the design and implementation of the internal ML platform, enabling end-to-end workflow orchestration, resource management, and automation using cloud-native technologies (GCP/Azure).
You will also design and manage Kubernetes-based infrastructure for multi-tenant GPU and CPU workloads with strong isolation, quota control, and monitoring, and integrate and extend orchestration tools (Airflow, Kubeflow, Ray, Vertex AI, Azure ML, or custom schedulers) to automate data processing, training, and deployment pipelines. You will develop shared services for tracking model behavior and performance, versioning data and datasets, and managing artifacts (MLflow, DVC, or custom registries), with a clear focus on documenting architecture, policies, and operations runbooks.
What you'll do
- Teaching and Sharing Culture: Share skills, knowledge, and expertise with members of the data engineering team. Foster a culture of collaboration and continuous learning by organizing training sessions, workshops, and knowledge-sharing sessions.
- Design and Development: Collaborate with cross-functional teams and drive progress in designing and developing new features and functionality. Ensure that the developed solutions meet project objectives and enhance user experience. Influence decision-making by contributing to strategic technical improvements.
- Coding: Based on requirements and a longer-term product and feature strategy, design and implement reusable, testable, efficient, and elegant code. Ensure adherence to coding standards and best practices.
- Testing: Create, maintain, and run unit tests for new and existing applications and services. Aim to deliver defect-free and well-tested solutions.
- Data Analysis: Collect and analyze data from various sources such as log files, application stack traces, and thread dumps. Use this analysis to identify trends, patterns, and potential areas for improvement, and begin implementing changes based on the findings.
- Continuous Integration and Continuous Deployment (CI/CD): Create and maintain CI/CD integration using various tools. Automate the build, test, and deployment processes to ensure efficiency and reliability.
- Integration of Third-Party Solutions: Research and propose third-party software solutions to optimize system performance. Expand product capabilities by integrating compatible third-party solutions. Monitor updates to third-party solutions and track their compatibility with the Everseen stack in line with internal development guidelines.
- Monitoring and Troubleshooting: Monitor production logs to identify and troubleshoot issues promptly. Ensure seamless operation and timely resolution of any anomalies to maintain system reliability.
- Documentation: Create, review, and maintain high-quality technical documentation to ensure clarity, consistency, and knowledge sharing within the development team.
Collaborate with
- AI/ML Engineering team
- Data Engineering team
- Software Development Engineers
- DevOps team
- Product Managers
- Security & Compliance Teams
Profile and Skills
- 4-5+ years of work experience in ML infrastructure, MLOps, or Platform Engineering
- Bachelor's degree or equivalent in computer science is preferred
- Excellent communication and collaboration skills.
- Technical Skills: Expert knowledge of Python. Experience with CI/CD tools (e.g., GitLab, Jenkins). Hands-on experience with Kubernetes, Docker, and cloud services. Understanding of ML training pipelines, data lifecycle, and model serving concepts. Familiarity with workflow orchestration tools (e.g., Airflow, Kubeflow, Ray, Vertex…