Sr. Software Engineer - AI/ML Infra
Listed on 2026-01-12
IT/Tech
AI Engineer, Machine Learning/ ML Engineer, Cloud Computing
At GEICO, we offer a rewarding career where your ambitions are met with endless possibilities.
Every day we honor our iconic brand by offering quality coverage to millions of customers and being there when they need us most. We thrive through relentless innovation to exceed our customers’ expectations while making a real impact for our company through our shared purpose.
When you join our company, we want you to feel valued, supported and proud to work here. That’s why we offer The GEICO Pledge:
Great Company, Great Culture, Great Rewards and Great Careers.
The GEICO AI Platform and Infrastructure team is seeking an exceptional Senior ML Platform Engineer to build and scale our machine learning infrastructure with a focus on Large Language Models (LLMs) and AI applications. This role combines deep technical expertise in cloud platforms, container orchestration, and ML operations with strong leadership and mentoring capabilities. You will be responsible for designing, implementing, and maintaining scalable, reliable systems that enable our data science and engineering teams to deploy and operate LLMs efficiently. The candidate must have excellent verbal and written communication skills and a proven ability to work both independently and in a team environment.
KEY RESPONSIBILITIES
ML Platform & Infrastructure
- Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
- Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
- Design, implement, and maintain feature stores for ML model training and inference pipelines
- Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
- Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
- Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
- Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
- Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
- Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
- Evaluate and, where appropriate, implement hybrid cloud solutions with AWS/GCP as a backup or for specialized use cases
- Design and maintain robust CI/CD pipelines for ML model deployment using Azure DevOps, GitHub Actions, and MLOps tools
- Implement automated model training, validation, deployment, and monitoring workflows
- Set up comprehensive observability using Prometheus, Grafana, Azure Monitor, and custom dashboards
- Continuously optimize platform performance, reducing latency and improving throughput for ML workloads
- Design and implement backup, recovery, and business continuity plans for ML platforms
- Mentor junior engineers and data scientists on platform best practices, infrastructure design, and ML operations
- Lead comprehensive code reviews focusing on scalability, reliability, security, and maintainability
- Design and deliver technical onboarding programs for new team members joining the ML platform team
- Establish and champion engineering standards for ML infrastructure, deployment practices, and operational procedures
- Create technical documentation, runbooks, and deliver internal training sessions on platform capabilities
- Work closely with data scientists to understand requirements and optimize workflows for model development and deployment
- Collaborate with product engineering teams to integrate ML capabilities into customer-facing applications
- Support research teams with infrastructure for experimenting with cutting‑edge LLM techniques and architectures
- Present technical solutions and platform roadmaps to leadership and cross‑functional stakeholders
- Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent experience)
- 8+ years of software engineering experience with focus on infrastructure,…