Optum is a global organization that delivers care, aided by technology, to help millions of people live healthier lives. The work you do with our team will directly improve health outcomes by connecting people with the care, pharmacy benefits, data and resources they need to feel their best. Here, you will find a culture guided by inclusion, talented peers, comprehensive benefits and career development opportunities.
Come make an impact on the communities we serve as you help us advance health optimization on a global scale. Join us to start Caring. Connecting. Growing together.
We are seeking an accomplished DevOps Engineer to design, build, and operate secure, scalable, and automated platforms that support advanced AI/ML and Generative AI workloads across Azure and AWS, with solid capability to interoperate with GCP. You will own CI/CD, infrastructure-as-code, container orchestration, observability, and reliability engineering, partnering with Data Science and Security teams to deliver responsible, reliable AI services for healthcare analytics.
Role Summary
We're looking for a DevOps Engineer to design, build, and operate secure, scalable, and cost-efficient platform capabilities for AI/ML and GenAI workloads on Azure and AWS.
Manage and operate cloud infrastructure to ensure reliability, scalability, and cost efficiency of applications and AI services
Plan and execute CI/CD pipelines across the lifecycle: plan, code, build, test, stage, release, config, monitor
Onboard applications to the DevOps toolchain; standardize golden paths and reusable Terraform modules and Helm charts
Automate testing and deployments end-to-end; enforce trunk-based development and automated quality gates
Collaborate with developers to integrate application code with OS/runtime and production infrastructure (container images, base OS hardening, dependencies)
Provide timely support on DevOps tooling; resolve incidents and requests within SLAs and follow the escalation matrix; perform RCA and implement durable fixes
Primary Responsibilities:
Platform, Automation & Reliability
Design, provision, and operate production-grade AKS (Azure) and EKS (AWS) clusters; implement autoscaling, multi-AZ/region topologies, and safe upgrades
Implement Infrastructure-as-Code with Terraform/Terragrunt and Helm; enforce GitOps with Argo CD or Flux for declarative, auditable changes
Build CI/CD with GitHub Actions and Azure DevOps; also support Jenkins and GitLab CI/CD; manage artifact provenance and deployment strategies (blue/green, canary)
Establish observability using OpenTelemetry, Prometheus/Grafana, ELK/OpenSearch, Azure Monitor, and CloudWatch; define SLOs/SLIs
Engineer networking and traffic controls: ingress controllers, API gateways (NGINX/Envoy/Kong), service mesh (Istio/Linkerd), and WAFs; implement rate limiting and DDoS protections
AI/ML & GenAI Enablement
Operate AI training/inference platforms on Azure Machine Learning and Amazon SageMaker; manage model and data artifacts with MLflow/registries
Operationalize RAG/LLM services with Azure OpenAI and AWS Bedrock; standardize serving via KServe or managed endpoints; integrate vector databases
Implement data/model lineage, drift detection, shadow testing, and automated rollback based on health and evaluation signals
Security, Compliance & Governance
Apply Zero-Trust and least-privilege access (Azure AD, AWS IAM); implement RBAC, workload identity, network segmentation, and pod security standards
Centralize secrets with Azure Key Vault and AWS Secrets Manager/Parameter Store; implement rotation and access auditing
Maintain SBOMs and image signing with attestations; prevent deployment of non-compliant artifacts; automate compliance evidence collection
Operations & Support
Run on-call and incident response with playbooks and blameless postmortems; drive MTTR reduction and reliability improvements
Provide timely support across multiple platforms; ensure customer satisfaction and SLA adherence; follow the escalation matrix for complex cases
Implement Infrastructure-as-Code with Terraform and Deployment Manager
Build CI/CD pipelines with GitHub Actions (and Cloud Build where applicable)
Containerize and deploy applications using Docker and Kubernetes (GKE)
Automate operational tasks using Linux, Bash, and Python scripting
Monitor systems with Prometheus, Grafana, Splunk, and Kibana
DevOps & SRE Competencies
Monitoring and logging solutions: Prometheus, Grafana, ELK/Elastic Stack, OpenSearch, Splunk, Kibana
Understanding of security best practices and compliance automation integrated into pipelines
Comply with the terms and conditions of the employment contract, company policies and procedures, and any and all directives (such as, but not limited to, transfer and/or re-assignment to different work locations, change in teams and/or work shifts, policies in regards to flexibility of work benefits and/or work environment, alternative work arrangements, and other decisions that may arise due to the changing business…