AI DevOps and Cloud Infrastructure Engineer
Listed on 2026-01-10
Apply for the AI DevOps and Cloud Infrastructure Engineer role at Crowe.
The AI DevOps and Cloud Infrastructure Engineer I (Senior Staff) designs, builds, and operates scalable, secure, and highly automated cloud environments that support the training, deployment, monitoring, and continuous delivery of AI and machine learning systems. This role serves as a subject‑matter expert in infrastructure automation, distributed compute orchestration, and cloud platform operations, ensuring that AI workloads perform reliably across development, staging, and production environments.
Responsibilities
- Architect and maintain cloud infrastructure for AI model training, inference services, and distributed compute workloads.
- Implement infrastructure‑as‑code (IaC) to automate provisioning, configuration, scaling, and lifecycle management of cloud resources.
- Design and operate CI/CD pipelines for automated model training, testing, and deployment of AI‑enabled applications.
- Optimize Kubernetes clusters, GPU utilization, and compute scaling strategies to balance performance, reliability, and cost.
- Integrate AI models, inference endpoints, and data pipelines into cloud‑native platforms.
- Develop monitoring, logging, alerting, and observability solutions using modern telemetry and tracing tools.
- Troubleshoot issues across networking, containers, compute, storage, and model‑serving layers.
- Lead performance benchmarking, load testing, and reliability validation for AI systems.
- Document infrastructure architectures, operational runbooks, and engineering standards.
- Support automation for dataset ingestion, model versioning, artifact management, and ML testing.
- Ensure compliance with cloud security, identity management, encryption, and responsible AI guidelines.
- Partner with security teams to implement secure networking, IAM policies, and secrets management.
- Provide technical mentorship, design reviews, and cloud best‑practice guidance to junior engineers.
- Evaluate new cloud services, platform capabilities, and AI infrastructure tooling for adoption.
Qualifications
- 4+ years of experience in DevOps, cloud engineering, platform engineering, or infrastructure engineering.
- Strong proficiency with Kubernetes, Docker, and cloud orchestration platforms.
- Extensive experience with CI/CD systems and deployment automation.
- Demonstrated ability to debug distributed systems and cloud networking issues.
- Proficiency in Python, Bash, or other automation/scripting languages.
- Strong communication skills and ability to collaborate across engineering and security teams.
- Willingness to travel occasionally for cross‑functional planning and collaboration.
- Bachelor’s degree in Computer Science, Cloud Engineering, Information Systems, or a related technical field, or equivalent experience.
Preferred Qualifications
- Master’s degree in a technical discipline.
- Experience enabling ML or AI workloads at scale in production environments.
- Cloud and platform certifications, including Azure (AZ‑900, AZ‑104, AZ‑305, AZ‑700, AI‑102) or equivalent AWS/GCP certifications.
- Advanced experience with AWS (e.g., EKS, EC2, IAM, Lambda, SageMaker) and/or Azure (e.g., AKS, VMSS, Azure ML).
- Experience with GPU orchestration and scaling strategies for AI workloads.
- Expertise with Terraform or other infrastructure‑as‑code frameworks.
- Hands‑on experience with observability stacks such as Prometheus, Grafana, CloudWatch, and OpenTelemetry.
- Experience deploying and operating generative AI workloads, including LLM inference autoscaling and RAG architectures.
- Familiarity with vector database hosting (e.g., Pinecone, Weaviate, FAISS) and model‑serving frameworks (e.g., Hugging Face TGI, vLLM, custom inference containers).
- Experience building CI/CD pipelines for LLM fine‑tuning workflows (e.g., LoRA, QLoRA, PEFT) and monitoring generative AI performance metrics such as latency, throughput, and hallucination rates.
Your exceptional people experience starts here. At Crowe, we care about our people and offer a comprehensive total rewards package.
Job Details
Final date to receive applications: 03/31/2026.
Wage range: $74,100.00 – $ per year…