
AI Infrastructure & Solutions Architect

Job in Toronto, Ontario, C6A, Canada
Listing for: University of Toronto
Full Time position
Listed on 2026-02-28
Job specializations:
  • IT/Tech
    Systems Engineer, Cloud Computing, Data Engineer, AI Engineer
Salary/Wage Range or Industry Benchmark: CAD 123,756.00 per year
Job Description & How to Apply Below
Date Posted:  02/24/2026
Req:  47032
Faculty/Division:  Ofc of the Chief Information Officer
Department:  Enterprise Infrastructure Solutions
Campus:  St. George (Downtown Toronto)
Position Number:  
Existing Vacancy:  Yes

Description:

About us
The Enterprise Infrastructure Solutions (EIS) group, part of the Information Technology Services (ITS) division, is responsible for campus core network, campus wireless, wide area network connectivity and internet connectivity for the University, including connectivity to research and education networks. EIS is also responsible for services related to departmental network management, network, server and storage infrastructure, Windows and Linux server management services, database and application integration and support, enterprise backup service, 24/7 operation of central administrative data centers and telecommunications services.

If you’re motivated and passionate about learning technologies and dedicated to improving experiences for today’s students, consider a career with us.

Your opportunity
Reporting to the Manager, AI Engineering & Operations within the Enterprise Infrastructure Solutions group, the AI Infrastructure & Solutions Architect plays a critical role in defining the future of research and administrative computing at the University. In this role, you will lead the architectural design and deployment of secure, scalable AI platforms that serve the entire campus community. You will bridge the gap between high‑performance hardware and practical user applications, ensuring that our AI platforms, from GPU infrastructure to sovereign data sandboxes, are reliable, supportable, and aligned with institutional governance and ethical AI frameworks.

You will collaborate closely with AI Developers & Integration Specialists, while retaining ownership of platform architecture, reliability, security posture, and lifecycle management across on‑premises and hybrid cloud environments.

This role offers a rare opportunity to design and operate sovereign, on‑premises AI platforms at institutional scale, supporting both cutting‑edge research and mission‑critical administrative use cases.

Responsibilities

Architecting and operating container orchestration platforms (Kubernetes/K8s), including GPU operators and AI‑aware schedulers for efficient accelerator utilization (an illustrative GPU‑scheduling sketch follows this list).

Designing and enforcing AI platform security and governance controls, including data sovereignty, access isolation, auditability, and compliance with privacy and ethical AI frameworks.

Implementing observability and monitoring solutions to track model performance, drift, GPU utilization, inference latency, and platform health using tools such as Prometheus, Grafana, or specialized AI monitoring stacks (see the monitoring sketch after this list).

Designing and operating GPU and accelerator platforms (e.g., NVIDIA, AMD, or emerging accelerators), including capacity planning, scheduling strategies, and lifecycle management.

Analyzing platform usage metrics to optimize token consumption, GPU allocation, and overall cost efficiency while maintaining performance and reliability.

Partnering with AI Developers & Integration Specialists to define platform abstractions, deployment patterns, and service interfaces that enable rapid innovation without compromising security or supportability.

Producing and maintaining architectural documentation, disaster recovery and business continuity plans, and technical guidance for researchers and platform users.
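
To give a concrete sense of the container orchestration responsibility above, the following is a minimal, illustrative sketch (not part of the posting) of how a GPU‑aware workload might be submitted through the official Kubernetes Python client. The namespace, container image, node label, and resource sizes are assumptions chosen only for illustration; a real platform would typically rely on the NVIDIA GPU operator (or an equivalent device plugin) to expose the nvidia.com/gpu extended resource to the scheduler.

# Illustrative only: submit a pod that asks the Kubernetes scheduler for one GPU.
# Namespace, image, and node label below are hypothetical placeholders.
from kubernetes import client, config


def submit_gpu_pod(namespace: str = "ai-research") -> None:
    """Create a pod whose container requests a single NVIDIA GPU."""
    config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="llm-inference-demo"),
        spec=client.V1PodSpec(
            restart_policy="Never",
            # Steer the pod toward GPU nodes; the label key/value is an assumed convention.
            node_selector={"accelerator": "nvidia-gpu"},
            containers=[
                client.V1Container(
                    name="inference",
                    image="nvcr.io/nvidia/pytorch:24.01-py3",
                    command=["python", "-c", "import torch; print(torch.cuda.is_available())"],
                    resources=client.V1ResourceRequirements(
                        # GPUs are requested as the extended resource "nvidia.com/gpu",
                        # which the GPU operator / device plugin advertises on each node.
                        limits={"nvidia.com/gpu": "1", "memory": "16Gi", "cpu": "4"},
                    ),
                )
            ],
        ),
    )
    client.CoreV1Api().create_namespaced_pod(namespace=namespace, body=pod)


if __name__ == "__main__":
    submit_gpu_pod()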
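
Similarly, for the observability responsibility, here is a minimal, hedged sketch using the prometheus_client library to expose the kinds of signals the posting names (GPU utilization, inference latency) for scraping by Prometheus and charting in Grafana. The metric names, port, and utilization probe are hypothetical placeholders standing in for a real driver-level source such as NVML.

# Illustrative only: a tiny exporter for GPU utilization and inference latency.
import random
import time

from prometheus_client import Gauge, Histogram, start_http_server

GPU_UTILIZATION = Gauge(
    "ai_platform_gpu_utilization_percent",
    "Current GPU utilization as reported by the accelerator driver",
    ["gpu"],
)
INFERENCE_LATENCY = Histogram(
    "ai_platform_inference_latency_seconds",
    "End-to-end latency of a single inference request",
    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)


def scrape_gpu_utilization() -> float:
    """Stand-in for a real probe (e.g. NVML); returns a dummy value for the sketch."""
    return random.uniform(0, 100)


if __name__ == "__main__":
    start_http_server(9400)  # Prometheus scrapes this endpoint; Grafana charts the result
    while True:
        GPU_UTILIZATION.labels(gpu="0").set(scrape_gpu_utilization())
        with INFERENCE_LATENCY.time():
            time.sleep(random.uniform(0.05, 0.3))  # placeholder for an actual inference call
        time.sleep(5)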

Essential Qualifications

Bachelor’s degree in Computer Science, Information Technology, Engineering, or an acceptable combination of education and equivalent experience.

Eight or more years of experience in on‑premises or cloud infrastructure management.

Three to five or more years of direct AI infrastructure or MLOps experience (recognizing the rapid evolution of the field), with demonstrated exposure to LLM platforms, RAG pipelines, or large‑scale ML systems.

Deep expertise in Infrastructure as Code (IaC) and automation for managing complex, multi‑environment platforms.

Advanced Kubernetes and container orchestration knowledge, including GPU scheduling, operators, and container runtimes (Docker, Podman).

MLOps and model…