Technical Lead; DevOps & Infrastructure Focus - Vice President
Job in Mississauga, Ontario, Canada
Listed on 2026-01-11
Listing for: PowerToFly
Full Time position
Job specializations:
- IT/Tech
Cloud Computing, Systems Engineer
Job Description & How to Apply Below
Overview
We are seeking a highly skilled and experienced individual to fill a unique hybrid role that combines senior‑level DevOps and Infrastructure Engineering with the responsibilities of a Working Scrum Master. This position is for a hands‑on engineer who actively contributes to the design, implementation, and maintenance of our infrastructure and automation, while simultaneously facilitating the agile development process for their technical team.
The ideal candidate will be a strong technical leader, a passionate advocate for agile practices, and a driver of continuous improvement within a complex engineering environment. This role is for someone who thrives on both coding and coaching, with an additional understanding of the infrastructure needs and operational considerations for Artificial Intelligence and Machine Learning initiatives.
Hands‑on DevOps & Infrastructure Engineering
Design & Implementation:
Lead the design, implementation, and ongoing management of secure, scalable, and resilient infrastructure components.
Secret & Certificate Management:
Administer and maintain secret and certificate management solutions using HashiCorp Vault, including policy definition and integration.
Database Management:
Perform hands‑on administration and optimization of database systems (PostgreSQL, Oracle, MongoDB), including performance tuning, backup, and recovery strategies.
Workflow Orchestration:
Deploy, monitor, and troubleshoot data orchestration workflows using Apache Airflow, and develop/optimize DAGs.
Messaging Systems:
Implement and manage messaging queues such as Kafka and IBM MQ, including cluster setup and configuration.
API Integrations:
Develop, maintain, and troubleshoot RESTful API and SOAP integrations critical for system connectivity.
Build Automation:
Implement and optimize build and deployment processes using Gradle.
Container Orchestration:
Design, implement, and manage container orchestration platforms with Kubernetes and Helm, including integration with CyberArk and HashiCorp for secrets management. Create, debug, and troubleshoot Kubernetes Pods, Jobs, and Deployments using YAML.
Storage Management:
Configure and manage persistent storage solutions including PVC, SONiC NAS, and S3, with an awareness of storage requirements for AI/ML workloads.
Networking & Load Balancing:
Set up and maintain load balancing solutions (e.g., Nginx, HAProxy, AWS ELB/ALB, Kubernetes Ingress controllers) for high availability and performance.
Monitoring & Logging:
Implement, configure, and utilize comprehensive monitoring and logging solutions (Prometheus, Grafana, ELK Stack) to ensure system health and proactively identify issues, including those relevant to AI/ML applications.
Automation & Scripting:
Develop robust automation scripts and tools using Python, Bash, Go, or similar languages to streamline operations and enhance efficiency.
Incident Response:
Participate actively in on‑call rotations, responding to and resolving critical incidents with hands‑on troubleshooting.
Documentation:
Create and maintain technical documentation, architecture diagrams, and runbooks for infrastructure components and processes.
Working Scrum Master & Agile Facilitation
Agile Facilitation:
Facilitate all Scrum ceremonies (Sprint Planning, Daily Scrum, Sprint Review, Sprint Retrospective) for the DevOps/Infrastructure engineering team.
Technical Coaching:
Coach the team on advanced engineering practices, self‑organization, cross‑functionality, and continuous improvement in the context of infrastructure development, including support for AI/ML initiatives.
Impediment Resolution:
Proactively identify and resolve technical impediments and process bottlenecks within the team and across organizational boundaries, paying special attention to unique challenges posed by AI/ML infrastructure.
Backlog Refinement:
Collaborate closely with stakeholders (e.g., product owners, technical leads) to ensure a well‑defined and prioritized backlog for infrastructure work, technical debt, operational improvements, and AI/ML platform needs.
Process Improvement:
Drive continuous improvement in the team's agile and DevOps practices, helping them…