Senior Site Reliability Engineer Job, Toronto area, Ontario, Canada, IT/Tech

What is the Opportunity?

We are seeking an experienced and skilled Senior Site Reliability Engineer and Systems Specialist to join our team, responsible for ensuring the stability, reliability, and performance of our mission‑critical application.

What will you do?

Provide expert‑level support and maintenance for our mission‑critical application, ensuring high availability and performance
Collaborate with cross‑functional teams to identify and resolve technical issues, and implement preventative measures to minimize downtime
Develop and maintain automation scripts using Python or shell scripting to streamline application maintenance and deployment tasks
Design and implement DevOps/SRE automation solutions to improve application reliability, scalability, and efficiency
Administer and troubleshoot Linux‑based systems, including configuration, security, and performance optimization
Develop and maintain SQL scripts to support data analysis, reporting, and application functionality
Participate in on‑call rotations to provide 24/7 support for critical application issues
Collaborate with development teams to ensure smooth deployment of new features and updates
Develop and maintain technical documentation to support application maintenance and troubleshooting
Reliability & Performance Engineering
Design, implement, and maintain scalable systems with high availability, reliability, and performance
Define and monitor SLAs, SLOs, and SLIs; drive observability improvements
Conduct capacity planning, performance tuning, and system optimization
Develop and implement disaster recovery and business continuity strategies
Automation & Infrastructure as Code
Develop and maintain Infrastructure as Code (IaC) using tools like Copilot, RBC Assist, etc.
Build automation for CI/CD pipelines to streamline software delivery and deployment
Automate routine operational tasks to improve efficiency and reduce human error
Create and maintain reliable deployment processes, including blue‑green and canary releases
Monitoring, Incident Response & Root Cause Analysis
Own on‑call responsibilities and develop processes to reduce alert fatigue
Lead incident response efforts, including communication and postmortem documentation
Implement and enhance monitoring and alerting systems (e.g., Prometheus, Grafana, Datadog)
Champion blameless postmortems and drive systemic fixes to recurring issues
Collaboration, Governance & Mentorship
Collaborate closely with development, security, and operations teams to embed reliability practices
Drive SRE best practices across teams and influence architecture and design decisions
Participate in internal audits and compliance activities related to infrastructure and availability
Mentor junior SREs and contribute to internal knowledge sharing and documentation

What do you need to succeed?

Must‑have:

Bachelor’s or Master’s degree in Computer Science, Software Engineering, or a related field
3+ years’ experience with system administration in Red Hat Linux OS or Apache, Solar
1+ years’ experience with Ansible
Strong experience with Python scripting
Experience providing production support
Experience with monitoring/SRE tools like Dynatrace, PagerDuty, ELK Stack

Nice to haves:

Knowledge/experience with AI (Agents, LLMs etc.)
Knowledge of Managed File Transfer platforms
Knowledge of DevOps tools like Git, Docker, Jenkins, and Kubernetes

What’s in it for you?

A comprehensive Total Rewards Program including bonuses and flexible benefits, competitive compensation, commissions, and stock where applicable
Leaders who support your development through coaching and managing opportunities
Ability to make a difference and lasting impact
Work in a dynamic, collaborative, progressive, and high‑performing team
A world‑class training program in financial services
Flexible work/life balance options
Opportunities to do challenging work
Opportunities to take on progressively greater accountabilities
Opportunities to build close relationships with clients

Job Skills

Agile Methodology, Ansible Tower, Application Infrastructure, Application Production Support, Automation, DevOps, Generative AI, Generative Programming, Group Problem Solving, IT Automation, IT…

