Senior Site Reliability Engineer Job, Toronto area, Ontario, Canada, IT/Tech

What is the Opportunity?
We are seeking an experienced Senior Site Reliability Engineer and Systems Specialist to join our team. You will be responsible for ensuring the stability, reliability, and performance of our mission‑critical application.

What will you do?

Provide expert‑level support and maintenance for our mission‑critical application, ensuring high availability and performance

Collaborate with cross‑functional teams to identify and resolve technical issues, and implement preventative measures to minimize downtime

Develop and maintain automation scripts using Python or shell scripting to streamline application maintenance and deployment tasks

Design and implement DevOps/SRE automation solutions to improve application reliability, scalability, and efficiency

Administer and troubleshoot Linux‑based systems, including configuration, security, and performance optimization

Develop and maintain SQL scripts to support data analysis, reporting, and application functionality

Participate in on‑call rotations to provide 24/7 support for critical application issues

Collaborate with development teams to ensure smooth deployment of new features and updates

Develop and maintain technical documentation to support application maintenance and troubleshooting

Reliability & Performance Engineering

Design, implement, and maintain scalable systems with high availability, reliability, and performance

Define and monitor SLAs, SLOs, and SLIs; drive observability improvements

Conduct capacity planning, performance tuning, and system optimization

Develop and implement disaster recovery and business continuity strategies

Automation & Infrastructure as Code

Develop and maintain Infrastructure as Code (IaC) using tools such as Copilot, RBC Assist, etc.

Build automation for CI/CD pipelines to streamline software delivery and deployment

Automate routine operational tasks to improve efficiency and reduce human error

Create and maintain reliable deployment processes, including blue‑green and canary releases

Monitoring, Incident Response & Root Cause Analysis

Own on‑call responsibilities and develop processes to reduce alert fatigue

Lead incident response efforts, including communication and postmortem documentation

Implement and enhance monitoring and alerting systems (e.g., Prometheus, Grafana, Datadog)

Champion blameless postmortems and drive systemic fixes to recurring issues

Collaboration, Governance & Mentorship

Collaborate closely with development, security, and operations teams to embed reliability practices

Drive SRE best practices across teams and influence architecture and design decisions

Participate in internal audits and compliance activities related to infrastructure and availability

Mentor junior SREs and contribute to internal knowledge sharing and documentation

What do you need to succeed?
Must‑Have:

Bachelor’s or Master’s degree in Computer Science, Software Engineering, or a related field

3+ years’ experience with system administration on Red Hat Linux or Apache, Solar

1+ years’ experience with Ansible

Strong experience with Python scripting

Experience providing Production Support

Experience with monitoring/SRE tools like Dynatrace, PagerDuty, ELK Stack

Nice to haves:

Knowledge/experience with AI (Agents, LLMs etc.)

Knowledge of Managed File Transfer platforms

Knowledge of DevOps tools like Git, Docker, Jenkins, and Kubernetes

What’s in it for you?

A comprehensive Total Rewards Program including bonuses and flexible benefits, competitive compensation, commissions, and stock where applicable

Leaders who support your development through coaching and managing opportunities

Ability to make a difference and lasting impact

Work in a dynamic, collaborative, progressive, and high‑performing team

A world‑class training program in financial services

Flexible work/life balance options

Opportunities to do challenging work

Opportunities to take on progressively greater accountabilities

Opportunities to build close relationships with clients

Job Skills
Agile Methodology, Ansible Tower, Application Infrastructure, Application Production Support, Automation, Dev Ops, Generative AI, Generative Programming, Group Problem Solving, IT Automation, IT Monitoring,…

