Position Description:
We are Canada's largest independent information technology services firm, and after 45 years, we're still growing! Join us as a "Site Reliability Engineering Consultant" in our Banking Services Division.
Location: Downtown Toronto (hybrid: 3 days in the office, the rest remote)
The SRE will help elevate the reliability, performance, and efficiency of mission-critical batch workloads across Capital Markets Technology. You will be the technical lead for hands-on automation, application development, host systems engineering, and observability via Dynatrace, with a primary focus on optimizing batch runtimes. If you love shaving milliseconds off latency, removing toil with code, and building resilient systems that just don't fail, you'll thrive here.
This role is critical to our operational excellence strategy and will play a key role in maturing our reliability engineering practices across the Capital Markets domain.
Your future duties and responsibilities:
• Reliability & Performance:
Ensure the stability of batch processing pipelines and optimize them; reduce runtimes and failure rates, and engineer for resiliency.
• Observability:
Implement and maintain monitoring with Dynatrace; create dashboards, alerts, and runbooks.
• Systems Engineering:
Manage and tune Linux and Windows systems for performance and resilience.
• Automation & Orchestration:
Create, modify, and optimize Airflow DAGs; build CI/CD pipelines for automation.
• Incident Management:
Lead incident response, root cause analysis, and postmortems; enforce SLOs and reliability practices.
• Security & Compliance:
Apply security best practices and ensure regulatory compliance in systems and automation.
Required qualifications to be successful in this role:
• Expert-level Python:
Advanced coding, performance tuning, concurrency (async/multiprocessing), testing, and packaging.
• Linux Systems Expertise:
Kernel/OS tuning, networking, filesystem optimization, process management, and troubleshooting.
• Dynatrace Mastery:
Custom dashboards, KPIs, anomaly detection, tagging strategy, and alerting configuration.
• Airflow Expertise:
DAG design best practices, SLA management, scheduler/executor tuning, and scaling strategies.
• Proven experience optimizing batch workloads for performance, reliability, and cost.
• Strong understanding of distributed systems concepts: retries, idempotency, backpressure, and data integrity.
• Strong understanding of backend systems and database optimization.
• Proficiency with CI/CD pipelines (GitHub Actions, Azure DevOps, Jenkins) and Infrastructure as Code (Terraform, Ansible).
• Proven experience with containers and orchestration (Docker, Kubernetes).
• Excellent incident management and root cause analysis skills.
• Strong communication and collaboration abilities.
Use of the term ‘engineering’ in this job posting refers to the technical sense related to Information Technology (IT) and does not imply that the individual practices engineering or possesses the requisite license as prescribed by the applicable provincial or territorial engineering regulator. We are seeking individuals with expertise in IT engineering-related functions, but licensure from an engineering regulator is not a prerequisite for this position.
Engineering is a regulated profession in Canada which is restricted in terms of use of titles and designation.