Site Reliability Engineer – GenAI Platform

Job in Mirabel, Montréal, Province de Québec, Canada
Listing for: Astra North Infoteck Inc.
Full Time position
Listed on 2026-02-28
Job specializations:
  • IT/Tech
    Systems Engineer, Cloud Computing, IT Support, SRE/Site Reliability
Location: Mirabel

Job Description
  • Experience:

    8+ years of experience as a Site Reliability Engineer or in a similar role, with hands-on experience supporting IaaS platforms and with networking and systems engineering knowledge.

  • Roles and Responsibilities:

    • Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)

    • Design and build automation for core platform capabilities, reducing manual toil

    • Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.

    • Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards

    • Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation

    • Perform capacity planning, scaling strategies, workload scheduling, and resource forecasting

    • Optimize cost vs. performance tradeoffs in large-scale compute environments

    • Harden systems for security, compliance, auditability, and data governance

    • Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems

    • Define disaster recovery (DR) strategies, backup/restore practices, and fault tolerance mechanisms

    • Maintain runbooks, operational playbooks, documentation, and training materials

    • Participate in on-call rotations and respond to production incidents 24/7 as needed

    • Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability

  • Skills:

    • Production experience in SRE / Infrastructure / ops for large-scale systems

    • Strong programming/scripting skills (Python, Go, Java, or equivalent)

    • Deep experience with containerization (Docker), orchestration (Kubernetes, etc.)

    • Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)

    • Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures

    • Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)

    • Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)

    • Solid experience in capacity planning, performance tuning, scaling, and incident response

    • Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements

    • Experience in regulated environments (financial services, compliance, audit, security) is a strong plus

    • Excellent communication, documentation, and cross-team collaboration skills

    • Proven track record of reducing operational toil via automation
