DevOps/SRE; Kubernetes Job, Charlottetown area, PEI, Canada, IT/Tech

Position: DevOps/SRE (Kubernetes)

About the Company

We’re operating critical infrastructure that powers applications serving millions of users globally. Our platform runs on Kubernetes across multiple regions, handling high‑traffic workloads with strict SLAs for uptime and performance. We’re looking for experienced infrastructure engineers who can help us scale reliably while maintaining security and operational excellence.

The Role

We’re seeking a Senior DevOps/SRE Engineer to own and evolve our Kubernetes‑based infrastructure. You’ll be responsible for cluster operations, security hardening, performance optimization, and ensuring our platform can scale to meet growing demands. This role requires someone who can balance the operational needs of running production systems with the long‑term vision of building self‑healing, automated infrastructure.

You’ll work closely with product engineering teams to improve developer experience, implement robust CI/CD pipelines, and build the observability systems needed to maintain high reliability. This isn’t just about keeping the lights on—you’ll shape the infrastructure strategy and help establish best practices that enable the entire engineering organization to move faster safely.

What You’ll Do

Manage and optimize multi‑tenant Kubernetes clusters running hundreds of services across multiple AWS regions
Implement security hardening measures including network policies, pod security standards, RBAC, and secrets management
Design and maintain Infrastructure as Code using Terraform for all AWS resources and Kubernetes manifests
Build and improve CI/CD pipelines using GitHub Actions, ArgoCD, or similar tools for automated deployments
Implement comprehensive observability using Prometheus, Grafana, Loki, and distributed tracing
Design and implement autoscaling strategies (HPA, VPA, cluster autoscaling) to handle traffic patterns efficiently
Manage service mesh configurations (Istio, Linkerd) for traffic management and security
Build disaster recovery procedures and conduct regular failure scenario testing
Optimize cloud costs through right‑sizing, spot instance usage, and resource efficiency improvements
Establish and maintain SLOs/SLIs for critical services, implementing alerting that minimizes noise
Participate in on‑call rotation, responding to incidents and conducting thorough post‑incident reviews
Create runbooks, documentation, and automation to reduce operational toil
Collaborate with development teams to optimize application performance and resource usage
Evaluate and integrate new infrastructure technologies that improve reliability or developer experience

What We’re Looking For

Required:

5+ years of experience in DevOps, SRE, or platform engineering roles
Strong proficiency with Terraform for infrastructure as code across cloud providers
Expert‑level knowledge of AWS services: EC2, EKS, RDS, S3, VPC, IAM, CloudWatch, and more
Experience with container technologies (Docker, containerd) and container registries
Hands‑on experience implementing CI/CD pipelines with GitOps principles
Proficiency in scripting languages (Bash, Python, Go) for automation
Strong understanding of Linux systems administration and networking fundamentals
Production experience with monitoring and observability stacks (Prometheus, Grafana, ELK/Loki)
Understanding of security best practices including secrets management (Vault, SOPS, sealed‑secrets)
Experience with service mesh technologies and their operational challenges
Proven ability to debug complex distributed systems issues
Strong incident response and post‑mortem facilitation skills
Excellent documentation and communication abilities

Nice to Have:

Experience with multi‑cloud or hybrid cloud architectures
Background with GitOps tools (ArgoCD, Flux)
Familiarity with Helm and Kustomize for Kubernetes application management
Knowledge of eBPF‑based tools (Cilium, Pixie)
Experience with chaos engineering practices and tools (Chaos Mesh, Litmus)
Understanding of FinOps and cloud cost optimization strategies
Experience with compliance requirements (SOC2, HIPAA, PCI‑DSS)
Background in performance engineering and load testing
Familiarity with service mesh architectures (Istio, Linkerd, Consul)
Exper…

