DevOps/SRE (Kubernetes)

Job in Charlottetown, PEI, Canada
Listing for: Decode Talent
Full Time position
Listed on 2026-03-01
Job specializations:
  • IT/Tech
    Cloud Computing, SRE/Site Reliability, Systems Engineer
Job Description & How to Apply Below
Position: DevOps/SRE (Kubernetes)

About the Company

We’re operating critical infrastructure that powers applications serving millions of users globally. Our platform runs on Kubernetes across multiple regions, handling high‑traffic workloads with strict SLAs for uptime and performance. We’re looking for experienced infrastructure engineers who can help us scale reliably while maintaining security and operational excellence.

The Role

We’re seeking a Senior DevOps/SRE Engineer to own and evolve our Kubernetes‑based infrastructure. You’ll be responsible for cluster operations, security hardening, performance optimization, and ensuring our platform can scale to meet growing demands. This role requires someone who can balance the operational needs of running production systems with the long‑term vision of building self‑healing, automated infrastructure.

You’ll work closely with product engineering teams to improve developer experience, implement robust CI/CD pipelines, and build the observability systems needed to maintain high reliability. This isn’t just about keeping the lights on: you’ll shape the infrastructure strategy and help establish best practices that let the entire engineering organization move faster without sacrificing safety.

What You’ll Do
  • Manage and optimize multi‑tenant Kubernetes clusters running hundreds of services across multiple AWS regions
  • Implement security hardening measures including network policies, pod security standards, RBAC, and secrets management
  • Design and maintain Infrastructure as Code using Terraform for all AWS resources and Kubernetes manifests
  • Build and improve CI/CD pipelines using GitHub Actions, ArgoCD, or similar tools for automated deployments
  • Implement comprehensive observability using Prometheus, Grafana, Loki, and distributed tracing
  • Design and implement autoscaling strategies (HPA, VPA, cluster autoscaling) to handle traffic patterns efficiently
  • Manage service mesh configurations (Istio, Linkerd) for traffic management and security
  • Build disaster recovery procedures and conduct regular failure scenario testing
  • Optimize cloud costs through right‑sizing, spot instance usage, and resource efficiency improvements
  • Establish and maintain SLOs/SLIs for critical services, implementing alerting that minimizes noise
  • Participate in on‑call rotation, responding to incidents and conducting thorough post‑incident reviews
  • Create runbooks, documentation, and automation to reduce operational toil
  • Collaborate with development teams to optimize application performance and resource usage
  • Evaluate and integrate new infrastructure technologies that improve reliability or developer experience

What We’re Looking For

Required:

  • 5+ years of experience in DevOps, SRE, or platform engineering roles
  • Strong proficiency with Terraform for infrastructure as code across cloud providers
  • Expert‑level knowledge of AWS services: EC2, EKS, RDS, S3, VPC, IAM, CloudWatch, and more
  • Experience with container technologies (Docker, containerd) and container registries
  • Hands‑on experience implementing CI/CD pipelines with GitOps principles
  • Proficiency in scripting languages (Bash, Python, Go) for automation
  • Strong understanding of Linux systems administration and networking fundamentals
  • Production experience with monitoring and observability stacks (Prometheus, Grafana, ELK/Loki)
  • Understanding of security best practices including secrets management (Vault, SOPS, sealed‑secrets)
  • Experience with service mesh technologies and their operational challenges
  • Proven ability to debug complex distributed systems issues
  • Strong incident response and post‑mortem facilitation skills
  • Excellent documentation and communication abilities

Nice to Have:

  • Experience with multi‑cloud or hybrid cloud architectures
  • Background with GitOps tools (ArgoCD, Flux)
  • Familiarity with Helm and Kustomize for Kubernetes application management
  • Knowledge of eBPF‑based tools (Cilium, Pixie)
  • Experience with chaos engineering practices and tools (Chaos Mesh, Litmus)
  • Understanding of FinOps and cloud cost optimization strategies
  • Experience with compliance requirements (SOC2, HIPAA, PCI‑DSS)
  • Background in performance engineering and load testing
  • Familiarity with service mesh architectures (Istio, Linkerd, Consul)
  • Exper…