DevOps Engineer
Listed on 2026-02-28
IT/Tech
Systems Engineer, Cloud Computing, SRE/Site Reliability
Job title: DevOps Engineer
Location: London; fully in-office by default
Start date: ASAP
Reports to: CTO
Compensation: £60k-£90k + equity
Cosine at a glance
At Cosine, we’re building autonomous AI engineers that plan, write, and ship code inside real development workflows.
Cosine is designed for on-premise and virtual private cloud (VPC) deployments, including fully air-gapped environments. We build our agent tooling entirely in-house and post-train open-source models to deliver reliable, enterprise-grade coding performance in security-critical settings.
In 2024, Cosine achieved a 72% score on OpenAI’s SWE-Lancer benchmark, placing us among the strongest real-world software-engineering AI systems evaluated.
YC-backed and well-funded, Cosine was founded by experienced operators focused on building dependable, production-grade AI.
This role is based in our Hoxton office, five days a week, because close collaboration, fast feedback, and shared context matter for the problems we’re solving.
The role
We’re looking for a DevOps / Senior Platform / Infra Engineer to own the core infrastructure that powers Cosine’s products, from Kubernetes and deployment pipelines to networking and platform services.
You’ll design and run the “paved road” that our engineers, researchers, and customers build on: reliable Kubernetes clusters, fast and safe CI/CD, solid observability, and hardened environments for demanding enterprise and on-prem deployments. You’ll also wear a classic “DevOps/SRE” hat: thinking in SLOs, running incident response, and keeping us up even as we move quickly.
This is a high-ownership role at a fast-paced, venture-backed Silicon Valley startup. You’ll work directly with founding engineers and leadership, and your decisions will materially shape how we build and ship products.
What you’ll do
Own core infrastructure
Design, operate, and evolve our Kubernetes-based platform (EKS or similar), including cluster topology, node groups, autoscaling, and multi-environment isolation.
Manage supporting cloud resources: container registries, load balancers, queues, caches, and data infra needed to run our APIs and agents.
Build the deployment & tooling layer
Design and maintain CI/CD pipelines for image builds and infra rollouts (e.g. Pulumi/Terraform + Helm/Docker).
Implement safe rollout strategies (blue/green, canary, staged rollouts) and fast rollback paths.
Build internal tools and abstractions that make it easy for product teams to self-serve infra safely.
Own reliability & operations (SRE-ish)
Define and track SLOs/SLIs for key services (latency, error rates, availability).
Improve our observability stack (metrics, logs, traces, alerts) so issues are obvious, actionable, and debuggable.
Participate in the on-call rotation, lead incident response when needed, and drive blameless post-mortems and fixes.
Shape networking & security
Design and maintain networking: VPCs, subnets, ingress/egress, service meshes / L7 routing, DNS, and TLS.
Implement least-privilege access via IAM, secure secret management, and hardened configurations for multi-tenant and isolated customer environments.
Help design patterns for secure enterprise and on-prem / regulated deployments.
Partner with product & research
Work closely with application, ML, and research teams to understand their needs and translate them into reusable infra building blocks.
Provide guidance on “how to run this in production” — capacity planning, failure modes, and operational readiness reviews.
Have strong experience
5+ years building and operating production infrastructure on a major cloud (AWS, GCP, or Azure).
Significant hands‑on experience running Kubernetes in production (EKS/GKE/AKS or self-managed):
Cluster upgrades, autoscaling, node group design, and multi‑env setups.
Helm or similar for packaging services.
Think in infrastructure-as-code
Deep experience with IaC tools (Pulumi, Terraform, CDK, or similar).
Comfortable managing infra changes via code review, CI, and automated rollouts.
Care deeply about reliability
Have owned the uptime and performance of user-facing systems.
Comfortable participating in (and improving) on‑call rotations and…