Senior Site Reliability Engineer - STACKIT Control Plane (m/f/d)
Listed on 2026-01-19
IT/Tech
Systems Engineer, Cloud Computing, SRE/Site Reliability
Location: Germany
Do you want to take the cloud world by storm with us STACKITEERs and help shape the future of Europe? That's great! Then STACKIT is the right place for you. Our vision is ambitious:
An independent Europe - digital, leading. As a cloud and colocation provider, we are building the secure infrastructure for this. With our server locations exclusively in Germany and Austria, we offer both the Schwarz Group, to which we belong, and external customers a European alternative to international cloud providers and support our customers holistically with individual solutions.
As a dedicated STACKITEER, you are part of the STACKIT Products division. This is where our products and services are developed, tested and improved.
As an SRE for the STACKIT Control Plane, you guide system architecture by operating at the intersection of development and systems engineering. Together with the development team, you design, build, and run large-scale systems that are inherently scalable and reliable. Your challenges range from optimizing databases and messaging systems to refining our STACKIT services.
- You collaborate closely with development teams to shorten time-to-detect intervals by enhancing our monitoring and alerting infrastructure and ensuring our services adhere to defined SLOs.
- Your work is critical in continuously optimizing our time-to-mitigation; you achieve this by creating clear playbooks, designing dashboards for first responders, and ensuring our telemetry data (logs and metrics) is comprehensive.
- You act as a reliability consultant to development teams, educating them on reliability patterns and helping them "shift left" to foster a shared responsibility model.
- You design and refine development practices, including CI/CD pipelines, to support progressive delivery strategies such as Canary releases and Blue/Green deployments.
- You proactively analyze and optimize the scalability of the Control Plane, addressing bottlenecks in distributed consensus, database throughput, and kernel-level networking.
- You participate in a compensated on-call rotation, leading incident responses and facilitating blameless post-mortems and Root Cause Analyses.
- You bring 3+ years of experience in Site Reliability Engineering, DevOps, or Platform Engineering, with a specific focus on operating large-scale distributed systems in production.
- You possess expert-level knowledge of Kubernetes Control Plane internals, including the API Server, Controller Manager, Scheduler, and etcd.
- You demonstrate proficiency in Go and write production-grade code to build automation tools, Kubernetes Operators, or glue code that integrates disparate systems.
- You hold deep experience with Infrastructure as Code and container infrastructure, alongside proficiency in Linux system internals (kernel tuning, memory management) and networking (TCP/IP, CNI, Load Balancers, eBPF).
- You bring experience in operating data stores (e.g., PostgreSQL, Redis) and messaging systems (e.g., Kafka, NATS) in scalable environments.
- You run towards fires to learn from them, you automate yourself out of a job, and you believe that hope is not a strategy.