Site Reliability Engineer (f/m/d), IONOS Group, Berlin, Germany (IT/Information Technology)

Job title: Site Reliability Engineer (f/m/d)

We are seeking a highly skilled and experienced Site Reliability Engineer to join our team working on a 24/7 shift basis. The Site Reliability Engineering L2 department operates all IONOS Cloud IaaS and PaaS services. As a Site Reliability Engineer, you will be responsible for ensuring the stability, security, and performance of our complex and distributed systems. You will work closely with our development teams to design, implement, and maintain scalable and reliable infrastructure, and to automate and optimize our systems and processes.

Responsibilities

Provide technical Level 2 support with direct customer contact.
Maintain monitoring, logging, and alerting solutions using tools such as Prometheus, Grafana, and Loki to proactively detect blockers during the shift rotation, and contribute to resolving complex issues in distributed systems.
Troubleshoot network (LAN/WAN/VPN, DNS, DHCP) and storage systems (file/object/block), including provisioning and operating highly available services on Linux and Kubernetes with Helm charts.
Maintain Infrastructure as Code, automation, and playbooks using tools such as Ansible, Terraform, GitLab CI/CD, and ArgoCD, and scripting languages like Bash, Python, and Go.
Collaborate with development teams to enhance processes and deployments, and to ensure smooth integration of new services and applications into our cloud and Kubernetes environment.
Ensure the stable and secure operation of our platforms, including management of incidents end-to-end, from initial analysis to resolution and follow-up through Problem Management.

Qualifications

Willingness to work in a 24x7 shift model that includes nights, weekends, and holidays, combined with a strong problem-solving and troubleshooting approach to complex technical problems.
You have multiple years of experience as a Site Reliability Engineer or in a related role (Linux System Administrator, Platform Engineer, DevOps/Infrastructure Engineer, Full Stack Developer).
Strong experience with automation tools (e.g., Ansible, SaltStack), monitoring and observability tools (e.g., Prometheus, Grafana, Loki), and logging and alerting solutions (e.g., ELK Stack).
Strong experience with virtualized environments, including QEMU/KVM, OpenStack, and Proxmox, and with cloud storage technologies (file, object, block), plus proficiency with Docker and Kubernetes (K8s).
Proficiency in at least one programming or scripting language (e.g., Go, Python, Bash) for automation and monitoring tasks.
Experience with code management is required; knowledge of merge conflicts, feature branches, merge requests, and continuous integration and delivery (CI/CD) is a plus.

Nice to Have

Experience with RDMA, InfiniBand, and RoCE protocols.
Strong experience with Linux MD RAID (mdadm, sedadm) and LVM.
Proficiency in Linux performance tuning and network stack debugging (e.g., ethtool, perf, tcpdump, ibstat, ibtop).
Experience with S3, Ceph and software‑defined networks.
Experience with established software development practices, including code reviews, build processes, packaging, and testing.

Language Skills

Fluency in German and English is required (at least CEFR level B2).

Location

Berlin

At the end of the application process, candidates must undergo a security check. Your consent will be requested in good time during the process.

Benefits

Hybrid working model.
Shift working hours.
A subsidized canteen and various free drinks at some locations.
Modern office space with very good transport connections.
Various employee discounts for activities and products.
Employee events such as summer and winter parties, as well as workshops.
Numerous training and development opportunities.
Various health offers, such as sports and health courses.
