Chaos Engineering Expert
Listed on 2026-01-16
Position Title : Chaos Engineering Expert (Resiliency & Performance Engineer)
Location : Riyadh / Hybrid (On-prem + Off-shore)
Employment Type : Full-time
Reports to : Head of Infrastructure / SRE / Platform Engineering
We are seeking an experienced Chaos Engineering Expert to help strengthen the resiliency, performance, and security posture of our hybrid infrastructure. In this role, you will design, execute, and analyze chaos experiments across our on-premises servers, databases, and application services, and work with our team to embed resilience into our systems. You will also deliver a maturity report with insights and recommended improvements to system architecture, operational processes, and incident response, enabling us to anticipate and mitigate failures before they impact customers.
Responsibilities :
- Design, plan, and implement chaos engineering experiments across all layers of our infrastructure (physical / virtual servers, network, storage, databases, applications, and services).
- Develop hypotheses (failure scenarios), define metrics, and create success criteria for experiments.
- Execute fault-injection / chaos tests (in pre-production, staging, or controlled production environments), ensuring minimal risk to business operations.
- Monitor and instrument system behavior during experiments using observability tools.
- Analyze the results of experiments to identify vulnerabilities, failure modes, and weak points, and derive actionable recommendations.
- Collaborate with DevOps, SRE, DBA, security, network, business continuity management (BCM), and operations teams to remediate issues uncovered by experiments and to meet system recovery time objectives (RTOs).
- Integrate chaos experiments into the CI/CD pipeline or into release / reliability practices.
- Build a chaos framework suitable for our hybrid environment.
- Document all experiments, including design, configuration, execution details (drills), results, lessons learned, and corrective actions.
- Develop and maintain runbooks, playbooks, and operational procedures for resilience testing.
- Participate in post-incident reviews, feeding lessons from chaos experiments into incident response and root cause analysis.
Requirements :
- 6+ years of experience in site reliability engineering (SRE), performance engineering, and infrastructure engineering.
- Proven track record of designing and executing fault injection, resilience testing, and chaos experiments.
- Deep understanding of on-premises infrastructure : physical and virtual servers, hypervisors, networking, storage.
- Experience with database systems (e.g., SQL, NoSQL) and how they fail / recover.
- Familiarity with application stacks, microservices, and distributed architectures.
- Proficiency in one or more languages used for automation or scripting (e.g., Python, Go, Java, or similar).
- Hands-on experience with tools such as Chaos Monkey, Gremlin, Chaos Mesh, LitmusChaos, Toxiproxy, AWS Fault Injection Simulator (FIS), Azure Chaos Studio, or similar.
- Strong skills in monitoring, metrics, logging, and tracing (e.g., OpenText SiteScope, Datadog).
- Experience integrating chaos testing into CI/CD pipelines and infrastructure-as-code workflows.
- Good understanding of security vulnerabilities and how fault injection might surface security risks.
- Familiarity with risk assessment, threat modeling, or security hardening practices.
- Ability to work across teams (DevOps, DBAs, Ops, Security) and communicate complex findings clearly.
- Strong documentation skills, with a proven ability to write detailed experiment designs, results, remediations, and technical playbooks.
- Strong analytical mindset, capable of interpreting results, identifying root causes, and recommending mitigations.
Education :
Bachelor’s degree in Computer Science, Engineering, or a related technical field.
- Certification or formal training in chaos engineering (or resilience engineering) is required.
- Chaos Engineering Fundamentals certificate is preferred.
- Certifications in SRE or DevOps (e.g., GitLab, Azure, Google Cloud, Kubernetes) are beneficial.
- Security certifications are a plus, as security vulnerability testing is part of chaos experiments.