QA Engineer
Job in Santa Clara, Santa Clara County, California, 95053, USA
Listed on 2026-01-17
Listing for: InfraCloud Technologies
Full Time position
Job specializations:
- IT/Tech: SRE/Site Reliability, IT Support
Job Description & How to Apply Below
Responsibilities
- Test product‑specific use cases and validate end‑to‑end alerting workflows across monitoring systems.
- Simulate incidents and test scenarios that trigger alerts in tools like Datadog, Prometheus, or similar monitoring platforms.
- Verify that alerts raised in monitoring tools are correctly consumed and acted upon by downstream systems or automated workflows.
- Understand alert rules so test cases are easier to design, execute, debug, and maintain (alert configuration will be handled by Developers/SREs, but QA must understand them).
- Collaborate closely with engineering teams (Developers, SREs/DevOps) to improve detection, investigation, and automated incident response.
- Analyse alert behaviour, validate incident pipelines, and ensure seamless integration across all monitoring and automation tools.
- Identify gaps in monitoring, logging, and alert workflows and provide clear, actionable QA feedback.
- Document test scenarios, alert behaviour, and monitoring workflows in a clear and reproducible manner.
- Monitoring Tools Expertise: Hands‑on experience with at least one major monitoring system (Datadog or Prometheus), including working with alerts, dashboards, and troubleshooting.
- Alert Simulation and Validation: Ability to trigger, simulate, and validate alert events end‑to‑end.
- Incident Workflow Understanding: Strong understanding of how alerts propagate through monitoring systems and how automated systems respond to them.
- Automation Mindset: Ability to use or write simple scripts (Python, Shell, etc.) to simulate workloads or events that trigger alerts.
- Communication and Problem Solving: Ability to collaborate effectively with Developers and SRE/DevOps teams to ensure monitoring accuracy.
- Experience with automated incident investigation or remediation tools.
- Familiarity with CI/CD pipelines and integrating monitoring validation into pipelines.
- Understanding of observability fundamentals, metrics, logs, and traces.
- Exposure to infrastructure or SRE environments.
- Basic knowledge of Kubernetes, Docker, or cloud platforms (AWS/GCP/Azure).