Job Description & How to Apply Below
5+ years in observability, monitoring, or reliability engineering roles.
Hands-on experience with common observability tools such as Prometheus, Grafana, Splunk, and Coralogix, as well as external monitoring tools (e.g., Catchpoint, ThousandEyes).
Strong scripting skills in Python, plus Bash or PowerShell for automation.
Experience with Terraform and Ansible for infrastructure automation.
Solid understanding of SLIs, SLOs, error budgets, and reliability engineering principles.
Familiarity with Linux environments and distributed systems.
Design and implement a Universal Dashboard in Grafana for leadership and engineering visibility.
Ensure a consistent look and feel across all observability views.
Define and implement SLIs, SLOs, and error budgets for critical services.
Establish alerting thresholds and escalation workflows aligned with reliability goals.
Integrate anomaly detection and AI-assisted insights into the observability platform.
Contribute to self-healing workflows and automated remediation strategies.
Partner with engineering teams to instrument services with metrics, logs, and traces.
Provide documentation and best practices for observability adoption across teams.