Senior Site Reliability Engineer - Observability
Listed on 2026-03-02
IT/Tech
Systems Engineer, Cloud Computing, IT Support, SRE/Site Reliability
Job Description
About the Role: We are looking for a Senior SRE to join our Platform Engineering team as the operations owner of our observability platforms.
You’ll be responsible for the reliability, scalability, and continued evolution of the tools that give our engineering organization visibility into everything they build and run. The current observability platform consists primarily of the on-premises ELK (Elasticsearch, Logstash, Kibana) Stack and Grafana, with some exposure to New Relic and SolarWinds.
This is a hybrid role: roughly half your time will be spent on steady-state operations and platform support, and the other half on engineering projects that meaningfully advance the platforms you support. It’s a great fit for someone who is genuinely motivated by the pursuit of excellence – not just sustaining what works but relentlessly refining it.
You take pride in the platforms you own, and that pride drives you to keep improving them, whether that means tightening an SLO, eliminating a source of toil, or building something that gives teams faster insight into their systems.
Operations & Reliability (~ 50%)
- Serve as a primary escalation point for production support involving the ELK Stack, Grafana, and New Relic
- Own platform health, capacity planning, and performance tuning for on-premises observability infrastructure – including Elasticsearch cluster management, index lifecycle policies, and retention strategies
- Monitor and maintain SLOs for the observability platforms, ensuring the tools engineers depend on are highly available and performant
- Support engineering teams in onboarding to observability platforms – helping teams instrument their applications, build dashboards, and define meaningful alerts
- Manage patching, upgrades, and configuration management across the observability stack
- Collaborate with security to harden platform configurations and manage software vulnerabilities
- Contribute to on-call rotations and maintain runbooks and escalation procedures
Platform Engineering (~ 50%)
- Design and build tooling/automation to reduce toil and improve the experience for teams using observability platforms
- Lead or contribute to platform modernization initiatives – e.g., improving ingestion pipelines, scaling platform capacity, standardizing Grafana dashboard and alerting patterns, or evaluating new capabilities within the existing stack
- Develop and maintain infrastructure-as-code (Terraform, Helm, Ansible, etc.) for platform components
- Build and enforce standards around logging, metrics, and alerting that help engineering teams adopt observability best practices at scale
- Participate in design reviews and contribute to the overall platform roadmap
Qualifications
- Bachelor’s degree in a technical field or equivalent practical experience
- 5+ years of experience in SRE, DevOps, or platform engineering roles
- Deep hands-on experience with the ELK Stack – Elasticsearch cluster operations, Logstash pipeline development, Kibana, and index lifecycle management
- Strong experience with Grafana, including data source integrations, dashboard design, and alerting
- Solid understanding of observability principles
- Experience operating on-premises infrastructure, including capacity planning, server management, and the operational tradeoffs with managed cloud services
- Proficiency in Python for automation and tooling; familiarity with shell scripting
- Strong Linux systems knowledge and comfort working with configuration management tools (e.g., Ansible, Chef, Puppet)
- Demonstrated ability to drive incidents to resolution and communicate clearly under pressure
- A bias toward automation and a low tolerance for repetitive manual work
Preferred Qualifications
- Experience with Prometheus
- Experience with New Relic administration or APM instrumentation
- Familiarity with log shipping agents and pipeline tools such as Beats, Fluentd, or Fluent Bit
- Experience with distributed tracing tools like OpenTelemetry
- Exposure to cloud-based observability offerings and experience thinking through hybrid strategies
- Prior experience building or governing observability standards across a large engineering organization
Dimensional offers a variety of programs to help take care of you, your…