Sr Staff Engineer - Availability and Incident Management
Listed on 2026-01-25
IT/Tech
Systems Engineer, Cloud Computing, IT Support, SRE/Site Reliability
Overview
At GEICO, we offer a rewarding career where your ambitions are met with endless possibilities.
Every day we honor our iconic brand by offering quality coverage to millions of customers and being there when they need us most. We thrive through relentless innovation to exceed our customers’ expectations while making a real impact for our company through our shared purpose.
When you join our company, we want you to feel valued, supported and proud to work here. That’s why we offer The GEICO Pledge:
Great Company, Great Culture, Great Rewards and Great Careers.
GEICO is seeking an experienced Engineer with a passion for building high-performance, low-maintenance, zero-downtime platforms and applications. You will help drive our insurance business transformation as we transition from a traditional IT model to a tech organization with engineering excellence as its mission, while co-creating a culture of psychological safety and continuous improvement.
Position Description
The Senior Staff Engineer in Availability and Incident Management will engineer solutions and empower the engineering community with automated processes, data-driven insights, and technical tools that reduce incident recurrence, improve system reliability, and accelerate incident resolution. This role will be heavily centered around building automation platforms to streamline postmortem workflows, eliminate manual tracking, and provide fast feedback loops for incident prevention.
You will lead the strategy and execution of a technical roadmap that increases the velocity of incident resolution, reduces repeat incidents, and unlocks new reliability engineering capabilities. The ideal candidate has broad and deep technical knowledge in incident forensics, root cause analysis, automation platforms, distributed systems, observability, and data analytics.
As a Senior Staff Engineer, you will:
- Lead the strategy and execution for incident retrospective and correction of error (COE) processes across the engineering organization
- Help conduct deep technical root cause analysis and incident forensics across distributed systems using observability data, logs, metrics, and traces
- Establish continuous improvement loops through automated trend analysis, pattern recognition algorithms, and predictive analytics
- Design, code, and deploy automation platforms and self-service tools using Python, Go, Java, or C# that scale incident retrospective workflows and eliminate manual tracking
- Build production-grade data pipelines, analytics systems, and real-time dashboards to measure incident trends, COE effectiveness, and action item completion rates
- Write code for workflow automation, integrations with observability platforms, and APIs that connect incident management tools across the engineering ecosystem
- Leverage SQL and NoSQL databases to store, query, and analyze incident data at scale using Azure tools and cloud-native services
- Develop and maintain systems that ensure rigorous follow-through on action items, remediation plans, and preventive measures with automated tracking
- Partner with service engineering teams to implement preventive measures and architectural improvements based on incident patterns
- Present data-driven insights and incident trend analysis to leadership and engineering teams to drive preventive action
- Influence and educate leadership on incident patterns, prevention strategies, and reliability best practices
- Mentor engineers on coding best practices and automation techniques, and strengthen technical expertise across the engineering community
- Stay current with industry advances in SRE, observability, incident management, and automation; educate teams on emerging practices
Qualifications
- Experience building automation platforms and self-service tools for workflow management, analytics, or engineering productivity
- Fluency in at least two modern languages such as Python, Go, Java, C++, or C#, including object-oriented design
- Experience building microservices architectures, REST APIs, and distributed systems
- Experience with data pipelines, analytics platforms, and visualization tools for operational metrics and KPIs
- Experience with SQL and NoSQL databases (e.g., PostgreSQL, MongoDB, Cassandra, Cosmos DB) for data storage and analytics
- Experience with observability platforms (Prometheus, Grafana, Datadog, Splunk, ELK) and distributed systems monitoring, logging, and tracing
- Experience with cloud providers (Azure, AWS, or GCP) and cloud-native architectures
- Experience with CI/CD pipelines, infrastructure as code, and container orchestration (Kubernetes, Docker)
- Experience writing workflow automation code (YAML pipelines, GitHub Actions, Azure DevOps pipelines)
- Strong understanding of distributed systems architecture, design patterns, reliability, and scaling
- Knowledge of retrospective facilitation, continuous improvement processes, and blameless culture principles
- Strong architecture and design skills with ability to influence…