Sr Staff Engineer - Availability and Incident Management Job, Palo Alto area, California, USA, IT/Tech

Overview

At GEICO, we offer a rewarding career where your ambitions are met with endless possibilities.

Every day we honor our iconic brand by offering quality coverage to millions of customers and being there when they need us most. We thrive through relentless innovation to exceed our customers’ expectations while making a real impact for our company through our shared purpose.

When you join our company, we want you to feel valued, supported and proud to work here. That’s why we offer The GEICO Pledge:
Great Company, Great Culture, Great Rewards and Great Careers.

Position Summary

GEICO is seeking an experienced Engineer with a passion for building high-performance, low-maintenance, zero-downtime platforms and applications. You will help drive our insurance business transformation as we transition from a traditional IT model to a tech organization with engineering excellence as its mission, while co-creating a culture of psychological safety and continuous improvement.

Position Description

The Senior Staff Engineer in Availability and Incident Management will engineer solutions and empower the engineering community with automated processes, data-driven insights, and technical tools that reduce incident recurrence, improve system reliability, and accelerate incident resolution. This role will be heavily centered around building automation platforms to streamline postmortem workflows, eliminate manual tracking, and provide fast feedback loops for incident prevention.

You will lead the strategy and execution of a technical roadmap that increases the velocity of incident resolution, reduces repeat incidents, and unlocks new reliability engineering capabilities. The ideal candidate has broad and deep technical knowledge in incident forensics, root cause analysis, automation platforms, distributed systems, observability, and data analytics.

Position Responsibilities

As a Senior Staff Engineer, you will:

Lead the strategy and execution for incident retrospective and correction of error (COE) processes across the engineering organization
Help conduct deep technical root cause analysis and incident forensics across distributed systems using observability data, logs, metrics, and traces
Establish continuous improvement loops through automated trend analysis, pattern recognition algorithms, and predictive analytics
Design, code, and deploy automation platforms and self-service tools using Python, Go, Java, or C# that scale incident retrospective workflows and eliminate manual tracking
Build production-grade data pipelines, analytics systems, and real-time dashboards to measure incident trends, COE effectiveness, and action item completion rates
Write code for workflow automation, integrations with observability platforms, and APIs that connect incident management tools across the engineering ecosystem
Leverage SQL and NoSQL databases to store, query, and analyze incident data at scale using Azure tools and cloud-native services
Develop and maintain systems that ensure rigorous follow-through on action items, remediation plans, and preventive measures with automated tracking
Partner with service engineering teams to implement preventive measures and architectural improvements based on incident patterns
Present data-driven insights and incident trend analysis to leadership and engineering teams to drive preventive action
Influence and educate leadership on incident patterns, prevention strategies, and reliability best practices
Mentor engineers on coding best practices and automation techniques, and strengthen technical expertise across the engineering community
Stay current with industry advances in SRE, observability, incident management, and automation; educate teams on emerging practices

Qualifications

Experience building automation platforms and self-service tools for workflow management, analytics, or engineering productivity
Fluency in at least two modern languages such as Python, Go, Java, C++, or C# including object-oriented design
Experience building microservices architectures, REST APIs, and distributed systems
Experience with data pipelines, analytics platforms, and visualization tools for operational metrics and KPIs
Experience with SQL and NoSQL databases (e.g., PostgreSQL, MongoDB, Cassandra, CosmosDB) for data storage and analytics
Experience with observability platforms (Prometheus, Grafana, Datadog, Splunk, ELK) and distributed systems monitoring, logging, and tracing
Experience with cloud providers (Azure, AWS, or GCP) and cloud-native architectures
Experience with CI/CD pipelines, infrastructure as code, and container orchestration (Kubernetes, Docker)
Experience writing workflow automation code (YAML pipelines, GitHub Actions, Azure DevOps pipelines)
Strong understanding of distributed systems architecture, design patterns, reliability, and scaling
Knowledge of retrospective facilitation, continuous improvement processes, and blameless culture principles
Strong architecture and design skills with ability to influence…

