SRE DevOps Engineer
Listed on 2026-02-28
IT/Tech
Systems Engineer, SRE/Site Reliability
SRE DevOps Engineer
Location:
Overland Park, KS / Atlanta, GA / Frisco, TX (Onsite)
- 4–9 years of experience in SRE/DevOps/Systems Engineering as a Senior or Principal Engineer.
- Strong hands‑on experience with Kubernetes, container orchestration, and API management.
- Working knowledge of WAFs, network security, and database technologies (SQL/NoSQL).
- Proficient in automation and scripting (Python, Go, Ansible, Terraform, etc.).
- Strong observability/monitoring experience.
- Experience with CI/CD pipelines, GitOps, and infrastructure as code.
- Solid problem‑solving and collaboration skills.
- Resolve escalated incidents across Kubernetes, API Proxy, WAF, DBs, and infra platforms.
- Design and improve runbooks, automating manual steps wherever possible.
- Lead and contribute to building self‑healing systems and self‑service tooling for users.
- Analyze incident trends and propose improvements in monitoring, capacity, and reliability.
- Collaborate with engineering teams on deployment, upgrades, and performance optimization.
- Conduct post‑mortems, document RCA, and ensure learning is captured.
- Mentor and coach L1 engineers.
Expectation:
Diagnose and resolve escalated incidents that L1 cannot handle, often across multiple layers (infrastructure, application, network).
Example:
For an API outage, identify whether the root cause lies in Kubernetes pod networking, an API gateway misconfiguration, or backend DB latency, then apply the fix.
Expectation:
Comfortable with deployments, scaling, networking, and debugging cluster‑level issues.
Example:
Troubleshoot why pods are pending by checking node capacity, taints/tolerations, and cluster autoscaler logs.
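As an illustration only, the triage described in this example might start from the scheduler's FailedScheduling event message. The sketch below is a minimal one: the function name `pending_hints` is hypothetical, and the substrings it matches ("Insufficient", "taint") are assumptions about typical scheduler wording that can differ between Kubernetes versions.

```python
def pending_hints(message):
    """Map a FailedScheduling event message to the checks named in the example.

    The matched substrings are assumptions about common scheduler wording.
    """
    lowered = message.lower()
    hints = []
    if "insufficient" in lowered:
        # Not enough allocatable CPU/memory on any node: check capacity first,
        # then see whether the cluster autoscaler tried to add a node.
        hints.append("node capacity")
        hints.append("cluster autoscaler logs")
    if "taint" in lowered:
        # A node taint is not matched by any toleration on the pod.
        hints.append("taints/tolerations")
    return hints or ["unclassified"]
```

The message itself would normally come from `kubectl describe pod <name>`, under Events.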
Expectation:
Write scripts and automation to reduce manual toil, enhance monitoring, and improve incident resolution speed.
Example:
Develop a Python script to automatically collect pod and system logs when a service crashes.
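A minimal sketch of such a script, covering pod logs only (host-level system logs are out of scope here). It assumes `kubectl` is installed and already authenticated against the cluster. The function names and the output file naming are hypothetical, chosen for illustration.

```python
import json
import subprocess


def crashed_containers(pod_list):
    """Return (pod, container) pairs whose previous run exited non-zero."""
    found = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        for status in pod.get("status", {}).get("containerStatuses", []):
            terminated = status.get("lastState", {}).get("terminated")
            if terminated and terminated.get("exitCode", 0) != 0:
                found.append((name, status["name"]))
    return found


def logs_command(namespace, pod, container):
    """Build the kubectl command that fetches logs from the previous run."""
    return ["kubectl", "logs", pod, "-n", namespace, "-c", container, "--previous"]


def collect(namespace, out_dir="."):
    """Write one log file per crashed container (requires cluster access)."""
    raw = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    for pod, container in crashed_containers(json.loads(raw)):
        result = subprocess.run(
            logs_command(namespace, pod, container),
            capture_output=True, text=True,
        )
        with open(f"{out_dir}/{pod}_{container}.log", "w") as fh:
            fh.write(result.stdout)
```

Calling `collect("prod")`, where `prod` is a placeholder namespace, would write one file per crashed container into the current directory.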
Expectation:
Deep understanding of monitoring, alerting, tracing, and logging systems.
Example:
Build Prometheus alert rules to detect DB query spikes; configure Grafana dashboards for API latency.
Expectation:
Familiarity with GitOps workflows, CI/CD pipelines, and infrastructure provisioning.
Example:
Enhance a Jenkins pipeline to add automated smoke tests before promoting Kubernetes deployments.
Expectation:
Identify performance bottlenecks, connection issues, and basic tuning opportunities.
Example:
Run queries to detect slow‑running SQL statements causing latency in an application.
Expectation:
Act as incident commander for escalated issues, lead bridge calls, and produce Root Cause Analyses.
Example:
After a WAF misconfiguration causes downtime, lead the investigation, document the timeline, and propose preventive actions.
Expectation:
Coach L1 engineers, refine runbooks, and introduce new automated workflows.
Example:
Update a runbook to add automated Kubernetes log collection instead of manual steps.
Expectation:
Hands‑on skills in provisioning, scaling, and securing cloud workloads.
Example:
Diagnose why an AWS ALB is misrouting traffic after a deployment.
Expectation:
Understand WAF rules, common attacks (SQL injection, XSS), and how to apply fixes.
Example:
Investigate false positives in WAF logs and adjust rule sets with security teams.
Expectation:
Anticipate scaling needs, tune resource utilization, and propose optimizations.
Example:
Identify that a Kubernetes deployment is CPU‑throttled and adjust HPA (Horizontal Pod Autoscaler) configs.
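For illustration, the two quantities behind this example can be sketched as follows. The first function reads the `nr_periods` and `nr_throttled` counters from a container's cgroup `cpu.stat` file; the second applies the scaling rule documented for the Horizontal Pod Autoscaler, ceil(currentReplicas × currentMetric / desiredMetric), without its tolerance band. Function names are hypothetical. Note that throttling is driven by CPU limits, whereas HPA CPU utilization is measured against requests, so both settings may need review.

```python
import math


def throttle_ratio(cpu_stat_text):
    """Fraction of CFS scheduling periods in which the container was throttled."""
    stats = dict(line.split() for line in cpu_stat_text.strip().splitlines())
    periods = int(stats.get("nr_periods", 0))
    return int(stats.get("nr_throttled", 0)) / periods if periods else 0.0


def desired_replicas(current_replicas, current_utilization, target_utilization):
    """HPA rule: ceil(currentReplicas * currentMetric / desiredMetric)."""
    return math.ceil(current_replicas * current_utilization / target_utilization)
```

For example, four replicas running at 90% utilization against a 60% target give `desired_replicas(4, 90, 60)`, which evaluates to 6.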
Expectation:
Integrate AI/ML‑powered tools for anomaly detection and auto‑remediation.
Example:
Implement a ChatOps bot that runs predefined Kubernetes troubleshooting commands in Slack.
Expectation:
Experience supporting both on‑prem and cloud environments seamlessly.
Example:
Compare latency patterns between on‑prem DBs and cloud‑hosted APIs to identify bottlenecks.