Senior Engineer
Listed on 2026-03-05
-
IT/Tech
Systems Engineer, SRE/Site Reliability
Location:
Overland Park, KS / Atlanta, GA / Frisco, TX (Onsite)
Requirements
Qualifications:
- 4–9 years in SRE/DevOps/Systems Engineering as Senior or Principal Engineer
- Strong hands‑on experience with Kubernetes, container orchestration, and API management.
- Working knowledge of WAFs, network security, and database technologies (SQL/NoSQL).
- Proficient in automation and scripting (Python, Go, Ansible, Terraform, etc.)
- Strong observability/monitoring experience.
- Experience with CI/CD pipelines, GitOps, and infrastructure as code.
- Solid problem‑solving and collaboration skills.
Job responsibilities:
- Resolve escalated incidents across Kubernetes, API Proxy, WAF, DBs, and infra platforms.
- Design and improve runbooks, automating manual steps wherever possible.
- Lead and contribute to building self‑healing systems and self‑service tooling for users.
- Analyze incident trends, propose improvements in monitoring, capacity, and reliability.
- Collaborate with engineering teams on deployment, upgrades, and performance optimization.
- Conduct postmortems, document RCA, and ensure learning is captured.
- Mentor and coach L1 engineers.
Skills
Mandatory Skills (Must-Have)
- Advanced Incident Troubleshooting & Resolution
Expectation:
Diagnose and resolve escalated incidents that L1 cannot handle, often across multiple layers (infrastructure, application, network).
Example:
For an API outage, identify whether the root cause is in Kubernetes pod networking, an API gateway misconfiguration, or back-end DB latency, and apply fixes.
Expectation:
Comfortable with deployments, scaling, networking, and debugging cluster-level issues.
Example:
Troubleshoot why pods are pending by checking node capacity, taints/tolerations, and Cluster Autoscaler logs.
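As an illustration of the pending-pod example, here is a minimal Python sketch that reads the JSON produced by `kubectl get pod <name> -o json` and returns coarse hints from the `PodScheduled` condition. The keyword matching and the sample message are assumptions: scheduler message wording varies between Kubernetes versions, and Cluster Autoscaler logs are not covered here.

```python
# Hypothetical triage helper; the keyword list is a heuristic, not an official API.
KEYWORDS = [
    ("Insufficient", "node capacity"),
    ("taint", "taints/tolerations"),
]


def classify_pending(pod: dict) -> list:
    """Return coarse hints about why a Pending pod has not been scheduled."""
    status = pod.get("status", {})
    if status.get("phase") != "Pending":
        return []
    hints = []
    for cond in status.get("conditions", []):
        # Only an unsatisfied PodScheduled condition explains a scheduling failure.
        if cond.get("type") != "PodScheduled" or cond.get("status") != "False":
            continue
        message = cond.get("message", "")
        matched = [label for needle, label in KEYWORDS if needle in message]
        hints.extend(matched or ["other: " + message])
    return hints


# Sample input shaped like `kubectl get pod -o json` output (message text is illustrative).
SAMPLE_POD = {
    "status": {
        "phase": "Pending",
        "conditions": [
            {
                "type": "PodScheduled",
                "status": "False",
                "reason": "Unschedulable",
                "message": "0/3 nodes are available: 1 node(s) had untolerated "
                           "taint {dedicated: gpu}, 2 Insufficient cpu.",
            }
        ],
    }
}

if __name__ == "__main__":
    print(classify_pending(SAMPLE_POD))
```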
- Automation & Scripting (Python, Go, Bash, Ansible, Terraform)
Expectation:
Write scripts and automation to reduce manual toil, enhance monitoring, and improve incident resolution speed.
Example:
Develop a Python script to automatically collect pod and system logs when a service crashes.
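A script of this kind might look like the sketch below, which shells out to `kubectl logs` and saves the output to a timestamped file. The function names, the `crash-logs` directory, and the 60-second timeout are assumptions for illustration. Host-level system logs and the crash trigger itself are left out.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def build_log_command(pod: str, namespace: str, previous: bool = True) -> list:
    """Build the kubectl argv for fetching logs from every container in a pod."""
    cmd = ["kubectl", "logs", pod, "-n", namespace,
           "--all-containers=true", "--timestamps"]
    if previous:
        # --previous reads the last terminated container instance, which is
        # usually the one that crashed. kubectl reports an error if none exists.
        cmd.append("--previous")
    return cmd


def collect_logs(pod: str, namespace: str, out_dir: str = "crash-logs") -> Path:
    """Run kubectl and write stdout and stderr to a timestamped file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = out / f"{namespace}_{pod}_{stamp}.log"
    result = subprocess.run(build_log_command(pod, namespace),
                            capture_output=True, text=True, timeout=60)
    # stderr is kept so that a failed collection is still visible in the file.
    target.write_text(result.stdout + result.stderr)
    return target


if __name__ == "__main__":
    # Pod and namespace names here are placeholders.
    print(collect_logs("api-0", "prod"))
```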
- Observability & Monitoring Tooling
Expectation:
Deep understanding of monitoring, alerting, tracing, and logging systems.
Example:
Build Prometheus alert rules to detect DB query spikes; configure Grafana dashboards for API latency.
- CI/CD & Infrastructure as Code (IaC)
Expectation:
Familiarity with GitOps workflows, CI/CD pipelines, and infrastructure provisioning.
Example:
Enhance Jenkins pipeline to add automated smoke tests before promoting Kubernetes deployments.
- Database Troubleshooting (SQL & NoSQL)
Expectation:
Identify performance bottlenecks, connection issues, and basic tuning opportunities.
Example:
Run queries to detect slow-running SQL statements causing latency in an application.
Expectation:
Act as incident commander for escalated issues, lead bridge calls, and produce Root Cause Analyses.
Example:
After a WAF misconfiguration causes downtime, lead the investigation, document the timeline, and propose preventive actions.
Expectation:
Coach L1 engineers, refine runbooks, and introduce new automated workflows.
Example:
Update a runbook to add automated Kubernetes log collection instead of manual steps.
Preferred Skills (Nice-to-Have)
Expectation:
Hands-on skills in provisioning, scaling, and securing cloud workloads.
Example:
Diagnose why an AWS ALB is misrouting traffic after a deployment.
- Security & WAF Management
Expectation:
Understand WAF rules, common attacks (SQL injection, XSS), and how to apply fixes.
Example:
Investigate false positives in WAF logs and adjust rule sets with security teams.
- Capacity & Performance Engineering
Expectation:
Anticipate scaling needs, tune resource utilization, and propose optimizations.
Example:
Identify that a Kubernetes deployment is CPU‑throttled and adjust HPA (Horizontal Pod Autoscaler) configs.
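The arithmetic behind this example can be sketched in Python. The throttle ratio uses the `nr_throttled` and `nr_periods` counters from a container's cgroup `cpu.stat`. The replica calculation follows the basic scaling formula in the Kubernetes HPA documentation, `ceil(currentReplicas * currentMetric / desiredMetric)`. The function names are hypothetical, and the sketch ignores the HPA tolerance band and min/max replica bounds.

```python
import math


def throttle_ratio(nr_throttled: int, nr_periods: int) -> float:
    """Fraction of CFS periods in which the container was throttled."""
    return nr_throttled / nr_periods if nr_periods else 0.0


def desired_replicas(current_replicas: int, current_util: float,
                     target_util: float) -> int:
    """Basic HPA scaling formula, without tolerance or min/max clamping."""
    return math.ceil(current_replicas * current_util / target_util)


if __name__ == "__main__":
    # Illustrative numbers: throttled in 250 of 1000 periods, and 4 replicas
    # running at 90% CPU utilization against a 60% target.
    print(throttle_ratio(250, 1000))
    print(desired_replicas(4, 90, 60))
```

HPA CPU utilization is measured against requests, while throttling is driven by limits, so a throttled deployment may need its limits reviewed as well as its HPA target.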
- Automation Platform Integration (AIOps, ChatOps)
Expectation:
Integrate AI/ML-powered tools for anomaly detection and auto‑remediation.
Example:
Implement a ChatOps bot that runs predefined Kubernetes troubleshooting commands in Slack.
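The core of such a bot is usually an allowlist that maps chat text to fixed commands. The sketch below shows that piece only, in Python, and deliberately leaves out the Slack integration. The action names and the `resolve` helper are assumptions for illustration.

```python
import re
from typing import Optional

# Hypothetical allowlist: chat keyword -> read-only kubectl argv template.
ALLOWED = {
    "pods": ["kubectl", "get", "pods", "-n", "{ns}"],
    "events": ["kubectl", "get", "events", "-n", "{ns}",
               "--sort-by=.lastTimestamp"],
    "top": ["kubectl", "top", "pods", "-n", "{ns}"],
}

# Namespaces are DNS labels: lowercase alphanumerics and hyphens.
NAMESPACE_RE = re.compile(r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?$")


def resolve(message: str) -> Optional[list]:
    """Translate '<action> <namespace>' into an argv list, or None if rejected."""
    parts = message.split()
    if len(parts) != 2:
        return None
    action, namespace = parts
    if action not in ALLOWED or not NAMESPACE_RE.match(namespace):
        return None
    return [arg.format(ns=namespace) for arg in ALLOWED[action]]


if __name__ == "__main__":
    print(resolve("pods payments"))
    print(resolve("delete payments"))
```

Returning an argv list rather than a shell string means the result can be passed to `subprocess.run` without a shell, so user-supplied text is never interpreted as shell syntax.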
Expectation:
Experience supporting both on‑prem and cloud environments seamlessly.
Example:
Compare latency patterns between on‑prem DBs and cloud‑hosted APIs to identify bottlenecks.
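One simple way to make such a comparison is to compute the same percentile for both environments. The Python sketch below uses a nearest-rank percentile. The function names and the sample latencies are made up for illustration and do not come from any real system.

```python
import math


def percentile(samples, pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of latency samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def compare(on_prem_ms, cloud_ms, pct: float = 95) -> dict:
    """Report the chosen percentile for each side and which side is slower."""
    a = percentile(on_prem_ms, pct)
    b = percentile(cloud_ms, pct)
    slower = "on_prem" if a > b else "cloud" if b > a else "tie"
    return {"on_prem": a, "cloud": b, "slower": slower}


if __name__ == "__main__":
    # Illustrative latencies in milliseconds.
    print(compare([12, 15, 11, 40], [30, 32, 35, 31], pct=50))
```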
Qualifications:
- 7+ years in SRE/DevOps/Systems Engineering as Senior or Principal Engineer
- Strong hands‑on experience with Kubernetes, container orchestration, and API management.
- Working knowledge of WAFs, network security, and database technologies (SQL/NoSQL).
- Proficient in automation and scripting (Python, Go, Ansible, Terraform, etc.)
- Strong observability/monitoring experience.
- Experience with CI/CD pipelines, GitOps, and infrastructure as code.
- Solid problem‑solving and collaboration skills.