Site Reliability Engineer Job, Washington area, District of Columbia, USA, IT/Tech

Position: 154935 Site Reliability Engineer

Seeking a Site Reliability Engineer for a high-impact role with a premier client based in Washington, DC. In this position, you will bridge the gap between development and operations by applying a software engineering mindset to system administration and infrastructure. You will be responsible for ensuring the scalability, performance, and high availability of cloud-based services across AWS and Azure environments. By leveraging Infrastructure-as-Code, advanced observability with Dynatrace, and SRE principles like error budgets and SLOs, you will drive operational excellence and lead incident response efforts for mission-critical applications.

Key Responsibilities

Deployment & Automation:
Architect and manage CI/CD pipelines (GitHub Actions, AWS CodePipeline) and automate global infrastructure using Terraform, CloudFormation, or CDK.
Performance & Capacity:
Drive cost-optimization initiatives, manage auto-scaling thresholds, and execute resiliency/performance testing to ensure system durability.
Incident Management:
Act as a primary on-call responder using ITIL frameworks and ServiceNow; develop Root Cause Analysis (RCA) documentation and maintain knowledge bases.
Observability & Monitoring:
Implement distributed tracing and optimize monitoring via Dynatrace and Kibana, building advanced dashboards and anomaly detection.
Reliability Engineering:
Define and monitor SLIs and SLOs while managing error budgets to balance feature velocity with system stability.
Security & Compliance:
Oversee service accounts, manage digital certificates, and execute rapid remediation for security incidents.

Qualifications

Education:

Bachelor's degree in Computer Science, Engineering, or a related technical field.
Experience:

2 to 4 years of professional experience in SRE, DevOps, or Infrastructure roles.
Cloud Proficiency:
Practical, hands-on experience with both AWS and Azure platforms.
Technical Skills:

Mid-level proficiency in Python (or similar scripting languages) and configuration management tools like Ansible.
Containerization:
Solid understanding of Docker and orchestration via Kubernetes or ECS.
Infrastructure Fundamentals:
Strong knowledge of Linux systems, networking protocols, and both relational and NoSQL database architectures.
Soft Skills:

Excellent written and verbal communication skills with the ability to manage competing priorities independently.
Flexibility:
Ability to participate in a production on-call rotation, including work outside standard business hours.
