Cloud Operations Lead and SRE Manager
Listed on 2026-02-28
IT/Tech
Systems Engineer, Cloud Computing, IT Project Manager, Cybersecurity
Overview
Empower AI is AI for government. Empower AI gives federal agency leaders the tools to elevate the potential of their workforce with a direct path for meaningful transformation. Headquartered in Reston, Va., Empower AI leverages three decades of experience solving complex challenges in Health, Defense, and Civilian missions. Our proven Empower AI Platform® provides a practical, sustainable path for clients to achieve transformation that is true to who they are, what they do, how they work, with the resources they have.
The result is a government workforce that is exponentially more creative and productive. For more information, visit www.Empower.ai.
Empower AI is proud to be recognized as a 2024 Military Friendly Employer by Viqtory, the publisher of G.I. Jobs. This designation reflects the company’s commitment to hiring and supporting active-duty and veteran employees.
Responsibilities
The Cloud Operations Lead / SRE Manager (Cloud/SRE Mgr) provides enterprise-level operational management of cloud operations and Site Reliability Engineering (SRE) leadership for the Department of Homeland Security (DHS), U.S. Citizenship and Immigration Services (USCIS) information technology (IT) infrastructure. USCIS has over 27,000 Government employees and contractors working at over 250 offices worldwide.
The USCIS Enterprise Infrastructure Division (EID) of the Office of Information Technology (OIT) provides IT infrastructure engineering, design, testing, implementation, and operational support services for all USCIS enterprise components, including networks, server rooms, data storage, telecommunications, video conferencing services, and infrastructure security. The Cloud/SRE Mgr directly supports EID to coordinate, direct, manage, and oversee the design, development, integration, standards, operation, and maintenance of cloud operations and SRE of the enterprise IT infrastructure that supports USCIS operations.
The Cloud/SRE Mgr shall oversee the Cloud Operation Team (est. 5 technicians) responsible for executing cloud operations of the USCIS IT infrastructure. This position is responsible for the reliability, availability, and operational excellence of USCIS cloud platforms. The role combines hands-on technical leadership with people management. The Cloud/SRE Mgr will also apply Site Reliability Engineering (SRE) principles to ensure highly available, secure, and compliant production systems.
The ideal candidate brings a strong background in cloud infrastructure, automation, and DevOps, paired with proven experience leading operational teams, managing incidents, and driving reliability at scale in regulated environments.
Overall Responsibilities:
- Own the reliability, availability, and performance of production cloud platforms and services.
- Define, monitor, and improve Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets for critical systems.
- Lead incident response, including coordination during outages, root cause analysis, and blameless postmortems.
- Establish and manage on-call rotations, escalation paths, and operational readiness standards.
- Drive continuous reduction of operational toil through automation and process improvement.
Cloud Architecture & Platform Engineering
- Design and operate secure, scalable, and highly available AWS infrastructure, including multi-AZ and multi-region architectures.
- Ensure platforms are resilient, fault-tolerant, and aligned with disaster recovery and business continuity requirements.
- Partner with application teams to ensure production readiness and reliability by design.
Security & Compliance
- Implement and enforce cloud security best practices, including IAM, encryption, logging, and audit controls.
- Ensure compliance with government and regulatory frameworks such as FedRAMP.
- Collaborate closely with security and compliance stakeholders to meet accreditation and audit requirements.
Automation, DevOps & Observability
- Lead development of infrastructure-as-code (IaC) using Terraform and/or AWS CloudFormation.
- Build and maintain CI/CD pipelines supporting reliable, repeatable deployments.
- Design and operate monitoring, alerting, logging, and observability solutions to ensure actionable insights and reduce alert fatigue.
Team Leadership & Management
- Lead, mentor, and develop a team of Cloud / SRE engineers.
- Support hiring, onboarding, performance feedback, and career growth.
- Set technical direction, operational priorities, and reliability goals for the team.
- Foster a culture of ownership, learning, and continuous improvement.
Collaboration & Communication
- Partner with development, security, compliance, and business stakeholders to align reliability goals with delivery timelines.
- Communicate reliability risks, incident outcomes, and improvement plans to senior leadership.
- Produce and maintain clear operational documentation, runbooks, and architectural standards.
- Bachelor’s degree in Computer Science, Engineering, or…