Site Reliability Engineer Job, Herndon area, Virginia, USA, IT/Tech

By joining the National Student Clearinghouse, you can be sure that the work you do now will help shape the future of education and the workforce in the U.S. As the trusted source for higher education data since 1993, the Clearinghouse is the leading provider of transcript and data exchange services, automated enrollment and degree verifications, learner insights and research, and compliance solutions for schools, businesses, and learners nationwide.

As a 501(c)(3) nonprofit organization, the Clearinghouse works with nearly 3,600 postsecondary institutions to meet their compliance needs and with thousands of high schools and districts to provide continuing collegiate enrollment, progression, and completion statistics about their alumni. In addition, the Research Center publications inform policymakers and business leaders about student educational pathways. Because we rely on a unique combination of data, analytics, and software to drive our mission, security and privacy are paramount.

Join us as we continue to invest in our talent and new advanced technologies to unlock the power of data on behalf of all learners.

About the Role:

The Site Reliability Engineer role exists to ensure the reliability, scalability, and performance of an organization’s systems and services. This position bridges software engineering and IT operations, applying automation, monitoring, and proactive incident management to maintain highly available and resilient platforms. SREs design and implement solutions that reduce operational toil, improve system efficiency, and enable development teams to deliver features quickly without compromising stability.

By focusing on service availability, performance optimization, and continuous improvement, the SRE function is critical to sustaining customer trust and supporting business growth.

This position operates with a high degree of autonomy, making independent decisions on system reliability, performance optimization, and incident response within established service-level objectives. The role requires discretion in prioritizing tasks, implementing automation, and resolving complex technical issues to maintain system stability and meet business continuity goals.

Currently, this is a remote-first position. Periodic on-site work at our office may be required, with the frequency depending on the department's or division's requirements. Candidates must therefore either reside within a reasonable commuting distance of our office in Herndon or be willing to travel there when required.

How You Contribute:

Demonstrate the Clearinghouse's core competencies:
Customer Focus, Optimizes Work Processes, Collaborates, Communicates Effectively, and Be Open and Authentic.
Reliability Engineering & SLOs:
Define SLIs/SLOs and manage error budgets; drive reliability reviews and continuous improvement to protect customer experience.
Observability & Monitoring:
Build and operate end-to-end observability (metrics, traces, logs, synthetics, dashboards, alerting), leveraging tools such as Datadog; tune alerts for actionability and reduce noise.
Incident Management:
Participate in and help coordinate incident response and on-call rotations; lead blameless post-incident reviews, root-cause analysis, and corrective action tracking.
Automation & CI/CD:
Partner with engineering to automate build, test, deploy, and release processes (e.g., GitLab CI) and promote progressive delivery, change safety, and rollback strategies.
Infrastructure as Code & Cloud:
Provision and manage cloud infrastructure with Terraform/CloudFormation on AWS/Azure/GCP; enforce configuration baselines and drift detection.
Containers & Orchestration:
Operate containerized workloads at scale (Kubernetes, Helm); apply release strategies such as blue/green and canary deployments.
Performance & Capacity:
Conduct performance testing and tuning; lead capacity planning and cost-aware scaling.
Security & Compliance:
Embed security into pipelines and environments (e.g., IAM guardrails, policy-as-code, audit logging, vulnerability management, Wiz exposure where applicable) in partnership with DevSecOps.
Runbooks & Documentation:
Create and maintain runbooks, operational SOPs, and service catalogs; promote knowledge sharing and operational readiness across teams.
Collaboration:
Work across engineering, infrastructure, DevSecOps, security, and product to deliver reliable, scalable services; communicate clearly with technical and non-technical stakeholders.
Continuous Improvement:
Identify toil, propose experiments (e.g., chaos testing, game days), and automate repetitive operations to improve MTTR and deployment safety (DORA metrics awareness).
Perform other duties as required.

These essential functions are representative of those that must be met by an employee to successfully perform the job. Reasonable accommodations will be made to enable individuals with disabilities to perform these essential functions.

What You…

