EOP - System Reliability Engineer - TS/SCI
Remote / Online - Candidates ideally in
Washington, District of Columbia, 20022, USA
Listed on 2026-03-06
Listing for:
cFocus Software Incorporated
Remote/Work from Home position
Job specializations:
- IT/Tech: Systems Engineer, Cloud Computing
Job Description & How to Apply Below
cFocus Software seeks a System Reliability Engineer to join our program supporting the Executive Office of the President. This position is remote. This position requires a TS/SCI clearance.
Qualifications:
- 5+ years of experience and a Bachelor's degree in Computer Programming, Science, Engineering, or a related technical discipline, or the equivalent combination of education, technical training, or work/military experience, including:
- 3+ years of related systems programming experience
- Experience maintaining an operational environment and using monitoring tools and dashboard interfaces (e.g., Kibana, Grafana)
- Experience working with container images and platforms (Kubernetes/Docker)
- Strong understanding of DevOps and software/application development processes
- Understanding of GitLab, Jenkins, ArgoCD, and other DevOps/Continuous Integration tools for Kubernetes
- Understanding of microservice design and architectural pattern best practices
- Understanding of Python, Bash, and Shell scripting
- Knowledge of network technologies, common infrastructure components, load balancers, firewalls, and virtual and physical infrastructure design
- Problem-solving and troubleshooting skills
- Communication and interpersonal skills
- Must possess excellent time management skills and the drive to work unsupervised
- Experience deploying to on-prem/data center infrastructure
- Experience using Jira and Confluence on a daily basis
- Experience building processes for deploying to a Kubernetes-based environment using GitLab and Helm
- Understanding of access management and security groups (e.g., IAM, S3 buckets, SSH, VPN)
- Ability to write and use unit and functional tests
- Technical Skills: Proficiency in programming languages (such as Python, Go, or Bash) is essential for scripting and automation tasks. Knowledge of Linux/Unix systems is also crucial, as SREs often work in these environments.
- Problem-Solving: Analytical and problem-solving skills are necessary to diagnose and resolve complex system issues effectively.
- Understanding of SRE Principles: Familiarity with key SRE concepts such as Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets is important for measuring and maintaining system reliability.
- Reliability and Availability: SRE practices help ensure that services are consistently available and reliable, which is critical for user satisfaction and business success.
- Scalability: SREs implement strategies that allow systems to scale efficiently as demand increases, ensuring that performance remains optimal even under heavy load.
- Cost Management: By optimizing resource usage and reducing downtime, SREs contribute to cost savings for organizations.
- Programming and Scripting: Proficiency in languages like Python, Go, or Ruby is crucial for automating tasks and managing infrastructure.
- Operating Systems: A strong understanding of Linux/Unix systems is essential for troubleshooting and managing servers.
- Cloud Computing: Familiarity with cloud platforms like AWS, Azure, or Google Cloud is vital for deploying and managing applications in distributed environments.
- Containers & Orchestration: Understanding containerization tools like Docker and managing containerized workloads with Kubernetes is crucial for cloud-native applications.
- Monitoring and Logging: Proficiency in tools like Prometheus, Grafana, or the Elasticsearch, Logstash, and Kibana (ELK) Stack is necessary for tracking metrics, setting up alerts, and analyzing logs.
- Networking: Knowledge of networking protocols and configurations is essential for maintaining system health and performance.
- Configuration Management: Skills in managing and maintaining system configurations are critical for ensuring system reliability.
- Incident Response: Ability to respond quickly and effectively to incidents, including documenting and learning from them.
- Security Best Practices: Understanding of security protocols and best practices to protect systems from vulnerabilities.
- These skills are essential for SREs to maintain high availability and performance, balancing the demands of development and operations.
- Support required during core business hours of 8am – 5pm, Monday through Friday.
- On-call for…