Research Scientist, HPC Workflows
Listed on 2026-03-12
IT/Tech
Cybersecurity, Cloud Computing
Requisition Id: 16060
Overview
Oak Ridge National Laboratory (ORNL), home to some of the world’s most powerful supercomputers, is seeking a Research Scientist in HPC Workflows to design, orchestrate, and maintain computational workflows that enable reproducible, scalable science on leadership-class systems. You will collaborate with researchers across diverse domains to translate scientific objectives into robust pipelines, automate job orchestration and data movement, and optimize end-to-end workflow performance on large-scale Linux-based HPC environments and hybrid cloud/HPC platforms.
Job Duties and Responsibilities May Include
- Support research on HPC workflows to advance the mission of the National Center for Computational Sciences.
- Workflow Design and Orchestration: Architect, implement, and maintain HPC workflows and pipelines that leverage job schedulers (e.g., SLURM, PBS) with job dependencies, arrays, and resource‑aware templates. Establish reproducible execution patterns, including environment setup, module management, data staging, and cleanup.
- Scripting and Tooling: Develop command‑line tools and automation in Python, Bash, and/or C/C++ to encapsulate workflow steps, manage configuration files (e.g., YAML/JSON), and implement robust logging, error handling, and checkpoint/retry strategies.
- Operational Reliability and Optimization: Diagnose job failures, mitigate bottlenecks, and improve throughput, latency, and resource utilization. Use scheduler and Linux tools (e.g., sacct, squeue, coreutils, ssh, tmux, top, iostat) to monitor, analyze, and tune workflows.
- Automation and Version Control: Implement CI/CD practices for workflow deployment, create templates and reusable libraries, and manage changes with Git. Automate environment provisioning and repeatable execution across systems and users.
- Collaboration and User Enablement: Consult with researchers to understand requirements, translate them into executable workflows, and provide documentation, training, and examples. Partner with operations teams to align workflows with policies and best practices.
- Observability and Reporting: Build simple status dashboards or reports for workflow health and progress. Aggregate job metrics, queue statistics, and resource usage to inform planning and continuous improvement.
- Security and Compliance: Apply basic cyber‑security principles (e.g., SSH key hygiene, least privilege, firewall rules) to workflow design and operations. Handle credentials and secrets responsibly.
- Documentation and Support: Author clear, user‑focused documentation and contribute to playbooks, runbooks, and knowledge bases. Participate in an on‑call rotation for critical workflows as needed.
- Cloud and Hybrid HPC Integration: Design and operate workflows on public cloud platforms (AWS, Azure, or GCP) and in hybrid on‑prem/cloud environments. Leverage cloud object storage (e.g., Amazon S3) for data staging and artifacts; implement parallel, secure data movement and lifecycle policies.
Qualifications
- Ph.D. in Computer Science, Computer Engineering, Computational Engineering, or a closely related field.
- At least 2 years of experience working with Linux‑based systems; familiarity with core utilities and managed services such as coreutils, ssh, tmux, and common system services.
- At least 1 year of programming experience in one or more of Python, C/C++, or Bash.
- Strong verbal and written communication skills, with the ability to collaborate across technical and scientific teams.
- Demonstrated experience in leading scientific research and publishing in high‑impact venues.
- Experience with HPC job schedulers such as SLURM or PBS.
- Familiarity with basic cyber‑security principles (e.g., firewalls, network segmentation, secure configuration).
- Basic web development skills (e.g., HTML, CSS) for lightweight dashboards or documentation.
For employment at Oak Ridge National Laboratory (ORNL), a REAL ID form of identification will be required. Additionally, ORNL is subject to Department of Energy (DOE) access restrictions. All employees must also be able to obtain and maintain a federal…