Senior DevOps Service Reliability Operations Engineer – DGX Cloud
Listed on 2025-12-31
IT/Tech
Cloud Computing, Systems Administrator
NVIDIA’s NGC team is looking for highly motivated System Administrator/DevOps engineers to design, develop, and implement a global, dynamic, innovative Service Reliability Operations Center that provides extraordinary levels of support for our Cloud products and services. As a key member of the CIS Team (Compute Infrastructure Support), you will partner with other key members of our organization, including Site Reliability Engineering, the Security Operations Center, DevOps teams, and other partners, to help make our services capable of providing near 100% availability.
On the rare occasion that an incident occurs, you will be our front line to decrease the frequency and duration of any issue. Working in partnership with the development community, the CIS team will develop monitors, alarms, and alerts to help make the service more reliable and improve our customer experience.
- The team will provide services 24/7 with a follow-the-sun environment spanning continents.
- You will report directly to a manager in the United States.
- Some CIS shifts require either a Saturday or Sunday each week.
- Hours include an early or late start (a schedule of 10 hours per day, 4 days per week) to ensure 24/7 coverage.
- All team members use alerts and alarms to prevent incidents.
- You may work with developers to develop and implement predictive support or diagnostic routines.
- Perform systems, network, and security incident monitoring tasks.
- Work with developers to create runbooks, updating them as new features are added.
- Help discover incidents and initiate incident management procedures.
- Bring in subject matter authorities as needed to resolve issues.
- Provide feedback to help continually improve the service.
- Interactively keep the team engaged through resolution, ensuring clients feel valued.
- May perform other tasks that help provide extraordinary service levels.
- Highly motivated, with strong communication skills and the ability to work successfully with multi‑functional teams, principals, and architects, coordinating effectively across organizational boundaries and geographies.
- 5+ years of experience administering large-scale production systems.
- 3+ years of experience in high‑availability Internet, Cloud, or Data Center environments (Systems Administration, SRE, or NOC).
- BS in Computer Science, Engineering, Physics, Mathematics, or equivalent experience.
- Expert‑level knowledge of Linux system administration and automation using Ansible and/or Python.
- Strong experience with shell scripting, DNS, DHCP, storage systems, and core networking (iptables, routing, firewalls).
- Experience with at least one workload manager (Slurm preferred) or job scheduling system in a production environment.
- Strong cross‑team collaboration, documentation, and mentoring skills.
- Experience improving processes for automation, reliability, and operational excellence.
- Expertise using monitoring tools and problem ticketing systems.
- Strong problem‑solving, analytical, and troubleshooting abilities.
- Advanced hands‑on experience with Kubernetes, Slurm, and large‑scale cluster management.
- Familiarity with GPU hardware and high‑performance computing environments.
- Experience with observability and incident management tools (Grafana, OpenTelemetry, PagerDuty, Jira).
- Cloud experience (AWS, Azure, GCP) is a plus; strong preference for on‑prem expertise.
Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 144,000 USD – 230,000 USD for Level 3, and 168,000 USD – 270,250 USD for Level 4. You will also be eligible for equity and benefits.
Applications for this job will be accepted at least until November 18, 2025.
NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.