System Engineer; Mandarin
Listed on 2026-02-24
IT/Tech
Systems Engineer
- Job Identification: 325973
- Job Category: Product Development
- Posting Date: 02/06/2026, 09:47 PM
- Job Type: Regular Employee
- Does this position require a security clearance? No
- Years: 6 to 10+ years
- Applicants are required to read, write, and speak the following languages: English
Here at OCI we’re building the world’s largest AI clusters, and we’re the fastest at bringing them to customers. The Strategic Customers, Engineering team (SCE) at OCI manages the relationships with some of our most significant AI infrastructure customers, who are leading innovation in AI/ML applications and are also key drivers of our revenue.
We are looking for a highly skilled GPU systems engineer to validate GPU performance and scalability on customer-representative systems hosted within OCI. You will work closely with OCI GPU teams and partners, as well as internal hardware and software development teams, to drive customer GPU deliveries and to enhance our AI infrastructure so that it delivers an exceptional customer experience and peak performance. You will also collaborate with and support internal and external stakeholders in diagnosing performance and benchmark-related issues.
Responsibilities
- Perform performance characterization on multi-GPU and multi-node systems
- Validate NCCL scalability across multiple GPUs, nodes, and clusters
- Ensure benchmark results are correct, repeatable, and statistically valid
- Validate system configurations including GPU topology, PCIe, NVLink, NVSwitch, and network fabrics
- Compare measured NCCL performance against expected bandwidth and latency models
- Ensure GPU benchmarks are correctly validated against CPV
- Identify performance regressions across driver, firmware, CUDA, and NCCL releases
- Debug NCCL performance issues related to:
  - GPU topology and affinity
  - Network interconnects (InfiniBand, RoCE)
  - CUDA, drivers, and system software
- Use NVIDIA profiling and debugging tools (Nsight Systems, Nsight Compute)
- Assist customers with benchmark setup, configuration, and best practices
- Provide actionable performance insights and recommendations
- Support and guide customers on system integration, performance testing, and characterization
- Provide technical support for internal teams and external customers on benchmark and performance issues
- Collaborate and troubleshoot with service teams on architecture, driver, CUDA, NCCL, and networking-related issues
- Reproduce and debug customer-reported performance problems
- Communicate findings clearly through reports, documentation, and presentations
- Support capacity program delivery, technical engagement, and planning
- Assist OCI service teams and partner teams such as NVIDIA in root-causing potential hardware or software bugs
- Be the voice of customers to OCI’s various cloud engineering teams
Qualifications
- BS or MS in Computer Engineering, Computer Science, or a related field, with 6+ years of experience in the cloud infrastructure space
- Solid understanding of cloud services, especially compute, network, and storage, as well as GPU architecture fundamentals
- Experience with multi-GPU and distributed systems. Hands-on experience with market-leading GPUs or AI platforms spanning development, bring-up, test, and characterization
- Hands-on experience running and analyzing GPU benchmarks
- Proficiency in Python, Bash, or similar scripting languages
- Experience with modern server platforms across x86 and ARM architectures
- Experience scripting and customizing diagnostics, validation, and test workflows
- Experience with GPU supplier test code and open-source AI test and characterization tools
- Experience with system integration, validation, and performance characterization
- Demonstrated ability to debug and root-cause complex hardware and software issues
- Proven ability to provide cross-functional technical leadership and collaborate effectively with internal teams and external partners
- Experience in scripting and automation using tools like Ansible, Terraform, and/or Kubernetes
- Strong communication and collaboration skills, with the ability to work effectively in cross-functional teams and convey technical concepts to non-technical stakeholders
- Strong Linux skills with hands-on experience in Oracle Linux/RHEL/CentOS,…