Training Performance Engineer
Job in Santa Clara, Santa Clara County, California, 95053, USA
Listed on 2026-03-02
Listing for: ObjectWin Technology
Apprenticeship/Internship position
Job specializations:
- Software Development
- Machine Learning/ML Engineer, AI Engineer
Job Description & How to Apply Below
Training Performance Engineer
THE PERSON
The ideal candidate has experience with distributed training pipelines, is knowledgeable about distributed training algorithms (data parallel, tensor parallel, pipeline parallel, ZeRO), and is familiar with training large models.
KEY RESPONSIBILITIES
- Train large models to convergence on AMD GPUs.
- Improve the end-to-end training pipeline performance.
- Optimize the distributed training pipeline and algorithms to scale out.
- Contribute your changes to open source.
- Stay up to date with the latest training algorithms.
- Influence the direction of the AMD AI platform.
- Collaborate across teams with various groups and stakeholders.
QUALIFICATIONS
- Experience with ML frameworks such as PyTorch, JAX, or TensorFlow.
- Experience with distributed training and distributed training frameworks such as DeepSpeed.
- Experience with LLMs or vision models; experience with large models is a plus.
- Excellent Python programming skills, including debugging, profiling, and performance analysis.
- Experience with ML pipelines.
- Strong communication and problem-solving skills.