Research Scientist - Machine Learning System
Listed on 2026-01-14
IT/Tech
Machine Learning/ ML Engineer, Data Scientist, AI Engineer
Location:
San Jose
Team:
Technology
Employment Type:
Regular
Job Code: A214386
Responsibilities
AML-MLsys combines system engineering and the art of machine learning to develop and maintain massively distributed ML training and inference systems and services around the world, providing high‑performance, highly reliable, scalable systems for LLM, AIGC, and AGI. In our team, you will have the opportunity to build large‑scale heterogeneous systems integrating GPU, NPU, RDMA, and storage, keep them running stably and reliably, deepen your expertise in coding, performance analysis, and distributed systems, and participate in decision‑making processes.
You will also be part of a global team with members from the United States, China, and Singapore working collaboratively toward a unified project direction.
- Responsible for developing and optimizing LLM training, inference, and reinforcement learning (RL) frameworks.
- Work closely with model researchers to scale LLM training and RL to the next level.
- Responsible for GPU and CUDA performance optimization to create an industry‑leading, high‑performance engine for LLM training, inference, and RL.
Minimum Qualifications
- Bachelor’s degree or above, major in computer, electronics, automation, software, etc.
- Proficient in algorithms and data structures, familiar with Python.
- Understand the basic principles of deep learning algorithms, be familiar with the basic architecture of neural networks, and understand deep learning training frameworks such as PyTorch.
- Proficient in GPU high‑performance computing optimization technology on CUDA, in‑depth understanding of computer architecture, familiar with parallel computing optimization, memory access optimization, low‑bit computing, etc.
- Familiar with FSDP, DeepSpeed, JAX SPMD, Megatron‑LM, Verl, TensorRT‑LLM, ORCA, vLLM, SGLang, etc.
- Knowledge of LLMs and experience with LLM acceleration and optimization.
The base salary range for this position in the selected city is $136,800 - $359,720 annually.
Compensation may vary outside of this range depending on a number of factors, including a candidate’s qualifications, skills, competencies and experience, and location. Base pay is one part of the Total Package that is provided to compensate and recognize employees for their work, and this role may be eligible for additional discretionary bonuses, incentives, and restricted stock units.
Benefits
Employees have day‑one access to medical, dental, and vision insurance, a 401(k) savings plan with company match, paid parental leave, short‑term and long‑term disability coverage, life insurance, and wellbeing benefits, among others. Employees also receive 10 paid holidays per year, 10 paid sick days per year, and 17 days of paid personal time (prorated upon hire with increasing accruals by tenure).
The Company reserves the right to modify or change these benefits programs at any time, with or without notice.
For Los Angeles County (unincorporated) Candidates
Qualified applicants with arrest or conviction records will be considered for employment in accordance with all federal, state, and local laws including the Los Angeles County Fair Chance Ordinance for Employers and the California Fair Chance Act. Our company believes that criminal history may have a direct, adverse and negative relationship on the following job duties, potentially resulting in the withdrawal of the conditional offer of employment:
Founded in 2012, ByteDance’s mission is to inspire creativity and enrich life. With a suite of more than a dozen products, including TikTok, Lemon8, CapCut and Pico as well as platforms specific to the China market, including Toutiao, Douyin, and Xigua, ByteDance has made it easier and more fun for people to connect with, consume, and…