Job Summary:
ByteDance is a pioneering technology company dedicated to advancing artificial general intelligence. We are looking for a Research Engineer Graduate to improve the reliability and performance of large-scale AI training systems, collaborating with model and infrastructure teams to enhance system scalability and efficiency.
Responsibilities:
• Improve the reliability and performance of large-scale training systems across pre-training, fine-tuning, evaluation, and inference
• Build observability, profiling, and debugging tools for distributed ML workloads
• Identify and optimize performance bottlenecks across GPU, networking, and storage layers
• Contribute to distributed training frameworks in multi-GPU and multi-node environments
• Collaborate with model and infrastructure teams to improve system scalability and efficiency
• Support incident analysis and operational stability
Qualifications:
Required:
• Currently completing or recently completed a PhD in Software Development, Computer Science, Computer Engineering, or a related technical discipline
• Strong programming skills in C++ and Python
• Solid understanding of PyTorch training workflows and distributed runtime behavior
• Familiarity with CUDA execution, NCCL communication, and GPU systems fundamentals
Preferred:
• Experience with performance profiling and debugging tools (e.g., torch.profiler, Nsight)
• Familiarity with distributed training or parallelization strategies (e.g., FSDP, Megatron-LM)
• Ability to analyze and optimize performance in complex ML training systems
Company:
ByteDance is a technology company that develops content creation platforms and services. Founded in 2012, the company is headquartered in Beijing, China, and has more than 10,000 employees. The company is currently late-stage.