Job Summary:
ByteDance is a global technology company known for innovative products such as TikTok and CapCut. It is seeking a Research Scientist to join its AI infrastructure team, focusing on designing scalable architectures and optimizing performance across the AI factory stack.
Responsibilities:
• Design and evaluate scalable architectures across the full AI factory — compute, storage, networking, chips, power, and the data and application layers — for large-scale training, RL, and inference workloads. Develop technical proposals for supply-chain and energy constraints alongside silicon and software trade-offs.
• Track emerging trends across AI systems, distributed training and RL, and hardware acceleration, as well as adjacent fields such as cognitive science and psychology that inform AI memory and reasoning substrates. Build prototypes and share insights through technical reports.
• Analyze and optimize performance across the ML stack — scheduling, networking, storage, training and RL frameworks, and emerging AI memory systems for long-horizon agents — through benchmarking and bottleneck analysis.
• Work across research, engineering, hardware, data-center, and product teams to translate AI workload requirements into scalable solutions and drive cross-team initiatives spanning the full AI factory.
• Research intelligent fault localization and root cause analysis for large-scale AI clusters, combined with intelligent tuning of time-series databases to improve cluster stability.
• Develop serverless, high-performance elastic file systems and storage acceleration architectures specifically for AI scenarios, explore hardware-software co-optimization for DPUs, and overcome AI storage performance bottlenecks.
• Research GPU/CPU/MEM heterogeneous collaborative scheduling technologies, build a heterogeneous power orchestration system for AI agents, and address scheduling challenges including heterogeneous workloads and state dependencies.
• Optimize core vector retrieval technologies for LLM-powered applications, building a cloud-native distributed vector index engine to meet ultra-large-scale vector retrieval demands with low latency and low cost.
• Explore automatic infrastructure optimization based on AI agent workflows, build a self-evolving business agent framework, and enable full-stack intelligent optimization through AI for Infra.
Qualifications:
Required:
• Individuals who are completing or have recently completed a PhD in Computer Science, Computer Engineering, Electrical Engineering, or a related technical discipline.
• Backgrounds in cognitive science, computational neuroscience, or psychology are also welcome when paired with strong systems fundamentals.
• Experience in distributed systems, infrastructure engineering, or ML systems — including exposure to large-scale training or RL pipelines — and comfort evaluating trade-offs across hardware, software, algorithms, energy, and supply-chain constraints.
• Strong proficiency in integrating AI tools into knowledge discovery and research workflows.
• Demonstrated ability to learn quickly and stay productive in a fast-evolving technical landscape.
• Excellent communication skills to collaborate across teams.
Preferred:
• Experience with large-scale model training and inference — distributed pretraining, post-training, RL, KV cache–aware serving, GPU/accelerator optimization, and high-performance networking (e.g., RDMA, NCCL).
• Experience with heterogeneous AI compute systems, large-scale training clusters, HPC-style distributed workloads, and data pipelines for training and evaluation.
• Familiarity with AI memory systems, retrieval-augmented architectures, or agent long-term memory designs — bonus for exposure to cognitive-science or psychology literature on memory and reasoning.
• Exposure to chip-level design, data-center energy and cooling, or AI hardware supply-chain considerations across the AI factory.
• Publications in systems and/or machine learning conferences (e.g., NeurIPS, OSDI, SOSP, ASPLOS, MLSys).
• Contributions to open-source projects.
Company:
ByteDance is a technology company that develops content creation platforms and services. Founded in 2012, the company is headquartered in Beijing, China, and has more than 10,000 employees. The company is currently late stage.