Abstract:
The growing scale of deep learning models has rendered standard hyperparameter (HP) optimization prohibitively expensive. A promising solution is the use of scale-aware hyperparameters, which can enable direct transfer of optimal HPs from small-scale grid searches to large models with minimal performance loss. To understand the principles governing such transfer strategies, we develop a conceptual framework for reasoning about HP transfer across scale. In synthetic settings, we present quantitative examples where transfer either offers a provable computational advantage or fails even under μP. To explain the fast transfer observed in practice, we conjecture that decomposing the optimization trajectory reveals two contributions to loss reduction: (1) a width-stable component that determines the optimal HPs and (2) a width-sensitive component that improves with width but weakly perturbs the HP optimum. We present empirical evidence for this hypothesis in large language model pretraining.
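For readers unfamiliar with scale-aware hyperparameters, the sketch below illustrates the general idea behind μP-style transfer: tune learning rates on a small proxy model, then rescale the width-sensitive parameter groups before training the large model. The function, group names, and the specific Adam-style 1/width rule are illustrative assumptions based on the commonly cited μP prescription, not the specific method presented in the talk.

```python
# Illustrative sketch only: a toy version of "scale-aware" HP transfer in the
# spirit of muP-style learning-rate scaling. The rule below (Adam-style
# hidden-layer LR divided by the width multiplier) follows the commonly cited
# muP prescription; all names here are hypothetical.

def transfer_learning_rates(base_lrs, base_width, target_width):
    """Rescale per-group learning rates tuned at `base_width`
    for a model of `target_width`.

    base_lrs: dict mapping parameter-group name -> LR found by a
    small-scale grid search. Hidden ("matrix-like") parameters have
    their Adam LR divided by the width multiplier; vector-like
    parameters (biases, layer norms) keep the tuned LR unchanged.
    """
    width_mult = target_width / base_width
    scaled = {}
    for group, lr in base_lrs.items():
        if group == "hidden_matrices":
            scaled[group] = lr / width_mult  # width-sensitive group: rescale
        else:
            scaled[group] = lr               # width-stable group: transfer as-is
    return scaled


# Example: HPs tuned on a width-256 proxy model, reused at width 4096.
base_lrs = {"hidden_matrices": 3e-3, "vectors_and_norms": 3e-3}
print(transfer_learning_rates(base_lrs, base_width=256, target_width=4096))
```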
Speaker Bio:
Dr. Denny Wu is a Faculty Fellow at the Center for Data Science, New York University, and the Flatiron Institute. His research focuses on developing a mathematical foundation for modern machine learning systems, particularly neural networks. He obtained his Ph.D. in Computer Science from the University of Toronto and the Vector Institute, under the supervision of Prof. Jimmy Ba and Prof. Murat A. Erdoğdu, and completed his undergraduate studies at Carnegie Mellon University under the supervision of Prof. Ruslan Salakhutdinov.