Large neural network models have revolutionized natural language processing and computer vision, but their initialization and learning rates are often set heuristically, with little consistency across models and research papers. The μ-Parameterization (μP) offers a principled alternative, prescribing scaling rules for initialization and learning rates and, as reported in prior work, enabling zero-shot hyperparameter transfer from small to large models.
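
To make the idea concrete, the sketch below illustrates the core μP scaling rules for Adam as commonly stated (initialization std and learning rate per weight group as a function of the width multiplier). It is an illustrative sketch, not the implementation used in this work: the function name, the base hyperparameter values, and the three-way layer grouping are hypothetical, and a full μP setup involves further details such as replacing the 1/√d attention scale with 1/d.

```python
import numpy as np

def mup_scales(width, base_width, base_lr=1e-3, base_std=0.02):
    """Illustrative muP-style scaling rules for Adam (a sketch, not the
    authors' code). base_lr and base_std stand in for hyperparameters
    tuned on a small proxy model of width `base_width`."""
    m = width / base_width  # width multiplier relative to the proxy model
    return {
        # Embedding / input weights: init std and Adam LR stay constant.
        "embedding": {"init_std": base_std, "lr": base_lr},
        # Hidden weights: init std shrinks like 1/sqrt(m), Adam LR like 1/m.
        "hidden": {"init_std": base_std / np.sqrt(m), "lr": base_lr / m},
        # Output / unembedding weights: one common muP choice zero-initializes
        # them; their Adam LR still scales like 1/m.
        "output": {"init_std": 0.0, "lr": base_lr / m},
    }

# Tune hyperparameters at a small proxy width, then reuse them at a larger
# width via these rules -- the essence of zero-shot mu-Transfer.
print(mup_scales(width=4096, base_width=256))
```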

Despite the apparent advantages of μP, its implementation complexity and varied theoretical background may have hindered its widespread adoption. In this work, we aim to demystify μP and evaluate its effectiveness empirically, focusing on the ubiquitous transformer architecture. Our research question is straightforward: does μ-Transfer yield optimal learning rates in practice?

Through comprehensive experiments, we test models ranging from 2 million to 10 billion parameters. Our results show that μ-Transfer works as intended in the majority of important cases, providing a reliable recipe for initialization and learning-rate scaling. However, we also uncover surprising cases where it does not perform as expected.

By shedding light on the practical implications of μP, we hope to encourage further exploration and adoption of this promising approach. Our findings contribute to a deeper understanding of model initialization and learning rate scaling, offering insights that can enhance the efficiency and effectiveness of training large neural network models.