Authors: Yangjun Ruan, Chris J. Maddison, Tatsunori Hashimoto

Understanding how language model performance varies with scale is essential for benchmarking and algorithm development. Traditional scaling laws require training models at many different scales, which limits their use because of the extensive compute involved. This paper introduces an alternative, observational approach that constructs scaling laws from approximately 80 publicly available models, avoiding any new model training.

Creating a unified scaling law from diverse model families is challenging because families differ substantially in training compute efficiency and capabilities. The study shows, however, that these variations are consistent with a simple, generalized scaling law: language model performance is a function of a low-dimensional capability space, and model families differ only in how efficiently they convert training compute into capabilities.
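The idea above can be sketched with a toy experiment. The code below uses entirely synthetic, hypothetical data (the family count, efficiencies, and benchmark loadings are illustrative assumptions, not values from the paper): a single latent capability drives all benchmark scores, each family maps log-compute to capability with its own efficiency, and PCA recovers the shared low-dimensional capability space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 12 models from 3 families, each scored on 6
# benchmarks. One latent capability drives every benchmark, and each
# family converts log-compute into capability with its own efficiency.
log_compute = np.tile(np.linspace(20, 26, 4), 3)       # log-FLOPs per model
family = np.repeat([0, 1, 2], 4)
efficiency = np.array([1.0, 0.8, 1.2])[family]         # family-specific
capability = efficiency * (log_compute - 20)           # latent capability
loadings = rng.uniform(0.5, 1.5, size=6)               # per-benchmark weights
scores = np.outer(capability, loadings) + rng.normal(0, 0.05, (12, 6))

# PCA (via SVD) on the standardized score matrix recovers the shared
# low-dimensional capability space across families.
X = (scores - scores.mean(0)) / scores.std(0)
_, s, vt = np.linalg.svd(X, full_matrices=False)
explained = s**2 / (s**2).sum()
pc1 = X @ vt[0]                                        # principal capability axis

# Within each family, capability is roughly linear in log-compute;
# the slope magnitude reflects that family's compute efficiency.
slopes = {f: np.polyfit(log_compute[family == f], pc1[family == f], 1)[0]
          for f in range(3)}
print(f"PC1 explains {explained[0]:.0%} of benchmark variance")
```

Because the synthetic scores are driven by one latent factor, the first principal component captures nearly all benchmark variance, and the per-family slopes differ in proportion to the assumed efficiencies.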

The findings reveal the surprising predictability of complex scaling phenomena. Several emergent capabilities follow a smooth, sigmoidal curve in capability space and can be predicted from smaller models. The performance of advanced models such as GPT-4 can likewise be forecast accurately from simpler, non-agentic benchmarks. The study also shows how to predict the impact of post-training interventions, such as chain-of-thought prompting and self-consistency, as language model capabilities continue to advance.
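A minimal sketch of how sigmoidal behavior permits prediction from smaller models, using noiseless synthetic data (the capability values and sigmoid parameters are illustrative assumptions, not results from the paper): a logistic curve is fit to the smaller models only, then extrapolated to held-out larger ones.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(x, k, x0):
    # Smooth logistic "emergence" curve over the capability axis.
    return 1.0 / (1.0 + np.exp(-k * (x - x0)))

# Hypothetical capability scores for 10 models of increasing scale, and a
# benchmark that looks abruptly "emergent" against raw scale but is
# sigmoidal in capability space.
capability = np.linspace(0.0, 9.0, 10)
accuracy = sigmoid(capability, 1.5, 6.0)

# Fit only on the 7 smallest models (accuracy still at or below 50%),
# then extrapolate to the three largest, held-out models.
(k, x0), _ = curve_fit(sigmoid, capability[:7], accuracy[:7], p0=[1.0, 5.0])
pred = sigmoid(capability[7:], k, x0)
```

With clean data the fitted curve recovers the held-out accuracies closely; with real benchmark noise the extrapolation would carry wider error bars, which is why smooth behavior in capability space, not raw scale, is what makes such forecasts feasible.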

This approach offers a more efficient method for understanding and predicting language model performance, providing valuable insights for future model development and optimization. By leveraging publicly available models and focusing on a generalized scaling law, this research paves the way for more accessible and scalable benchmarking practices in the field of language modeling.