• Author(s): Seungwook Han, Idan Shenfeld, Akash Srivastava, Yoon Kim, Pulkit Agrawal

Adapting large language models (LLMs) to cater to diverse human preferences, learn new skills, and unlearn harmful behavior is a crucial challenge. Traditional search-based methods, such as Best-of-N or Monte Carlo Tree Search, are effective but impractical due to their high inference cost. Reinforcement learning (RL) methods, on the other hand, are computationally efficient but struggle with the optimization challenges of co-training the value function and the policy. To address these limitations, the authors propose Value Augmented Sampling (VAS), a framework that maximizes arbitrary reward functions using data sampled from the initial, frozen LLM, without requiring access to the model's weights.
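The summary does not spell out how that frozen-LLM data is used, so the following is only a minimal sketch of one plausible pipeline: sample completions from the frozen base model, score them with the reward function, and use the resulting (prompt, completion, reward) triples to fit a separate value model. The helper names (`sample_from_frozen_llm`, `reward_fn`) are placeholders, not the paper's API.

```python
def sample_from_frozen_llm(prompt: str, n: int) -> list:
    # Stand-in for querying the frozen base LLM (e.g. through an API);
    # in practice this would call the model being adapted.
    return [f"{prompt} ... completion {i}" for i in range(n)]

def reward_fn(text: str) -> float:
    # Stand-in for the reward model whose score is being maximized.
    return float(len(text) % 7)

def build_value_dataset(prompts, n_samples=8):
    """Label completions sampled from the frozen LLM with their rewards.
    A separate value model is then regressed on these targets, so the
    LLM's own weights are never modified."""
    data = []
    for prompt in prompts:
        for completion in sample_from_frozen_llm(prompt, n_samples):
            data.append((prompt, completion, reward_fn(completion)))
    return data

# Toy usage: one prompt, eight reward-labeled completions.
dataset = build_value_dataset(["Summarize the article:"])
print(len(dataset), "reward-labeled examples for value-model training")
```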

The key innovation of VAS lies in its ability to optimize the reward-maximizing policy without co-training the policy and the value function, which keeps the optimization stable. The approach outperforms established baselines such as PPO and DPO on standard benchmarks and achieves results comparable to Best-of-128 at a lower inference cost. Unlike existing RL methods, VAS does not require modifying the weights of the pre-trained LLM, making it suitable for adapting models that are only available through APIs, such as ChatGPT.
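To make the decoupling concrete, here is a hedged sketch of what value-guided decoding could look like under the assumption that the trained value model scores candidate next tokens and its estimates are added to the frozen LLM's logits, scaled by a steering coefficient `beta`. This is an illustrative reconstruction, not the authors' reference implementation.

```python
import numpy as np

def value_augmented_step(base_logits: np.ndarray,
                         value_scores: np.ndarray,
                         beta: float = 1.0,
                         top_k: int = 20) -> int:
    """Pick the next token by combining frozen-LLM logits with value estimates.

    base_logits:  logits over the vocabulary from the frozen base LLM.
    value_scores: estimated future reward for appending each candidate token,
                  produced by the separately trained value model.
    beta:         how strongly the value model steers sampling.
    """
    # Restrict to the base model's top-k tokens so the value model only
    # re-ranks plausible continuations rather than the whole vocabulary.
    candidates = np.argpartition(base_logits, -top_k)[-top_k:]

    # Combine: frozen-policy logit + beta * value estimate.
    combined = base_logits[candidates] + beta * value_scores[candidates]

    # Sample from the re-weighted distribution over the candidates.
    probs = np.exp(combined - combined.max())
    probs /= probs.sum()
    return int(np.random.choice(candidates, p=probs))

# Toy usage with random numbers standing in for real model outputs.
vocab_size = 1000
rng = np.random.default_rng(0)
next_token = value_augmented_step(rng.normal(size=vocab_size),
                                  rng.normal(size=vocab_size))
print("sampled token id:", next_token)
```

Because the base LLM only needs to expose its next-token distribution, this kind of procedure can wrap an API-served model without ever touching its weights.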

Furthermore, VAS unlocks the ability to compose multiple rewards and control the weight of each at deployment time, paving the way for aligned and personalized LLMs. This has significant implications for the use of LLMs across applications such as natural language processing, dialogue systems, and human-computer interaction.
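The summary does not describe the composition mechanism, but one natural reading is that each reward (say, helpfulness and safety) has its own value model, and their per-token estimates are mixed with user-chosen weights before augmenting the logits, as in the sketch above. The weights and reward names below are hypothetical.

```python
import numpy as np

def compose_value_scores(score_sets, weights):
    """Weighted combination of per-reward value estimates over the vocabulary,
    ready to be plugged into value_augmented_step in place of value_scores."""
    combined = np.zeros_like(score_sets[0])
    for scores, weight in zip(score_sets, weights):
        combined += weight * scores
    return combined

# Toy usage: dial safety up relative to helpfulness at deployment time,
# without retraining either value model or the frozen LLM.
vocab_size = 1000
rng = np.random.default_rng(1)
helpfulness_scores = rng.normal(size=vocab_size)
safety_scores = rng.normal(size=vocab_size)
mixed_scores = compose_value_scores([helpfulness_scores, safety_scores],
                                    weights=[0.4, 0.6])
```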