Talk Title: Towards Robust, Efficient and Practical Decision Making: From Reward-Maximizing Deep Reinforcement Learning to Reward-Matching GFlowNets
Recent years have witnessed the great success of deep reinforcement learning (RL) in many challenging tasks, including large language models (LLMs), computer games, and robotics. Yet solely optimizing a reward proxy and learning a reward-maximizing policy is often not enough: diversity of the generated states is desirable in a wide range of practical scenarios such as dialogue systems, drug discovery, and recommender systems. For example, in typical RLHF settings the proxy reward function is itself uncertain and imperfect compared to the gold reward model, so it is not sufficient to search only for the return-maximizing solution. Instead, we would like to sample many high-reward candidates, which can be achieved by sampling terminal states with probability proportional to their rewards. The Generative Flow Network (GFlowNet), a probabilistic generative model proposed by Yoshua Bengio and collaborators in 2021, trains an agent to learn a stochastic policy for object generation such that the probability of generating an object is proportional to a given reward function, i.e., a reward-matching policy. Its effectiveness has been demonstrated in discovering high-quality and diverse solutions for LLM alignment, molecule generation, and other domains. This talk covers my recent research on three important challenges in such decision-making systems. First, how can we ensure robust learning behavior and value estimation of the agent? Second, how can we improve its learning efficiency? Third, how can we successfully apply these methods to important practical applications such as computational sustainability problems and combinatorial optimization?
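To make the reward-matching idea concrete, the sketch below shows one common training objective from the GFlowNet literature, the trajectory-balance loss, which drives the probability of sampling a terminal object x toward being proportional to its reward R(x). This is an illustrative sketch with placeholder names and values, not code from the talk or any specific work discussed in it.

# Illustrative sketch (assumed, not from the talk): the trajectory-balance
# objective used to train GFlowNets toward a reward-matching policy,
# i.e., P(x) proportional to R(x) for terminal objects x.
import torch

def trajectory_balance_loss(log_Z, log_pf_sum, log_pb_sum, log_reward):
    # log_Z       : learnable scalar, log of the total flow (partition function)
    # log_pf_sum  : sum of log forward-policy probabilities along one trajectory
    # log_pb_sum  : sum of log backward-policy probabilities along the same trajectory
    # log_reward  : log R(x) of the terminal object x reached by the trajectory
    # The loss is the squared mismatch between the forward and backward flows;
    # minimizing it over many trajectories makes sampling proportional to R(x).
    return (log_Z + log_pf_sum - log_pb_sum - log_reward) ** 2

# Usage with placeholder values (hypothetical numbers for illustration only):
log_Z = torch.tensor(0.0, requires_grad=True)
loss = trajectory_balance_loss(log_Z,
                               log_pf_sum=torch.tensor(-2.3),
                               log_pb_sum=torch.tensor(-1.1),
                               log_reward=torch.tensor(0.5))
loss.backward()  # gradients would also flow into the policy networks in practice

In a full implementation, log_pf_sum and log_pb_sum would come from neural forward and backward policies evaluated along sampled construction trajectories; here they are constants purely to keep the example self-contained.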