15:00-15:45
Title: Reusing Data in Policy Gradients to Improve Sample Efficiency
Speaker: Prof. Matteo Papini
Abstract: Policy gradient methods are reinforcement learning algorithms that optimize parametric policies via stochastic gradient ascent, typically using on-policy interaction data. It is well known that this reliance on on-policy data makes them sample-inefficient. A common strategy to improve efficiency is to reuse off-policy information from past iterations, such as old gradients or data collected with old policies. While gradient reuse has received substantial theoretical attention, leading to improved sample complexity guarantees, direct sample reuse remains largely unexplored from a theoretical perspective, mostly because of the challenges arising from distribution shift. I will present an actor-only policy gradient algorithm based on a novel multiple importance sampling estimator, designed to reuse trajectory data collected from the k most recent policies. The algorithm comes with theoretical guarantees of improved sample efficiency and promising empirical results. I will also present some preliminary results on data reuse in Proximal Policy Optimization (PPO).
Bio: Matteo Papini is a Tenure-Track Assistant Professor at the University of Milan, Italy, where he teaches Reinforcement Learning and Machine Learning. Previously, he was a postdoctoral researcher at Politecnico di Milano and Universitat Pompeu Fabra (Barcelona, Spain), and a research intern at Facebook AI Research (now Meta). He holds a PhD in Information Technology from Politecnico di Milano. His main research interests are reinforcement learning and online learning, with a focus on policy gradient algorithms and continuous-space RL theory.
15:45-16:30
Title: Bridging Rested and Restless Bandits with Graph-Triggering
Speaker: Dr. Gianmarco Genalti
Abstract: Rested and Restless Bandits are two well-known bandit settings that are useful for modeling real-world sequential decision-making problems in which the expected reward of an arm evolves over time, either because of the actions we perform or because of nature. I will present the Graph-Triggered Bandits (GTBs) setting, a unifying framework that generalizes and extends rested and restless bandits. In this setting, the evolution of the arms’ expected rewards is governed by a graph defined over the arms. An edge connecting a pair of arms (i, j) means that a pull of arm i triggers the evolution of arm j, and vice versa. Interestingly, rested and restless bandits are both special cases of our model for suitable (degenerate) graphs. As relevant case studies for this setting, I will focus on two specific types of monotonic bandits: rising, where the expected reward of an arm grows as the number of triggers increases, and rotting, where the opposite behavior occurs. For these cases, it is possible to characterize the optimal policies. I will present suitable algorithms for all scenarios and discuss their theoretical guarantees, highlighting how the complexity of the learning problem depends on instance-dependent terms that encode specific properties of the underlying graph structure.
Bio: Gianmarco Genalti is a Postdoctoral Researcher at the Department of Electronics, Information and Bioengineering of Politecnico di Milano. His research focuses on the theory of multi-armed bandits and online algorithms, with particular interest in dynamic environments, heavy-tailed rewards, and learning-augmented online algorithms.
Public events of RIKEN Center for Advanced Intelligence Project (AIP)