

Exploration Driven by an Optimistic Bellman Equation

Samuele Tosatto; Carlo D'Eramo; Joni Pajarinen; Marcello Restelli; Jan Peters
In: International Joint Conference on Neural Networks (IJCNN-2019), July 14-19, Budapest, Hungary, Pages 1-8, IEEE, 2019.


Exploring high-dimensional state spaces and finding sparse rewards are central problems in reinforcement learning. Exploration strategies are frequently either naïve (e.g., simplistic ε-greedy or Boltzmann policies), intractable (i.e., full Bayesian treatment of reinforcement learning) or rely heavily on heuristics. The lack of a tractable but principled exploration approach unnecessarily complicates the application of reinforcement learning to a broader range of problems. Efficient exploration can be accomplished by relying on the uncertainty of the state-action value function. To obtain the uncertainty, we maintain an ensemble of value function estimates and present an optimistic Bellman equation (OBE) for such ensembles. This OBE is derived from a relative entropy maximization principle and yields an implicit exploration bonus resulting in improved exploration during action selection. The implied exploration bonus can be seen as a well-principled type of intrinsic motivation and exhibits favorable theoretical properties. OBE can be applied to a wide range of algorithms. We propose two algorithms as an application of the principle: Optimistic Q-learning and Optimistic DQN, which outperform comparison methods on standard benchmarks.
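
The idea of turning ensemble disagreement into an implicit exploration bonus can be pictured with a small sketch. It is an illustration under stated assumptions, not the paper's definition of the OBE: it aggregates an ensemble of Q-value estimates with a log-mean-exp (a soft maximum), and the names `optimistic_value`, `select_action`, and the temperature `eta` are hypothetical. When all ensemble members agree, the aggregate equals their common value; when they disagree, it rises above the mean, which favors uncertain actions.

```python
import numpy as np

def optimistic_value(q_ensemble, eta=1.0):
    """Log-mean-exp over the ensemble axis (axis 0).

    q_ensemble: array of shape (n_members, n_actions) holding each
    member's Q-value estimate for one state. eta > 0 is an assumed
    temperature; larger values weight the most optimistic member more.
    """
    q = np.asarray(q_ensemble, dtype=float)
    m = q.max(axis=0)  # subtract the per-action max for numerical stability
    return m + np.log(np.mean(np.exp(eta * (q - m)), axis=0)) / eta

def select_action(q_ensemble, eta=1.0):
    """Act greedily with respect to the optimistic aggregate."""
    return int(np.argmax(optimistic_value(q_ensemble, eta)))

# Two ensemble members, two actions. Both actions have mean value 1.0,
# but the members disagree only about action 1.
q = [[1.0, 0.0],
     [1.0, 2.0]]
values = optimistic_value(q)
action = select_action(q)
```

In this example a policy that is greedy on the ensemble mean sees a tie, while the optimistic aggregate prefers action 1, the action the ensemble is less certain about.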
