## On the generation of representations for reinforcement learning

### Sun, Yi ; Schmidhuber, Jürgen (Dir.)

### Thèse de doctorat : Università della Svizzera italiana, 2012 ; 2012INFO001.

Creating autonomous agents that learn to act from sequential interactions has long been perceived as one of the ultimate goals of Artificial Intelligence (AI). Reinforcement Learning (RL), a subfield of Machine Learning (ML), addresses important aspects of this objective. This dissertation investigates a particular problem encountered in RL called representation generation. Two related... Plus

Ajouter à la liste personnelle- Summary
- Creating autonomous agents that learn to act from sequential interactions has long been perceived as one of the ultimate goals of Artificial Intelligence (AI). Reinforcement Learning (RL), a subfield of Machine Learning (ML), addresses important aspects of this objective. This dissertation investigates a particular problem encountered in RL called representation generation. Two related sub-problems are considered, namely basis generation and model learning, concerning which we present three pieces of original research. The first contribution considers a particular basis generation method called online kernel sparsification (OKS). OKS was originally proposed for recursive least squares regression, and shortly thereafter extended to RL. Despite the popularity of the method, important theoretical questions are still to be answered. In particular, it was unclear how the size of the OKS dictionary, or equivalently the number of basis functions constructed, grows in relation to the amount of data available. Characterizing this growth rate is crucial to understanding OKS, both in terms of its computational complexity and, perhaps more importantly, the generalization capability of the resulting linear regressor or value function estimator. We investigate this problem using a novel formula expressing the expected determinant of the kernel Gram matrix in terms of the eigenvalues of the covariance operator. Based on this formula, we are able to connect the cardinality of the dictionary with the eigen- decay of the covariance operator. In particular, we prove that under certain technical conditions, the size of the dictionary will always grow sub-linearly in the number of data points, and, as a consequence, the kernel linear regressor or value function estimator constructed from the resulting dictionary is consistent. The second contribution turns to a different class of basis generation methods, which make use of reward information. We introduce a new method called V-BEBF. V-BEBF relies on a principle that is different from that of previous approaches based on Bellman error basis function (BEBF), in which approximations to the value function of the Bellman error, rather than to the Bellman erroritself as in BEBF, are added as new basis functions. This approach is justified by a simple yet previously unspotted insight, i.e., V-BEBF, if computed exactly, is in fact the error in value estimation, and therefore its addition to the existing set of basis functions immediately allows the value function to be represented accurately. We demonstrate that V-BEBF is a promising alternative to BEBF, especially when the discount factor approaches $1$, in which case it is proven that BEBF, even if computed exactly, can be very inefficient. Limited experiments, where both V- BEBFs and BEBFs are approximated using linear combinations of the input features, are also conducted, and the result is in line with the theoretical finding. The last contribution focuses on model learning, especially learning the transition model of the environment. The problem is investigated under a Bayesian framework, where the learning is done by probabilistic inference, and the learning progress is measured using Shannon information gain. In this setting, we show that the problem can be formulated as an RL problem, where the reward is given by the immediate information gain resulting from performing the next action. This shows that the model-learning problem can in principle be solved using algorithms developed for RL. In particular, we show theoretically that if the environment is an MDP, then near optimal model learning can be achieved following this approach.