A Brief Survey of Deep Reinforcement Learning
Reinforcement Learning Algorithms
There are two main approaches to solving RL problems: methods based on value functions and methods based on policy search. There is also a hybrid, actor-critic approach, which employs both value functions and policy search.
Value Functions
Value function methods are based on estimating the value (expected return) of being in a given state. The state-value function \(V^\pi(s)\) is the expected return when starting in state \(s\) and following \(\pi\) henceforth:
\[V^\pi(s)=\mathbb{E}[R \mid s, \pi]\]The optimal policy, \(\pi^*\), has a corresponding optimal state-value function \(V^*(s)\), which can be defined as:
\[V^*(s) = \max_\pi V^\pi (s) \quad \forall s \in \mathcal{S}\]If we had \(V^*(s)\) available, the optimal policy could be retrieved by choosing among all actions available at \(s_t\) and picking the action \(a\) that maximizes \(\mathbb{E}_{s_{t+1}\sim\mathcal{T}(s_{t+1} \mid s_t, a)}[V^*(s_{t+1})]\).
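To make this concrete, the following is a minimal sketch of greedy action selection from \(V^*\) when a transition model is assumed to be known. The tabular arrays T (transition probabilities) and V_star are hypothetical inputs introduced purely for illustration; as noted below, such a model is generally unavailable in the RL setting, which motivates the quality function.

```python
import numpy as np

def greedy_action_from_v(s, V_star, T):
    """Pick the action maximizing E_{s' ~ T(s' | s, a)}[V*(s')].

    T is a hypothetical tabular model of shape (S, A, S) with
    T[s, a, s'] = p(s' | s, a); V_star has shape (S,).
    """
    expected_next_values = T[s] @ V_star  # shape (A,): one expectation per action
    return int(np.argmax(expected_next_values))
```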
In the RL setting, however, the transition dynamics \(\mathcal{T}\) are unavailable. Therefore, we construct another function, the state-action-value or quality function \(Q^\pi(s,a)\), which is similar to \(V^\pi\), except that the initial action \(a\) is provided, and \(\pi\) is only followed from the succeeding state onwards:
\[Q^\pi(s,a)=\mathbb{E}[R \mid s, a, \pi]\]The best policy, given \(Q^\pi(s,a)\), can be found by choosing \(a\) greedily at every state: \(\arg \max_a Q^\pi (s, a)\). Under this policy, we can also define \(V^\pi (s)\) by maximizing \(Q^\pi(s,a)\): \(V^\pi(s)=\max_a Q^\pi(s, a)\).
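A minimal sketch, assuming a tabular array Q[s, a], of why the quality function is convenient: acting greedily with respect to \(Q\) requires no transition model, in contrast to the \(V^*\)-based selection above.

```python
import numpy as np

def greedy_action(Q, s):
    # arg max_a Q(s, a): greedy policy derived from Q
    return int(np.argmax(Q[s]))

def state_value(Q, s):
    # V(s) = max_a Q(s, a) under the greedy policy
    return float(np.max(Q[s]))
```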
Dynamic Programming
To actually learn \(Q^\pi\), we exploit the Markov property and define the function as a Bellman equation, which has the following recursive form:
\[Q^\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}} [r_{t+1} + \gamma Q^\pi (s_{t+1}, \pi(s_{t+1}))]\]This means that \(Q^\pi\) can be improved by bootstrapping: we can use the current values of our estimate of \(Q^\pi\) to improve that estimate. This is the foundation of Q-learning and the state-action-reward-state-action (SARSA) algorithm:
\[Q^\pi(s_t, a_t) \leftarrow Q^\pi (s_t, a_t) + \alpha \delta\]where \(\alpha\) is the learning rate and \(\delta = Y-Q^\pi(s_t, a_t)\) the temporal difference (TD) error; here, \(Y\) is a target as in a standard regression problem. SARSA, an on-policy learning algorithm, improves the estimate of \(Q^\pi\) by using transitions generated by the behavioral policy (the policy derived from \(Q^\pi\)), which results in setting \(Y=r_t+\gamma Q^\pi (s_{t+1}, a_{t+1})\). Q-learning is off-policy, as \(Q^\pi\) is instead updated by transitions that were not necessarily generated by the derived policy. Q-learning uses \(Y=r_t + \gamma \max_a Q^\pi (s_{t+1}, a)\), which directly approximates \(Q^*\). To find \(Q^*\) from an arbitrary \(Q^\pi\), we use generalized policy iteration, where policy iteration consists of policy evaluation and policy improvement. Policy evaluation improves the estimate of the value function, which can be achieved by minimizing TD errors from trajectories experienced by following the policy.
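The two TD updates above can be sketched for a tabular Q[s, a] array as follows. The variable names (s, a, r, s_next, a_next) and the default values of \(\alpha\) and \(\gamma\) are assumptions for illustration only.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy target: uses the action a_next actually taken by the behavioral policy.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])  # TD error delta = target - Q(s, a)

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy target: uses the greedy action at the next state, regardless of
    # which action the behavioral policy goes on to take.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```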
As the estimate improves, the policy can naturally be improved by choosing actions greedily based on the updated value function. Instead of performing these steps separately to convergence, generalized policy iteration allows for interleaving the steps, such that progress can be made more rapidly.
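One way to picture this interleaving is a loop in which every transition both evaluates the current estimate (a TD update) and improves the policy (acting \(\epsilon\)-greedily with respect to the updated values). The sketch below assumes a hypothetical Gym-style environment `env` with `reset()` and `step()` methods and reuses the `sarsa_update` helper sketched above.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1):
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])  # explore
    return int(np.argmax(Q[s]))               # exploit: policy improvement step

def run_episode(env, Q, epsilon=0.1):
    s = env.reset()
    a = epsilon_greedy(Q, s, epsilon)
    done = False
    while not done:
        s_next, r, done = env.step(a)              # hypothetical env interface
        a_next = epsilon_greedy(Q, s_next, epsilon)  # improved policy
        sarsa_update(Q, s, a, r, s_next, a_next)     # policy evaluation step
        s, a = s_next, a_next
```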
Sampling
Instead of bootstrapping value functions using dynamic programming methods, Monte Carlo methods estimate the expected return from a state by averaging the return from multiple rollouts of a policy.
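A minimal sketch of this idea, under the assumption of a hypothetical environment `env` (whose `reset` accepts a start state) and a `policy` callable: roll the policy out several times from the same state and average the discounted returns.

```python
import numpy as np

def mc_value_estimate(env, policy, start_state, n_rollouts=100, gamma=0.99):
    """Monte Carlo estimate of V^pi(start_state) by averaging rollout returns."""
    returns = []
    for _ in range(n_rollouts):
        s = env.reset(start_state)        # hypothetical: reset to a chosen state
        done, ret, discount = False, 0.0, 1.0
        while not done:
            s, r, done = env.step(policy(s))
            ret += discount * r
            discount *= gamma
        returns.append(ret)
    return float(np.mean(returns))        # average return over rollouts
```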