The following explanations are greatly inspired by the notebooks made by Omar Darwiche Domingues as part of the course Sequential Decision Making. The notebooks are available here.
In reinforcement learning, an agent interacts with an environment by taking actions and observing rewards. Its goal is to learn a policy, that is, a mapping from states to actions, that maximizes the amount of reward it gathers.
The environment is modeled as a Markov decision process (MDP), defined by a set of states $\mathcal{S}$, a set of actions $\mathcal{A}$, transition probabilities $P(s'|s, a)$, a reward function $r(s, a)$ and a discount factor $\gamma \in [0, 1)$.
A policy $\pi$ maps each state $s$ to a probability distribution over actions, $\pi(\cdot|s)$. Its value function $V^\pi$ is defined as
$$
V^\pi(s) = \sum_{a\in \mathcal{A}} \pi(a|s) Q^\pi(s, a),
\quad \text{where} \quad
Q^\pi(s, a) = \mathbf{E}\left[ \sum_{t=0}^\infty \gamma^t r(S_t, A_t) \Big| S_0 = s, A_0 = a\right].
$$
and represents the mean of the sum of discounted rewards gathered by the policy $\pi$ when starting from state $s$.
The action-value function $Q^\pi(s, a)$ is the mean of the sum of discounted rewards gathered by $\pi$ when starting from state $s$ and taking action $a$ first.
The optimal value function, defined as $V^*(s) = \max_\pi V^\pi(s)$, can be shown to satisfy $V^*(s) = \max_a Q^*(s, a)$, where $Q^*$ is the fixed point of the optimal Bellman operator $T^*$:

$$ Q^*(s, a) = T^* Q^*(s, a) $$

where, for any function $f: \mathcal{S}\times\mathcal{A} \to \mathbb{R}$,

$$ T^* f(s, a) = r(s, a) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} f(s', a'), $$

and there exists an optimal policy which is deterministic, given by $\pi^*(s) \in \arg\max_a Q^*(s, a)$.
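To make the operator concrete, here is a minimal NumPy sketch of $T^*$ for a finite MDP, assuming the transitions are stored in an array `P` of shape `(S, A, S)` and the mean rewards in an array `r` of shape `(S, A)`; the function name and array layout are illustrative conventions, not part of the original notebooks.

```python
import numpy as np

def bellman_optimal_operator(Q, P, r, gamma):
    """Apply the optimal Bellman operator T* to a Q table.

    Q: array of shape (S, A)     -- current action-value estimates
    P: array of shape (S, A, S)  -- P[s, a, s'] = P(s'|s, a)
    r: array of shape (S, A)     -- mean rewards r(s, a)
    gamma: discount factor in [0, 1)
    """
    # max_{a'} Q(s', a') for every next state s'
    max_next = Q.max(axis=1)          # shape (S,)
    # r(s, a) + gamma * sum_{s'} P(s'|s, a) * max_{a'} Q(s', a')
    return r + gamma * P @ max_next   # shape (S, A)
```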
If both the reward function $r$ and the transition probabilities $P$ are known, the fixed point $Q^*$ can be computed by value iteration:
1. Start with arbitrary $Q_0$, set $t=0$.
2. Compute $Q_{t+1}(s, a) = T^*Q_t(s,a)$ for every $(s, a)$.
3. If $\max_{s,a} | Q_{t+1}(s, a) - Q_t(s,a)| \leq \varepsilon$, return $Q_{t}$. Otherwise, set $t \gets t+1$ and go back to 2.
The convergence is guaranteed by the contraction property of the Bellman operator: $T^*$ is a $\gamma$-contraction in the sup norm, so the error $\max_{s,a}|Q_t(s,a) - Q^*(s,a)|$ decreases geometrically with $t$.
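As an illustration, here is a small NumPy sketch of the value iteration loop above, under the same assumed `(S, A, S)` / `(S, A)` array layout for `P` and `r`; the function name `value_iteration` is just a placeholder.

```python
import numpy as np

def value_iteration(P, r, gamma, epsilon=1e-6):
    """Value iteration on a finite MDP with arrays P (S, A, S) and r (S, A)."""
    S, A = r.shape
    Q = np.zeros((S, A))
    while True:
        # One application of the optimal Bellman operator T*
        Q_next = r + gamma * P @ Q.max(axis=1)
        # Stop when the sup-norm difference is below epsilon
        if np.max(np.abs(Q_next - Q)) <= epsilon:
            # Return the latest estimate and the greedy policy
            # pi*(s) in argmax_a Q*(s, a)
            return Q_next, Q_next.argmax(axis=1)
        Q = Q_next
```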
With Q-Learning, we don't need to know the transition probabilities: we can approximate $Q^*$ using only the transitions observed while interacting with the environment.
Q-Learning with $\varepsilon$-greedy exploration works as follows:
1. Start with arbitrary $Q_0$, get starting state $s_0$, set $t=0$.
2. Choose action $a_t$:
   - With probability $\varepsilon$, choose $a_t$ randomly (uniform distribution).
   - With probability $1-\varepsilon$, choose $a_t \in \arg\max_a Q_t(s_t, a)$.
3. Take action $a_t$, observe next state $s_{t+1}$ and reward $r_t$.
4. Compute the error $\delta_t = r_t + \gamma \max_a Q_t(s_{t+1}, a) - Q_t(s_t, a_t)$.
5. Update:
   - $Q_{t+1}(s, a) = Q_t(s, a) + \alpha_t(s,a) \delta_t$ if $s=s_t$ and $a=a_t$,
   - $Q_{t+1}(s, a) = Q_{t}(s, a)$ otherwise.
Here, $\alpha_t(s, a)$ is the learning rate, which may depend on the state-action pair and on the time step $t$ (for instance, decreasing with the number of visits to $(s, a)$).
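Below is a minimal sketch of tabular Q-Learning with $\varepsilon$-greedy exploration. It assumes a Gymnasium-style environment with discrete state and action spaces, uses a constant learning rate $\alpha$ instead of a schedule $\alpha_t(s, a)$, and skips the bootstrap term on terminal states, a detail not spelled out in the steps above.

```python
import numpy as np

def q_learning(env, n_steps, gamma=0.99, epsilon=0.1, alpha=0.1):
    """Tabular Q-Learning with epsilon-greedy exploration.

    Assumes a Gymnasium-style env with discrete observation and action spaces:
    env.reset() -> (state, info)
    env.step(a) -> (next_state, reward, terminated, truncated, info)
    """
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    state, _ = env.reset()
    for _ in range(n_steps):
        # epsilon-greedy choice of a_t
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))

        next_state, reward, terminated, truncated, _ = env.step(action)

        # TD error delta_t (no bootstrapping on terminal states)
        bootstrap = 0.0 if terminated else np.max(Q[next_state])
        delta = reward + gamma * bootstrap - Q[state, action]
        Q[state, action] += alpha * delta

        state = next_state
        if terminated or truncated:
            state, _ = env.reset()
    return Q
```

For example, `q_learning(gymnasium.make("FrozenLake-v1"), n_steps=100_000)` would return a Q table whose greedy policy can then be evaluated.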
Deep Q-Learning uses a neural network to approximate $Q^*$.
The parameters of the neural network are denoted by $\theta$.
- As input, the network takes a state $s$.
- As output, the network returns $Q(s, a, \theta)$, the value of each action $a$ in state $s$, according to the parameters $\theta$.
The goal of Deep Q-Learning is to learn the parameters $\theta$ so that $Q(s, a, \theta)$ is a good approximation of $Q^*(s, a)$.
In addition to the network with parameters $\theta$, the algorithm keeps a second network with the same architecture and parameters $\theta^-$, called the target network, which is used to compute the targets $y_i$ and is updated only every $N$ transitions.
The algorithm works as follows:
1. At each time $t$, the agent is in state $s_t$ and has observed the transitions $(s_i, a_i, r_i, s_i')_{i=1}^{t-1}$, which are stored in a replay buffer.
2. Choose action $a_t = \arg\max_a Q(s_t, a, \theta)$ with probability $1-\varepsilon_t$, and $a_t =$ random action with probability $\varepsilon_t$.
3. Take action $a_t$, observe reward $r_t$ and next state $s_t'$.
4. Add the transition $(s_t, a_t, r_t, s_t')$ to the replay buffer.
5. Sample a minibatch $\mathcal{B}$ containing $B$ transitions from the replay buffer. Using this minibatch, we define the loss:

    $$
    L(\theta) = \sum_{(s_i, a_i, r_i, s_i') \in \mathcal{B}}
    \left[
    Q(s_i, a_i, \theta) - y_i
    \right]^2
    $$

    where the targets $y_i = r_i + \gamma \max_{a'} Q(s_i', a', \theta^-)$ are computed with the target network.
6. Update the parameters $\theta$ to minimize the loss with a gradient descent step (keeping $\theta^-$ fixed):

    $$
    \theta \gets \theta - \eta \nabla_\theta L(\theta)
    $$

    where $\eta$ is the optimization learning rate.
7. Every $N$ transitions ($t \bmod N = 0$), update the target parameters: $\theta^- \gets \theta$.
8. Set $t \gets t+1$. Stop if $t = T$, otherwise go back to step 2.
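The sketch below illustrates the core of the update (steps 5 to 7) in PyTorch. The network architecture, buffer size and helper names (`make_q_network`, `dqn_update`) are arbitrary choices, and the `(1 - done)` factor, which disables bootstrapping at terminal states, is an extra detail not present in the pseudo-code above.

```python
import random
from collections import deque

import torch
import torch.nn as nn

# A small MLP mapping a state to one Q-value per action (sizes are arbitrary).
def make_q_network(state_dim, n_actions):
    return nn.Sequential(
        nn.Linear(state_dim, 64), nn.ReLU(),
        nn.Linear(64, n_actions),
    )

q_net = make_q_network(state_dim=4, n_actions=2)       # parameters theta
target_net = make_q_network(state_dim=4, n_actions=2)  # parameters theta^-
target_net.load_state_dict(q_net.state_dict())

optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay_buffer = deque(maxlen=10_000)  # stores (s, a, r, s', done) tuples
gamma, batch_size = 0.99, 32

def dqn_update():
    """One gradient step on the DQN loss, using the target network for y_i."""
    if len(replay_buffer) < batch_size:
        return
    batch = random.sample(replay_buffer, batch_size)
    s, a, r, s_next, done = map(
        lambda x: torch.as_tensor(x, dtype=torch.float32), zip(*batch)
    )
    # Q(s_i, a_i, theta): gather the value of the action actually taken
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    # y_i = r_i + gamma * max_a' Q(s_i', a', theta^-), with theta^- kept fixed
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    loss = ((q_sa - y) ** 2).sum()  # the loss L(theta) defined above
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def update_target():
    """Every N transitions: theta^- <- theta."""
    target_net.load_state_dict(q_net.state_dict())
```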
A2C keeps two neural networks:
- One network with parameters $\theta$ to represent the policy $\pi_\theta$.
- One network with parameters $\omega$ to represent a value function $V_\omega$, which approximates $V^{\pi_\theta}$.
At each iteration, A2C collects $M$ transitions $(s_i, a_i, r_i, s_i')_{i=1}^{M}$ by interacting with the environment using the current policy $\pi_\theta$.
Consider the following quantities, defined based on the collected transitions: an estimate $\widehat{Q}(s_i, a_i)$ of the return obtained after taking action $a_i$ in state $s_i$ and then following $\pi_\theta$, and the corresponding advantage estimate

$$
\mathbf{A}_\omega(s_i, a_i) = \widehat{Q}(s_i, a_i) - V_\omega(s_i).
$$
A2C then takes a gradient step to minimize the policy "loss" (keeping $\omega$ fixed):
$$ L_\pi(\theta) = -\frac{1}{M} \sum_{i=1}^M \mathbf{A}_\omega(s_i, a_i) \log \pi_\theta(a_i|s_i) - \frac{\alpha}{M}\sum_{i=1}^M \sum_a \pi_\theta(a|s_i) \log \frac{1}{\pi_\theta(a|s_i)} $$
and a gradient step to minimize the value loss (keeping $\theta$ fixed):

$$ L_V(\omega) = \frac{1}{M} \sum_{i=1}^M \left[ \widehat{Q}(s_i, a_i) - V_\omega(s_i) \right]^2. $$

The second term in $L_\pi$ is an entropy bonus, weighted by a coefficient $\alpha \geq 0$, which encourages exploration.
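Here is a compact PyTorch sketch of one A2C update. It assumes a discrete action space, a one-step bootstrapped estimate $\widehat{Q}(s_i, a_i) = r_i + \gamma V_\omega(s_i')$ (other return estimates are possible), and arbitrary network sizes and learning rates.

```python
import torch
import torch.nn as nn

# Illustrative actor and critic for a discrete action space.
policy_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))  # logits of pi_theta
value_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))   # V_omega
policy_opt = torch.optim.Adam(policy_net.parameters(), lr=7e-4)
value_opt = torch.optim.Adam(value_net.parameters(), lr=7e-4)
gamma, alpha = 0.99, 0.01  # discount factor and entropy coefficient

def a2c_update(states, actions, rewards, next_states):
    """One A2C update from M transitions (actions is a LongTensor of length M)."""
    values = value_net(states).squeeze(1)  # V_omega(s_i)
    with torch.no_grad():
        # One-step bootstrapped estimate Q_hat = r + gamma * V_omega(s')
        # (an assumption; n-step or Monte Carlo returns are also common)
        q_hat = rewards + gamma * value_net(next_states).squeeze(1)
    advantages = (q_hat - values).detach()  # A_omega(s_i, a_i)

    # Policy loss: -(1/M) sum A * log pi(a|s) - alpha * entropy (omega fixed)
    dist = torch.distributions.Categorical(logits=policy_net(states))
    log_probs = dist.log_prob(actions)
    policy_loss = -(advantages * log_probs).mean() - alpha * dist.entropy().mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    # Value loss: (1/M) sum (Q_hat - V_omega(s))^2 (theta fixed)
    value_loss = ((q_hat - values) ** 2).mean()
    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()
```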