TL;DR: The authors train a novel model-based Reinforcement Learning agent. They use a VAE (V) to encode the state representation, an MDN-RNN (M) to predict the future, and a compact controller (C), optimized with CMA-ES, to take actions in the environment. The three models are trained independently, without feedback from one another. The VAE learns to encode the 2D game frames into a latent vector z. This latent vector, along with the corresponding action, is used to train the RNN, whose primary job is to predict the future state of the game based on its hidden representation h. Finally, both the current spatial representation z and the temporal representation h are used by C to take actions in the environment.
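
The interaction between V, M, and C described above can be sketched as a single rollout loop. This is only an illustrative sketch: `encode` and `rnn_step` are hypothetical placeholders rather than real networks, the action size of 3 and the `tanh` squashing are assumptions, and only the z and h sizes come from the notes below.

```python
import math
import random

Z_DIM, H_DIM, A_DIM = 32, 256, 3  # z and h sizes follow the notes; A_DIM is assumed


def encode(obs, rng):
    """Placeholder for V: a real VAE encoder would map pixels to z."""
    return [rng.gauss(0.0, 1.0) for _ in range(Z_DIM)]


def rnn_step(z, action, h):
    """Placeholder for M: mixes the new input into the hidden state h."""
    drive = (sum(z) + sum(action)) / (len(z) + len(action))
    return [math.tanh(0.9 * h_i + 0.1 * drive) for h_i in h]


def controller(z, h, weights, bias):
    """C: one linear layer over the concatenation [z, h], squashed by tanh."""
    x = z + h
    return [math.tanh(sum(w * x_i for w, x_i in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def rollout(steps, seed=0):
    """Run the V -> C -> M loop for a fixed number of steps."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 0.1) for _ in range(Z_DIM + H_DIM)]
               for _ in range(A_DIM)]
    bias = [0.0] * A_DIM
    h = [0.0] * H_DIM
    actions = []
    for _ in range(steps):
        obs = None                            # a real environment would return a frame
        z = encode(obs, rng)                  # V: compress the current frame
        a = controller(z, h, weights, bias)   # C: act on [z, h]
        h = rnn_step(z, a, h)                 # M: advance the temporal state
        actions.append(a)
    return actions
```

Calling `rollout(5)` returns five action vectors; in a real setup the placeholders would be replaced by the trained VAE and MDN-RNN.
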
- Random rollouts
- An iterative training procedure can be used instead, if complex environments require it
- Observations and the corresponding actions are stored
- A VAE is essentially an unsupervised way to learn your input domain
- The CNN VAE takes an environment observation frame as input
- The output is a compressed latent representation z, which stores the spatial representation of the current frame
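
As a rough illustration of how a VAE produces z, the sketch below shows the standard reparameterized sampling step and the KL term against a standard normal prior. It assumes the encoder outputs a mean and a log-variance per latent dimension; the function names are hypothetical and this is not the authors' code.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), element-wise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))
```

When the encoder output already matches the prior (zero mean, unit variance), the KL term is zero, which is the pull toward a Gaussian latent space.
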
- Uses the action together with the current z to predict the next z
    - the action is required because the current z alone might not be sufficient to predict the future
- Mixture Density Network (MDN) LSTM
- The RNN outputs a mixture of Gaussians
- The next z is sampled from this mixture
- Since the prior of the VAE is Gaussian too, an MDN is a natural fit
- The RNN captures temporal information
- This is an alternative to stacking past frames, as in other RL methods
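
Sampling the next z from a mixture of Gaussians can be sketched as follows. This is a minimal sketch assuming each latent dimension gets its own 1-D mixture described by weights `pi`, means `mu`, and standard deviations `sigma`; these names and the per-dimension layout are assumptions for illustration.

```python
import random


def sample_mdn(pi, mu, sigma, rng):
    """Sample one value from a 1-D Gaussian mixture."""
    # Pick a mixture component according to its weight, then sample from it.
    k = rng.choices(range(len(pi)), weights=pi, k=1)[0]
    return rng.gauss(mu[k], sigma[k])


def sample_next_z(mixtures, rng):
    """Sample every latent dimension from its own mixture."""
    return [sample_mdn(pi, mu, sigma, rng) for pi, mu, sigma in mixtures]
```

Each entry of `mixtures` is a `(pi, mu, sigma)` triple for one latent dimension, so a 32-dimensional z needs 32 such triples.
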
- A very compact C is trained
- a single layer neural network
- only 867 parameters
- Uses both z from V and h from M to predict actions
- The hard work of understanding the environment is already done by V and M
- C only needs to predict the action
- Separation of duties across the different components
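
The figure of 867 parameters is consistent with a single linear layer mapping the concatenation [z, h] to the action, using the z and h sizes listed in these notes. The action size of 3 is an assumption here (it is not stated in the notes), chosen because it reproduces the count.

```python
def controller_param_count(z_dim, h_dim, action_dim):
    """Weights plus biases of one linear layer from [z, h] to the action."""
    return (z_dim + h_dim) * action_dim + action_dim


# 32-dim z, 256 hidden units, assumed 3-dim action
print(controller_param_count(32, 256, 3))  # 867
```
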
- 10,000 random rollouts
- VAE
- 1 epoch
- 32-dimensional z
- RNN
- 20 epochs
- 256 hidden units
- The RNN learns temporal information; this idea should be useful in other RL methods too.
- Independent training of all three components
- therefore it is a POMDP
- therefore the Markov property does not hold, since the agent (C) sees only the compressed representation
- Independent training might not be the best way to move forward in RL
- Random exploration might not be able to collect good observations for complex games.
- The authors provide C with both z, the current representation, and h, the hidden state that is used to predict z
- so, in turn, don't z and h store largely the same information?
- see discussion here, along with lots of other ideas by the author