World Models


TLDR; The authors train a novel model-based Reinforcement Learning agent. They use a VAE (V) model to encode the state representation, an MDN-RNN (M) model to predict the future, and a CMA-ES-trained controller (C) to take actions in the environment. The three models are trained independently, without feedback from one another. The VAE learns to encode the 2D pixel frames of the game into a latent vector z. This latent vector, along with the corresponding action, is used to train the RNN, whose primary job is to predict the future state of the game through its hidden representation h. Finally, both the current spatial representation z and the temporal representation h are used by C to take actions in the environment.
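
As a minimal sketch of how the three parts interact at every control step (the objects V, M, C, their method names, and the classic Gym-style env API are assumptions made purely to illustrate the data flow, not the authors' code):

```python
import numpy as np

def control_step(obs, h, V, M, C, env):
    """One step of the V -> C -> env -> M loop (illustrative only)."""
    z = V.encode(obs)                      # V: raw frame -> latent vector z
    a = C.act(np.concatenate([z, h]))      # C: acts on current z and hidden state h
    obs, reward, done, _ = env.step(a)     # environment transition
    h = M.step(z, a, h)                    # M: update temporal state / predict next z
    return obs, h, reward, done
```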

Key Points

Data Generation

  • Random rollouts
    • an iterative training procedure can also be used if required by more complex environments
  • Observations and the corresponding actions are stored (see the sketch below)
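
A minimal sketch of the random-rollout collection, assuming the classic Gym API; the environment name and the buffer layout are illustrative choices, not the authors' exact code:

```python
import gym
import numpy as np

def collect_random_rollouts(env_name="CarRacing-v0", num_rollouts=10_000, max_steps=1000):
    """Roll out a random policy and store (observation, action) pairs per episode."""
    env = gym.make(env_name)
    dataset = []
    for _ in range(num_rollouts):
        obs, frames, actions = env.reset(), [], []
        for _ in range(max_steps):
            a = env.action_space.sample()      # random exploration policy
            frames.append(obs)
            actions.append(a)
            obs, _, done, _ = env.step(a)
            if done:
                break
        dataset.append((np.stack(frames), np.stack(actions)))
    return dataset
```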

VAE (V)

  • The VAE is essentially an unsupervised way to learn your input domain
  • CNN VAE uses environment observation frame as input
  • The output is a compressed latent representation z, which stores the spatial representation of the current frame (see the encoder sketch below)
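
A sketch of a convolutional VAE encoder compressing a 64x64x3 frame into a 32-dimensional z (the z size follows the notes; the specific layer widths and the use of PyTorch are illustrative assumptions):

```python
import torch
import torch.nn as nn

class ConvVAEEncoder(nn.Module):
    """Encode a 64x64 RGB frame into a sampled 32-dim latent z."""
    def __init__(self, z_dim=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2), nn.ReLU(),
            nn.Conv2d(128, 256, 4, stride=2), nn.ReLU(),
        )
        self.fc_mu = nn.Linear(2 * 2 * 256, z_dim)
        self.fc_logvar = nn.Linear(2 * 2 * 256, z_dim)

    def forward(self, x):                      # x: (B, 3, 64, 64)
        h = self.conv(x).flatten(start_dim=1)  # -> (B, 1024)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # Reparameterization trick: sample z while keeping gradients.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return z, mu, logvar
```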

MDN-RNN (M)

  • Uses the previous action and z to predict the future z
    • the action is required because the current z alone might not be sufficient to predict the future
  • Mixture Density Network (MDN) LSTM
    • the RNN outputs the parameters of a mixture of Gaussians
    • the new z is sampled from this mixture (see the sketch below)
    • since the VAE's prior is Gaussian too, an MDN output is a natural fit
  • The RNN captures temporal information
    • an alternative to stacking past frames, as in other RL methods
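
A sketch of the MDN-RNN idea: an LSTM reads [z_t, a_t] and its output parameterizes a mixture of Gaussians over the next z, from which a new z is sampled. The 32-dim z and 256 hidden units follow the notes; the number of mixture components, the 3-dim action, and the PyTorch code itself are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MDNRNN(nn.Module):
    """LSTM over [z, a] whose head outputs mixture-of-Gaussians parameters for the next z."""
    def __init__(self, z_dim=32, action_dim=3, hidden=256, n_mix=5):
        super().__init__()
        self.lstm = nn.LSTM(z_dim + action_dim, hidden, batch_first=True)
        # For each z dimension: n_mix mixture weights, means, and log-stddevs.
        self.head = nn.Linear(hidden, 3 * n_mix * z_dim)
        self.z_dim, self.n_mix = z_dim, n_mix

    def forward(self, z, a, state=None):        # z: (B, T, 32), a: (B, T, 3)
        out, state = self.lstm(torch.cat([z, a], dim=-1), state)
        pi, mu, log_sigma = self.head(out).chunk(3, dim=-1)
        shape = out.shape[:2] + (self.z_dim, self.n_mix)
        pi = torch.softmax(pi.view(shape), dim=-1)          # mixture weights
        return pi, mu.view(shape), log_sigma.view(shape), state

def sample_next_z(pi, mu, log_sigma, temperature=1.0):
    """Pick a mixture component per z dimension, then sample from its Gaussian."""
    k = torch.distributions.Categorical(pi).sample().unsqueeze(-1)
    mu_k = mu.gather(-1, k).squeeze(-1)
    sigma_k = log_sigma.gather(-1, k).squeeze(-1).exp() * temperature
    return mu_k + sigma_k * torch.randn_like(mu_k)
```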

Controller (C)

  • A very compact C is trained
    • a single-layer neural network
    • only 867 parameters
  • Uses both z from V and h from M to predict actions (see the sketch below)
  • The hard work of understanding the environment is already done by V and M
    • C only needs to predict the action
    • duties are separated across different components
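
A sketch of the controller as a single linear map over the concatenated [z, h]. With a 32-dim z, a 256-dim h, and 3 continuous actions (the CarRacing case), the weights plus biases give (32 + 256) * 3 + 3 = 867 parameters, matching the notes; the class below and its tanh squashing are illustrative assumptions, with CMA-ES applied to the flattened parameter vector:

```python
import numpy as np

class LinearController:
    """Single-layer controller: a = tanh(W [z; h] + b)."""
    def __init__(self, z_dim=32, h_dim=256, action_dim=3):
        self.W = np.zeros((action_dim, z_dim + h_dim))  # 3 x 288 = 864 weights
        self.b = np.zeros(action_dim)                   # + 3 biases = 867 parameters

    def act(self, z, h):
        return np.tanh(self.W @ np.concatenate([z, h]) + self.b)

    def set_flat_params(self, theta):
        """CMA-ES searches directly over this flat 867-dim vector."""
        n_w = self.W.size
        self.W = theta[:n_w].reshape(self.W.shape)
        self.b = theta[n_w:]
```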

Experiments

  • 10,000 random rollouts
  • VAE
    • 1 epoch
    • 32 z dim
  • RNN
    • 20 epochs
    • 256 hidden units

Thoughts

  • The RNN learns temporal information; this should be used in other RL methods too.
  • Independent training of all 3
    • therefore it becomes a POMDP
    • the Markov property does not hold, since the agent (C) sees only a compressed representation
  • Independent training might not be the best way to move forward in RL
  • Random exploration might not be able to collect good observations for complex games.
  • The authors provide z, the current representation, and h, the hidden state which predicts z
    • so, in turn, don't both z and h store the same information?
    • see discussion here, along with lots of other ideas by the author