
Implementation


This page covers the implementation details of the project.

Running the controller

The controller can be run in two modes:

  • "training" mode
  • "eval" mode

The mode is selected by the argument passed to the controller. The controller loads an existing model if one is available, or creates a new, randomly initialized one otherwise.

In eval mode, the controller simply uses the model to control the robot indefinitely.
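As a rough sketch, the mode selection might look like the following; the argument values and the commented-out helper calls are illustrative stand-ins, not the controller's actual API:

```cpp
// Minimal sketch of the mode selection, assuming the mode is passed as the
// first command-line argument; helper names are hypothetical.
#include <cstring>
#include <iostream>

int main(int argc, char** argv) {
    if (argc < 2) {
        std::cerr << "usage: controller [training|eval]" << std::endl;
        return 1;
    }

    // loadOrCreateModel() stands in for the step that loads an existing
    // model from disk or initializes a new one with random weights.
    // Model model = loadOrCreateModel();

    if (std::strcmp(argv[1], "training") == 0) {
        // runTrainingPipeline(model);   // rollouts + network updates
    } else {
        // runEvaluation(model);         // drive the robot indefinitely
    }
    return 0;
}
```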

In training mode, the whole training pipeline is activated. The robot performs a series of trials (referred to as "rollouts" in the code); during each trial, experiences are stored as "episodes" at every time step and used at the end of the trial to train both the policy network and the value function estimator's network. At each time step, the following actions are taken:

  1. The state of the robot is measured.
  2. The associated reward is computed.
  3. The policy is fed the current state and produces an action.
  4. The generated action is sent to the motors.
  5. The state and action are stored in an "episode".
  6. If the timeout is reached or a collision is detected, the trial is terminated, the last pieces of information are stored, and the networks are trained on the data collected during the trial.
  7. If the trial was terminated in step 6, the simulation is reset, the robot is moved back to a predefined location in the arena, and a new trial starts.

Training lasts for the number of trials defined in vpg_parameters.hpp.
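The sketch below ties these steps together. All types and helper functions (State, Action, Episode, readSensors, ...) are illustrative stand-ins rather than the project's actual classes, the trial count stands in for the value defined in vpg_parameters.hpp, and the bookkeeping of which reward is stored with which step is simplified (see the Episodes section below):

```cpp
// Hedged sketch of the training rollout loop; the structure mirrors the
// numbered steps above, the names do not come from the code base.
#include <vector>

struct State  { std::vector<double> values; };       // IR readings + wheel speeds
struct Action { double left = 0.0, right = 0.0; };   // normalized wheel commands

struct Episode {                                      // one time step of the trajectory
    State state; Action action; State nextState; double reward;
};

// Stubs standing in for the simulator / robot interface.
State  readSensors()                   { return {}; }
void   sendToMotors(const Action&)     {}
void   resetSimulation()               {}
bool   collisionDetected(const State&) { return false; }
double computeReward(const State&, const State&) { return 0.0; }

struct Network {                                      // placeholder for policy / value nets
    Action sample(const State&)               { return {}; }
    void   train(const std::vector<Episode>&) {}
};

int main() {
    const int NB_TRIALS = 100;   // assumed to come from vpg_parameters.hpp
    const int MAX_STEPS = 1000;  // per-trial timeout (illustrative)
    Network policy, valueEstimator;

    for (int trial = 0; trial < NB_TRIALS; ++trial) {
        std::vector<Episode> trajectory;
        resetSimulation();                                // 7. robot at predefined pose
        State state = readSensors();                      // 1. measure the state
        for (int step = 0; step < MAX_STEPS; ++step) {    // 6. timeout bound
            Action action = policy.sample(state);         // 3. policy produces an action
            sendToMotors(action);                         // 4. send it to the motors
            State  next   = readSensors();
            double reward = computeReward(state, next);   // 2. reward from previous/current state
            trajectory.push_back({state, action, next, reward}); // 5. store the episode
            if (collisionDetected(next)) break;           // 6. collision ends the trial
            state = next;
        }
        policy.train(trajectory);                         // 6. train on the collected data
        valueEstimator.train(trajectory);
    }
    return 0;
}
```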

Implementation details

State

The state of the robot is defined by the readings of the eight IR sensors surrounding it (normalized to [0,1], where 0 means no obstacle and 1 means an obstacle is close) together with the current wheel velocities (normalized to [0,1], where 0 means maximum negative speed, 1 maximum positive speed, and 0.5 stopped).

Note that the interface implemented to drive the robot expects normalized commands in [-1,1], not [0,1].
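The following sketch illustrates these normalization conventions; the raw sensor range, constants, and function names are assumptions for illustration, not taken from the code base:

```cpp
// Normalization conventions, written out for clarity.
#include <algorithm>

constexpr double MAX_IR_READING = 4000.0;  // assumed raw sensor range (illustrative)

// IR reading -> [0, 1]: 0 = no obstacle, 1 = obstacle very close.
double normalizeIr(double raw) {
    return std::clamp(raw / MAX_IR_READING, 0.0, 1.0);
}

// Wheel velocity in [-maxSpeed, +maxSpeed] -> [0, 1]:
// 0 = max negative speed, 0.5 = stopped, 1 = max positive speed.
double normalizeWheel(double velocity, double maxSpeed) {
    return (velocity / maxSpeed + 1.0) / 2.0;
}

// Policy output in [0, 1] -> motor command in [-1, 1], which is what the
// robot-driving interface expects.
double toMotorCommand(double policyOutput) {
    return 2.0 * policyOutput - 1.0;
}
```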

Reward

The reward is computed from the previous and current state of the robot. It encourages the robot to drive forward and to avoid obstacles. Currently, a continuous reward is given while the robot is driving forward and no obstacles are detected. The robot also receives positive feedback for clearing an obstacle and negative feedback when hitting something. Hitting an obstacle or a wall terminates the trial.

Note that the current reward should be improved: it encourages the robot to oscillate near obstacles in order to accumulate more reward by repeatedly clearing them!
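A reward of this shape could look roughly like the sketch below; the thresholds and magnitudes are illustrative placeholders, not the project's actual formula:

```cpp
// Hedged sketch of a forward-driving / obstacle-clearing reward.
#include <algorithm>
#include <vector>

double computeReward(const std::vector<double>& irPrev,  // normalized IR, previous step
                     const std::vector<double>& irCurr,  // normalized IR, current step
                     double forwardSpeed,                 // in [0, 1], 0.5 = stopped
                     bool collision) {
    const double OBSTACLE_THRESHOLD = 0.3;    // "an obstacle is detected" (assumed)
    const double CLEAR_BONUS        = 1.0;    // reward for clearing an obstacle (assumed)
    const double COLLISION_PENALTY  = -10.0;  // penalty for hitting something (assumed)

    if (collision) return COLLISION_PENALTY;  // trial ends here

    double maxPrev = *std::max_element(irPrev.begin(), irPrev.end());
    double maxCurr = *std::max_element(irCurr.begin(), irCurr.end());

    // An obstacle was detected before and is gone now: the robot "cleared" it.
    if (maxPrev > OBSTACLE_THRESHOLD && maxCurr <= OBSTACLE_THRESHOLD)
        return CLEAR_BONUS;

    // No obstacle in sight: continuous reward proportional to forward motion.
    if (maxCurr <= OBSTACLE_THRESHOLD)
        return std::max(0.0, forwardSpeed - 0.5);

    return 0.0;
}
```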

Network

There are actually two feed-forward, fully connected networks:

  • Policy (composed of three layers of 8, 4 and 2 neurons respectively, with ReLU, ReLU and sigmoid activations)
  • Value function estimator (composed of three layers of 5, 3 and 1 neurons respectively, with ReLU, ReLU and linear activations)

The policy is responsible for choosing actions as a function of the state, while the value estimator estimates the value of that same state. The two networks support each other during training.
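For illustration, the two shapes can be written down with a tiny hand-rolled fully connected layer. Only the layer sizes and activations come from the description above; the input dimension (8 IR readings plus 2 wheel velocities) and everything else in this sketch are assumptions, not the project's actual network code:

```cpp
// Sketch of the two network shapes: policy 8-4-2 (ReLU, ReLU, sigmoid)
// and value estimator 5-3-1 (ReLU, ReLU, linear).
#include <algorithm>
#include <cmath>
#include <vector>

using Vec = std::vector<double>;

enum class Activation { ReLU, Sigmoid, Linear };

struct Layer {
    int in, out;
    Activation act;
    std::vector<Vec> w;  // out x in weights
    Vec b;               // out biases

    Layer(int in_, int out_, Activation a)
        : in(in_), out(out_), act(a), w(out_, Vec(in_, 0.0)), b(out_, 0.0) {}

    Vec forward(const Vec& x) const {
        Vec y(out, 0.0);
        for (int i = 0; i < out; ++i) {
            double s = b[i];
            for (int j = 0; j < in; ++j) s += w[i][j] * x[j];
            switch (act) {
                case Activation::ReLU:    y[i] = std::max(0.0, s); break;
                case Activation::Sigmoid: y[i] = 1.0 / (1.0 + std::exp(-s)); break;
                case Activation::Linear:  y[i] = s; break;
            }
        }
        return y;
    }
};

int main() {
    const int STATE_DIM = 10;  // assumed: 8 IR readings + 2 wheel velocities

    // Policy: layers of 8, 4 and 2 neurons with ReLU, ReLU, sigmoid.
    std::vector<Layer> policy = {
        {STATE_DIM, 8, Activation::ReLU},
        {8, 4, Activation::ReLU},
        {4, 2, Activation::Sigmoid},  // two wheel commands in [0, 1]
    };

    // Value estimator: layers of 5, 3 and 1 neurons with ReLU, ReLU, linear.
    std::vector<Layer> value = {
        {STATE_DIM, 5, Activation::ReLU},
        {5, 3, Activation::ReLU},
        {3, 1, Activation::Linear},   // scalar state value
    };

    Vec state(STATE_DIM, 0.0);
    Vec action = state;
    for (const Layer& l : policy) action = l.forward(action);
    Vec v = state;
    for (const Layer& l : value)  v = l.forward(v);
    return 0;
}
```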

Episodes

The data gathered during a trial is stored as a series of episodes, which together form the trajectory of the robot (not its physical trajectory!). An episode is a simple structure storing, for a given time t, the state and the action taken, as well as the state and reward perceived at time t+1. The episodes are stored and then used to train the networks.
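A minimal sketch of such a structure, with illustrative field names (not the actual ones used in the code):

```cpp
// One "episode" = one time step of the recorded trajectory.
#include <vector>

struct Episode {
    std::vector<double> state;      // state at time t (8 IR readings + wheel speeds)
    std::vector<double> action;     // action taken at time t
    std::vector<double> nextState;  // state perceived at time t + 1
    double reward;                  // reward perceived at time t + 1
};

// A trial then produces an ordered list of episodes that the policy and
// value networks are trained on.
using Trajectory = std::vector<Episode>;
```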
