Deep reinforcement learning, referred to as just reinforcement learning (RL) from now on, is a class of methods in the larger field of deep learning that lets an artificial intelligence agent explore its interactions with a surrounding environment. While doing this, the agent receives reward signals for its actions and tries to discern which actions contribute to higher rewards, in order to adapt its behavior accordingly. RL has been very successful at playing games such as Go {cite}`silver2017mastering`, and it shows promise for engineering applications such as robotics.
The setup for RL generally consists of two parts: the environment and the agent. The environment receives actions from the agent and in turn supplies it with observations in the form of states and rewards.
Figure (rl-overview): Reinforcement learning is formulated in terms of an environment that gives observations in the form of states and rewards to an agent. The agent interacts with the environment by performing actions.
In its simplest form, the learning goal for reinforcement learning tasks can be formulated as
$$ \text{arg max}_{\theta} \mathbb{E}_{a \sim \pi(;s,\theta_p)} \big[ \sum_t r_t \big], $$ (rl-learn-l2)
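where the reward at time $t$ is denoted by $r_t$. Actions $a$ are drawn from the policy $\pi$, which is conditioned on the state $s$ of the environment and on the network parameters $\theta$.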
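In practice, the expectation in this objective is approximated from sampled trajectories. The following is a minimal sketch of that estimate in plain Python; the function name and the reward values are made up for illustration.

```python
def estimate_expected_return(trajectory_rewards):
    """Average the summed rewards over a set of sampled trajectories."""
    returns = [sum(rewards) for rewards in trajectory_rewards]
    return sum(returns) / len(returns)


# Two sampled trajectories with three time steps each (illustrative values)
sampled_rewards = [[1.0, 0.0, 2.0], [0.5, 0.5, 0.0]]
estimate = estimate_expected_return(sampled_rewards)
```

Training then amounts to changing the policy parameters so that this estimate grows.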
During the learning process, the central aim of RL is to use the combined information of states, actions, and corresponding rewards to increase the cumulative reward over each trajectory. To achieve this goal, multiple algorithms have been proposed, which can be roughly divided into two larger classes: policy gradient and value-based methods {cite}`sutton2018rl`.
In vanilla policy gradient methods, the trained neural networks directly select actions $a$ from environment observations.
Value-based methods, such as Q-Learning, on the other hand work by optimizing a state-action value function, the so-called Q-Function. The network in this case receives state $s$ and action $a$ and predicts the expected cumulative reward for this pair, $Q(s, a)$.
In addition, actor-critic methods combine elements from both approaches. Here, the actions generated by a policy network are rated based on a corresponding change in state potential. These values are given by another neural network and approximate the expected cumulative reward from the given state. Proximal policy optimization (PPO) {cite}`schulman2017proximal` is one example from this class of algorithms and is our choice for the example task of this chapter, which is controlling Burgers' equation as a physical environment.
As PPO methods are an actor-critic approach, we need to train two interdependent networks: the actor and the critic. The objective of the actor inherently depends on the output of the critic network (it provides feedback on which actions are worth performing), and likewise the critic depends on the actions generated by the actor network (this determines which states to explore).
This interdependence can promote instabilities, e.g., because strongly over- or underestimated state values can give wrong impulses during learning. Actions yielding higher rewards often also contribute to reaching states with higher informational value. As a consequence, when the possibly incorrect value estimates of individual samples are allowed to affect the agent's behavior without restriction, the learning progress can collapse.
PPO was introduced as a method to specifically counteract this problem. The idea is to restrict the influence that individual state value estimates can have on the change of the actor's behavior during learning. PPO is a popular choice especially when working on continuous action spaces. This can be attributed to the fact that it tends to achieve good results with a stable learning progress, while still being comparatively easy to implement.
More specifically, we will use the algorithm PPO-clip {cite}`schulman2017proximal`. This PPO variant sets a hard limit for the change in behavior caused by single update steps. As such, the algorithm uses a previous network state (denoted by a subscript $p$, as in $\theta_p$) as the reference against which the change of the policy is measured.
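The actor computes a policy function returning the probability distribution for the actions conditioned by the current network parameters $\theta$ and a state $s$. We denote the probability of choosing a specific action $a$ by $\pi(a; s, \theta)$. With $\epsilon$ bounding the deviation from the previous policy and $A(s, a; \phi)$ denoting an estimate of the advantage of performing $a$ in state $s$, the objective of the actor is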
$$\begin{aligned} \text{arg max}_{\theta} \mathbb{E}_{a \sim \pi(;s,\theta_p)} \Big[ \text{min} \big( \frac{\pi(a;s,\theta)}{\pi(a;s,\theta_p)} A(s, a; \phi), \text{clip}(\frac{\pi(a;s,\theta)}{\pi(a;s,\theta_p)}, 1-\epsilon, 1+\epsilon) A(s, a; \phi) \big) \Big] \end{aligned}$$
As the actor network is trained to provide the expected value, at training time an additional standard deviation is used to sample values from a Gaussian distribution around this mean. It is decreased over the course of the training, and at inference time we only evaluate the mean (i.e. a distribution with variance 0).
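As a minimal sketch of the clipped actor objective, the following plain-Python function evaluates it for a batch of samples. It assumes that the probability ratios $\pi(a;s,\theta)/\pi(a;s,\theta_p)$ and the advantage estimates have already been computed; the function name, the sample values, and the default `epsilon=0.2` are illustrative assumptions rather than settings taken from this chapter.

```python
def ppo_clip_objective(ratios, advantages, epsilon=0.2):
    """Batch mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    total = 0.0
    for ratio, advantage in zip(ratios, advantages):
        # Restrict the probability ratio to the interval [1 - eps, 1 + eps]
        clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        total += min(ratio * advantage, clipped_ratio * advantage)
    return total / len(ratios)


# Illustrative probability ratios and advantage estimates for three samples
ratios = [1.5, 0.5, 1.0]
advantages = [2.0, -1.0, 0.5]
objective = ppo_clip_objective(ratios, advantages)
```

For positive advantages the objective stops growing once the ratio exceeds $1+\epsilon$, and for negative advantages once it falls below $1-\epsilon$, so a single update gains nothing from moving the policy further away from its previous state.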
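The critic is represented by a value function $V(s; \phi)$ that predicts the expected cumulative reward to be received from state $s$. Its objective is to minimize the squared advantage $A$: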
$$\begin{aligned} \text{arg min}_{\phi}\mathbb{E}_{a \sim \pi(;s,\theta_p)}[A(s, a; \phi)^2] \ , \end{aligned}$$
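where the advantage function $A(s, a; \phi)$ builds upon the value estimates of the critic. We use generalized advantage estimation (GAE) {cite}`schulman2015high` to compute $A$.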
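As a minimal sketch of how an advantage estimate can be computed from rewards and critic values, the following uses the standard recursion of generalized advantage estimation. In its commonly stated form, which may differ in detail from the variant used for this chapter, it reads

$$\begin{aligned} A_t = \sum_{i=0}^{n-t-1} (\gamma\lambda)^i \delta_{t+i}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \end{aligned}$$

with trajectory length $n$ and two hyperparameters $\gamma$ and $\lambda$. The function name and the default values below are assumptions for illustration.

```python
def generalized_advantage_estimates(rewards, values, gamma=0.99, lam=0.95):
    """Standard GAE recursion; `values` holds V(s_0), ..., V(s_n), one more than rewards."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # Temporal-difference residual of the critic at step t
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # A_t = delta_t + gamma * lambda * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# With gamma = lambda = 1 and a critic that predicts zero everywhere,
# the advantage reduces to the sum of the remaining rewards.
rewards = [1.0, 1.0, 1.0]
values = [0.0, 0.0, 0.0, 0.0]
advantages = generalized_advantage_estimates(rewards, values, gamma=1.0, lam=1.0)
```

Smaller values of $\gamma$ and $\lambda$ reduce the influence of rewards and value predictions that lie further in the future.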
Reinforcement learning is widely used for trajectory optimization with multiple decision problems building upon one another. However, in the context of physical systems and PDEs, reinforcement learning algorithms are likewise attractive. In this setting, they can operate in a fashion that's similar to supervised single shooting approaches by generating full trajectories and learning by comparing the final approximation to the target.
Still, the approaches differ in terms of how this optimization is performed. For example, reinforcement learning algorithms like PPO try to explore the action space during training by adding a random offset to the actions selected by the actor. This way, the algorithm can discover new behavioral patterns that are more refined than the previous ones.
The way long-term effects of generated forces are taken into account can also differ for physical systems. In a control force estimator setup with a differentiable physics (DP) loss, as discussed e.g. in {doc}`diffphys-code-burgers`, these dependencies are handled by passing the loss gradient through the simulation step back into previous time steps. In contrast, reinforcement learning usually treats the environment as a black box without gradient information. When using PPO, the value estimator network is instead used to track the long-term dependencies by predicting the influence any action has on the future system evolution.
Working with Burgers' equation as physical environment, the trajectory generation process can be summarized as follows. It shows how the simulation steps of the environment and the neural network evaluations of the agent are interleaved:
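The $*$ superscript (as usual) denotes a reference or target quantity, and hence here $\mathbf{u}^*$ denotes a velocity target. For the continuous action space of the PDE, the actor outputs the mean of a Gaussian distribution from which the forces are sampled during training.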
The reward is calculated in a similar fashion as the loss in the DP approach: it consists of two parts, one of which amounts to the negative square norm of the applied forces and is given at every time step. The other part adds a punishment proportional to the distance between the final approximation and the target state at the end of each trajectory.
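The interleaving of agent and environment, together with the two reward parts, can be sketched as follows. This is only an illustration under several assumptions: `burgers_step` is a simple explicit finite-difference stand-in for the actual simulation, `actor_mean` is a hypothetical placeholder for the actor network, and the squared $L^2$ distance with unit weight is an assumed choice for the final penalty.

```python
import random


def burgers_step(u, forces, dt=0.01, dx=0.1, viscosity=0.01):
    """One explicit finite-difference step of forced 1D Burgers' equation (periodic)."""
    n = len(u)
    new_u = []
    for i in range(n):
        left, right = u[i - 1], u[(i + 1) % n]
        advection = u[i] * (right - left) / (2.0 * dx)
        diffusion = viscosity * (right - 2.0 * u[i] + left) / dx ** 2
        new_u.append(u[i] + dt * (diffusion - advection + forces[i]))
    return new_u


def actor_mean(u, u_target):
    """Hypothetical placeholder for the actor network: mean force per cell."""
    return [target - value for value, target in zip(u, u_target)]


def rollout(u0, u_target, steps=8, noise_std=0.1, rng=None):
    rng = rng or random.Random(0)
    u, rewards = list(u0), []
    for _ in range(steps):
        mean = actor_mean(u, u_target)                    # agent: policy evaluation
        forces = [rng.gauss(m, noise_std) for m in mean]  # exploration around the mean
        u = burgers_step(u, forces)                       # environment: simulation step
        rewards.append(-sum(f * f for f in forces))       # force penalty at every step
    # Penalty for the remaining distance to the target, added at the last step
    rewards[-1] -= sum((a - b) ** 2 for a, b in zip(u, u_target))
    return u, rewards


final_state, rewards = rollout(u0=[0.0, 1.0, 0.0, -1.0], u_target=[0.0, 0.0, 0.0, 0.0])
```

Note that the rollout never differentiates through `burgers_step`: the agent only sees the resulting states and rewards.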
In the following, we'll describe a way to implement a PPO-based RL training for physical systems. This implementation is also the basis for the notebook of the next section, i.e., {doc}`reinflearn-code`. While this notebook provides a practical example, and an evaluation in comparison to DP training, we'll first give a more generic overview below.
To train a reinforcement learning agent to control a PDE-governed system, the physical model has to be formalized as an RL environment. The stable-baselines3 framework, which we use in the following to implement a PPO training, uses a vectorized version of the OpenAI gym environment. This way, rollout collection can be performed on multiple trajectories in parallel for better resource utilization and wall time efficiency.
Vectorized environments require a definition of observation and action spaces, meaning the in- and output spaces of the agent policy. In our case, the former consists of the current physical states and the goal states, e.g., velocity fields, stacked along their channel dimension. Another channel is added for the elapsed time since the start of the simulation divided by the total trajectory length. The action space (the output) encompasses one force value for each cell of the velocity field.
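A minimal sketch of how such an observation could be assembled for a single instance is shown below. The function name and the channel-last list layout are assumptions for illustration; an actual implementation would typically work on batched arrays.

```python
def build_observation(velocity, goal, step, total_steps):
    """Stack state, goal, and normalized elapsed time as three channels per cell."""
    time_fraction = step / total_steps
    return [[u, g, time_fraction] for u, g in zip(velocity, goal)]


velocity = [0.5, -0.2, 0.1]
goal = [0.0, 0.0, 0.3]
observation = build_observation(velocity, goal, step=8, total_steps=32)

# The action holds one force value per cell of the velocity field
action_size = len(velocity)
```

The time channel carries the same value in every cell, so the networks can tell how many steps remain to reach the goal.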
The most relevant methods of vectorized environments are `reset`, `step_async`, `step_wait` and `render`. The first of these is used to start a new trajectory by computing initial and goal states and returning the first observation for each vectorized instance. As these instances in other applications are not bound to finish trajectories synchronously, `reset` has to be called from within the environment itself when entering a terminal state. `step_async` and `step_wait` are the two main parts of the `step` method, which takes actions, applies them to the velocity fields and performs one iteration of the physics models. The split into async and wait enables supporting vectorized environments that run each instance on separate threads. However, this is not required in our approach, as phiflow handles the simulation of batches internally. The `render` method is called to display training results, showing reconstructed trajectories in real time or rendering them to files.
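The roles of these methods can be sketched with a plain-Python stand-in. The class below deliberately does not subclass the vectorized environment base class of stable-baselines3, which requires further methods, and its "physics" is a placeholder that simply adds the forces to the state. All names other than the four method names above are assumptions for illustration.

```python
class ToyBatchedEnv:
    """Stand-in mirroring the method names of a vectorized environment."""

    def __init__(self, num_envs, num_cells, trajectory_length):
        self.num_envs = num_envs
        self.num_cells = num_cells
        self.trajectory_length = trajectory_length
        self._pending_actions = None
        self.reset()

    def reset(self):
        # Start new trajectories for all instances and return the first observations
        self.step_count = 0
        self.states = [[0.0] * self.num_cells for _ in range(self.num_envs)]
        return self.states

    def step_async(self, actions):
        # Only store the actions; the batched update happens in step_wait
        self._pending_actions = actions

    def step_wait(self):
        # Apply the stored forces to all instances at once (placeholder physics)
        self.states = [[u + f for u, f in zip(state, forces)]
                       for state, forces in zip(self.states, self._pending_actions)]
        self.step_count += 1
        rewards = [-sum(f * f for f in forces) for forces in self._pending_actions]
        done = self.step_count == self.trajectory_length
        observations = self.states
        if done:
            # Terminal state reached: the environment resets itself
            observations = self.reset()
        infos = [{} for _ in range(self.num_envs)]
        return observations, rewards, [done] * self.num_envs, infos

    def step(self, actions):
        self.step_async(actions)
        return self.step_wait()

    def render(self):
        # Placeholder: print the states instead of plotting trajectories
        print(self.states)


env = ToyBatchedEnv(num_envs=2, num_cells=3, trajectory_length=2)
obs, rewards, dones, infos = env.step([[0.1, 0.0, 0.0], [0.0, 0.2, 0.0]])
```

Because all instances advance in lockstep here, they also reach their terminal state together, and a single internal `reset` call covers the whole batch.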
Because of the strongly differing output spaces of the actor and critic networks, we use different architectures for each of them. The network yielding the actions uses a variant of the network architecture from Holl et al. {cite}`holl2019pdecontrol`, in line with the
In the example implementation of the next chapter, the `BurgersTraining` class manages all aspects of this training internally, including the setup of agent and environment and storing trained models and monitor logs to disk. It also includes a variant of the Burgers' equation environment described above, which, instead of computing random trajectories, uses data from a predefined set. During training, the agent is evaluated in this environment at regular intervals to be able to compare the training progress to the DP method more accurately.
The next chapter will use this `BurgersTraining` class to run a full PPO scenario, evaluate its performance, and compare it to an approach that uses more domain knowledge of the physical system, i.e., the gradient-based control training with a DP approach.