diff --git a/benchmark/ddpg/README.md b/benchmark/ddpg/README.md
new file mode 100644
index 000000000..3e113c310
--- /dev/null
+++ b/benchmark/ddpg/README.md
@@ -0,0 +1,29 @@
+# Deep Deterministic Policy Gradient Benchmark
+
+This repository contains instructions to reproduce our DDPG experiments.
+
+## Install CleanRL
+
+Prerequisites:
+* Python 3.8+
+* [Poetry](https://python-poetry.org)
+
+Install dependencies:
+
+```bash
+git clone https://github.com/vwxyzjn/cleanrl.git && cd cleanrl
+git checkout v0.6.0 # pin the version
+poetry install
+poetry install -E pybullet
+poetry install -E mujoco
+```
+
+## Reproduce CleanRL's DDPG Benchmark
+
+Follow the commands below. Note that you may need to replace `--wandb-entity openrlbenchmark` with your own W&B entity.
+
+```bash
+# reproduce the MuJoCo experiments
+bash benchmark/ddpg/mujoco.sh
+```
+
diff --git a/benchmark/ddpg/mujoco.sh b/benchmark/ddpg/mujoco.sh
new file mode 100644
index 000000000..a7d03f4fa
--- /dev/null
+++ b/benchmark/ddpg/mujoco.sh
@@ -0,0 +1,14 @@
+# HalfCheetah-v2
+poetry run python cleanrl/ddpg_continuous_action.py --env-id HalfCheetah-v2 --track --capture-video --seed 1 --wandb-project-name cleanrl --wandb-entity openrlbenchmark
+poetry run python cleanrl/ddpg_continuous_action.py --env-id HalfCheetah-v2 --track --capture-video --seed 2 --wandb-project-name cleanrl --wandb-entity openrlbenchmark
+poetry run python cleanrl/ddpg_continuous_action.py --env-id HalfCheetah-v2 --track --capture-video --seed 3 --wandb-project-name cleanrl --wandb-entity openrlbenchmark
+
+# Walker2d-v2
+poetry run python cleanrl/ddpg_continuous_action.py --env-id Walker2d-v2 --track --capture-video --seed 1 --wandb-project-name cleanrl --wandb-entity openrlbenchmark
+poetry run python cleanrl/ddpg_continuous_action.py --env-id Walker2d-v2 --track --capture-video --seed 2 --wandb-project-name cleanrl --wandb-entity openrlbenchmark
+poetry run python cleanrl/ddpg_continuous_action.py --env-id Walker2d-v2 --track --capture-video --seed 3 --wandb-project-name cleanrl --wandb-entity openrlbenchmark
+
+# Hopper-v2
+poetry run python cleanrl/ddpg_continuous_action.py --env-id Hopper-v2 --track --capture-video --seed 1 --wandb-project-name cleanrl --wandb-entity openrlbenchmark
+poetry run python cleanrl/ddpg_continuous_action.py --env-id Hopper-v2 --track --capture-video --seed 2 --wandb-project-name cleanrl --wandb-entity openrlbenchmark
+poetry run python cleanrl/ddpg_continuous_action.py --env-id Hopper-v2 --track --capture-video --seed 3 --wandb-project-name cleanrl --wandb-entity openrlbenchmark
diff --git a/cleanrl/ddpg_continuous_action.py b/cleanrl/ddpg_continuous_action.py
index de17af781..5486542f0 100644
--- a/cleanrl/ddpg_continuous_action.py
+++ b/cleanrl/ddpg_continuous_action.py
@@ -1,3 +1,6 @@
+# docs and experiment results can be found at
+# https://docs.cleanrl.dev/rl-algorithms/ddpg/#ddpg_continuous_actionpy
+
 import argparse
 import os
 import random
diff --git a/docs/rl-algorithms/ddpg.md b/docs/rl-algorithms/ddpg.md
new file mode 100644
index 000000000..0a5721105
--- /dev/null
+++ b/docs/rl-algorithms/ddpg.md
@@ -0,0 +1,195 @@
+# Deep Deterministic Policy Gradient (DDPG)
+
+
+## Overview
+
+DDPG is a popular DRL algorithm for continuous control. It extends the ideas behind DQN to continuous action spaces by learning a deterministic policy (an actor) alongside a Q-function (a critic). Because it is off-policy and reuses experience from a replay buffer, it has good sample efficiency compared to on-policy algorithms such as PPO.
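+
+Concretely, DDPG learns a deterministic actor $\mu(s \mid \theta^{\mu})$ and a critic $Q(s, a \mid \theta^{Q})$, each paired with a slowly updated target network. The critic is trained to minimize the temporal difference error against targets computed with the target networks (Lillicrap et al., 2016, Algorithm 1)[^1]:
+
+$$ y_i = r_i + \gamma Q'\left(s_{i+1}, \mu'\left(s_{i+1} \mid \theta^{\mu'}\right) \mid \theta^{Q'}\right), \qquad L = \frac{1}{N} \sum_i \left(y_i - Q\left(s_i, a_i \mid \theta^{Q}\right)\right)^{2}, $$
+
+while the actor is updated by following the deterministic policy gradient (reproduced in the metrics section below).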
+
+Original paper:
+
+* [Continuous control with deep reinforcement learning](https://arxiv.org/abs/1509.02971)
+
+Reference resources:
+
+* :material-github: [sfujim/TD3](https://github.com/sfujim/TD3)
+
+## Implemented Variants
+
+
+| Variants Implemented | Description |
+| ----------- | ----------- |
+| :material-github: [`ddpg_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ddpg_continuous_action.py), :material-file-document: [docs](/rl-algorithms/ddpg/#ddpg_continuous_actionpy) | For continuous action spaces; also implements MuJoCo-specific code-level optimizations |
+
+
+Below is our single-file implementation of DDPG:
+
+## `ddpg_continuous_action.py`
+
+The [ddpg_continuous_action.py](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ddpg_continuous_action.py) has the following features:
+
+* For continuous action spaces; also implements MuJoCo-specific code-level optimizations
+* Works with the `Box` observation space of low-level features
+* Works with the `Box` (continuous) action space
+
+### Usage
+
+```bash
+poetry install
+poetry install -E pybullet
+python cleanrl/ddpg_continuous_action.py --help
+python cleanrl/ddpg_continuous_action.py --env-id HopperBulletEnv-v0
+poetry install -E mujoco # only works on Linux
+python cleanrl/ddpg_continuous_action.py --env-id Hopper-v3
+```
+
+### Explanation of the logged metrics
+
+Running `python cleanrl/ddpg_continuous_action.py` will automatically record various metrics, such as the losses, in TensorBoard. Below is the documentation for these metrics:
+
+* `charts/episodic_return`: episodic return of the game
+* `charts/SPS`: number of steps per second
+* `losses/qf1_loss`: the MSE between the Q values at timestep $t$ and the target Q values at timestep $t+1$; minimizing it reduces the temporal difference error (see the sketch below).
+* `losses/actor_loss`: implemented as `-qf1(data.observations, actor(data.observations)).mean()`; it is the *negative* average Q value calculated from 1) the observations and 2) the actions computed by the actor for those observations. By minimizing `actor_loss`, the optimizer updates the actor's parameters using the following gradient (Lillicrap et al., 2016, Algorithm 1)[^1]:
+
+$$ \nabla_{\theta^{\mu}} J \approx \frac{1}{N}\sum_i\left.\left.\nabla_{a} Q\left(s, a \mid \theta^{Q}\right)\right|_{s=s_{i}, a=\mu\left(s_{i}\right)} \nabla_{\theta^{\mu}} \mu\left(s \mid \theta^{\mu}\right)\right|_{s_{i}} $$
+
+* `losses/qf1_values`: implemented as `qf1(data.observations, data.actions).view(-1)`; it is the average Q value of the data sampled from the replay buffer; useful for gauging whether under- or over-estimation happens
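+
+For reference, below is a condensed sketch of how these quantities can be computed from a batch sampled from the replay buffer. It assumes `qf1`/`actor` networks with target copies `qf1_target`/`target_actor` (matching the architecture shown in the next section) and a batch `data` with `observations`, `actions`, `rewards`, `dones`, and `next_observations` fields; it is an illustration, not a verbatim excerpt of [`ddpg_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ddpg_continuous_action.py).
+
+```python
+import torch
+import torch.nn.functional as F
+
+
+def ddpg_losses(qf1, qf1_target, actor, target_actor, data, gamma=0.99):
+    # Bellman target: r + gamma * (1 - done) * Q_target(s', mu_target(s'))
+    with torch.no_grad():
+        next_actions = target_actor(data.next_observations)
+        qf1_next_target = qf1_target(data.next_observations, next_actions).view(-1)
+        next_q_value = data.rewards.flatten() + (1 - data.dones.flatten()) * gamma * qf1_next_target
+
+    # losses/qf1_values and losses/qf1_loss
+    qf1_a_values = qf1(data.observations, data.actions).view(-1)
+    qf1_loss = F.mse_loss(qf1_a_values, next_q_value)
+
+    # losses/actor_loss: maximize Q(s, mu(s)) by minimizing its negative
+    actor_loss = -qf1(data.observations, actor(data.observations)).mean()
+    return qf1_loss, actor_loss, qf1_a_values
+```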
+
+### Implementation details
+
+Our [`ddpg_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ddpg_continuous_action.py) is based on [`OurDDPG.py`](https://github.com/sfujim/TD3/blob/master/OurDDPG.py) from :material-github: [sfujim/TD3](https://github.com/sfujim/TD3), which presents the following implementation differences from (Lillicrap et al., 2016)[^1]:
+
+1. [`ddpg_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ddpg_continuous_action.py) uses a Gaussian exploration noise $\mathcal{N}(0, 0.1)$, while (Lillicrap et al., 2016)[^1] uses an Ornstein-Uhlenbeck process with $\theta=0.15$ and $\sigma=0.2$.
+
+1. [`ddpg_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ddpg_continuous_action.py) runs the experiments using the `openai/gym` MuJoCo environments, while (Lillicrap et al., 2016)[^1] uses their proprietary MuJoCo environments.
+
+1. [`ddpg_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ddpg_continuous_action.py) uses the following architecture:
+    ```python
+    class QNetwork(nn.Module):
+        def __init__(self, env):
+            super(QNetwork, self).__init__()
+            self.fc1 = nn.Linear(np.array(env.single_observation_space.shape).prod() + np.prod(env.single_action_space.shape), 256)
+            self.fc2 = nn.Linear(256, 256)
+            self.fc3 = nn.Linear(256, 1)
+
+        def forward(self, x, a):
+            x = torch.cat([x, a], 1)
+            x = F.relu(self.fc1(x))
+            x = F.relu(self.fc2(x))
+            x = self.fc3(x)
+            return x
+
+
+    class Actor(nn.Module):
+        def __init__(self, env):
+            super(Actor, self).__init__()
+            self.fc1 = nn.Linear(np.array(env.single_observation_space.shape).prod(), 256)
+            self.fc2 = nn.Linear(256, 256)
+            self.fc_mu = nn.Linear(256, np.prod(env.single_action_space.shape))
+
+        def forward(self, x):
+            x = F.relu(self.fc1(x))
+            x = F.relu(self.fc2(x))
+            return torch.tanh(self.fc_mu(x))
+    ```
+    while (Lillicrap et al., 2016, see Appendix 7 EXPERIMENT DETAILS)[^1] uses the following architecture (difference highlighted):
+
+    ```python hl_lines="4-6 9-11 19-21"
+    class QNetwork(nn.Module):
+        def __init__(self, env):
+            super(QNetwork, self).__init__()
+            self.fc1 = nn.Linear(np.array(env.single_observation_space.shape).prod(), 400)
+            self.fc2 = nn.Linear(400 + np.prod(env.single_action_space.shape), 300)
+            self.fc3 = nn.Linear(300, 1)
+
+        def forward(self, x, a):
+            x = F.relu(self.fc1(x))
+            x = torch.cat([x, a], 1)
+            x = F.relu(self.fc2(x))
+            x = self.fc3(x)
+            return x
+
+
+    class Actor(nn.Module):
+        def __init__(self, env):
+            super(Actor, self).__init__()
+            self.fc1 = nn.Linear(np.array(env.single_observation_space.shape).prod(), 400)
+            self.fc2 = nn.Linear(400, 300)
+            self.fc_mu = nn.Linear(300, np.prod(env.single_action_space.shape))
+
+        def forward(self, x):
+            x = F.relu(self.fc1(x))
+            x = F.relu(self.fc2(x))
+            return torch.tanh(self.fc_mu(x))
+    ```
+
+1. [`ddpg_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ddpg_continuous_action.py) uses the following learning rates:
+
+    ```python
+    q_optimizer = optim.Adam(list(qf1.parameters()), lr=3e-4)
+    actor_optimizer = optim.Adam(list(actor.parameters()), lr=3e-4)
+    ```
+    while (Lillicrap et al., 2016, see Appendix 7 EXPERIMENT DETAILS)[^1] uses the following learning rates:
+
+    ```python
+    q_optimizer = optim.Adam(list(qf1.parameters()), lr=1e-4)
+    actor_optimizer = optim.Adam(list(actor.parameters()), lr=1e-3)
+    ```
+
+1. [`ddpg_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ddpg_continuous_action.py) uses `--batch-size=256 --tau=0.005`, while (Lillicrap et al., 2016, see Appendix 7 EXPERIMENT DETAILS)[^1] uses `--batch-size=64 --tau=0.001` (how the exploration noise and `tau` are used is sketched after this list).
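+
+As referenced above, the exploration noise and the `tau` coefficient enter the training loop roughly as follows. This is a simplified sketch (assuming a vector environment `envs`, the networks defined above, and the $\mathcal{N}(0, 0.1)$ noise and `tau=0.005` mentioned in the list), not a verbatim excerpt of the script:
+
+```python
+import torch
+
+
+def select_action(actor, obs, envs, exploration_noise=0.1):
+    # deterministic action from the actor plus Gaussian exploration noise N(0, 0.1),
+    # clipped to the valid action range
+    with torch.no_grad():
+        actions = actor(torch.Tensor(obs))
+        actions += torch.normal(torch.zeros_like(actions), exploration_noise)
+    return actions.cpu().numpy().clip(envs.single_action_space.low, envs.single_action_space.high)
+
+
+def soft_update(net, target_net, tau=0.005):
+    # polyak averaging of the target networks: target <- tau * online + (1 - tau) * target
+    for param, target_param in zip(net.parameters(), target_net.parameters()):
+        target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)
+```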
+
+### Experiment results
+
+PR :material-github: [vwxyzjn/cleanrl#137](https://github.com/vwxyzjn/cleanrl/pull/137) tracks our effort to conduct the experiments, and the reproduction instructions can be found at :material-github: [vwxyzjn/cleanrl/benchmark/ddpg](https://github.com/vwxyzjn/cleanrl/tree/master/benchmark/ddpg).
+
+Below are the average episodic returns for [`ddpg_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ddpg_continuous_action.py) (3 random seeds). To ensure the quality of the implementation, we compared the results against (Fujimoto et al., 2018)[^2].
+
+| Environment | [`ddpg_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ddpg_continuous_action.py) | [`OurDDPG.py`](https://github.com/sfujim/TD3/blob/master/OurDDPG.py) (Fujimoto et al., 2018, Table 1)[^2] | [`DDPG.py`](https://github.com/sfujim/TD3/blob/master/DDPG.py) using settings from (Lillicrap et al., 2016)[^1] in (Fujimoto et al., 2018, Table 1)[^2] |
+| ----------- | ----------- | ----------- | ----------- |
+| HalfCheetah | 9260.485 ± 643.088 | 8577.29 | 3305.60 |
+| Walker2d | 1728.72 ± 758.33 | 3098.11 | 1843.85 |
+| Hopper | 1404.44 ± 544.78 | 1860.02 | 2020.46 |
+
+
+???+ info
+
+    Note that [`ddpg_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ddpg_continuous_action.py) uses the gym MuJoCo v2 environments while [`OurDDPG.py`](https://github.com/sfujim/TD3/blob/master/OurDDPG.py) (Fujimoto et al., 2018)[^2] uses the gym MuJoCo v1 environments. According to :material-github: [openai/gym#834](https://github.com/openai/gym/pull/834), the gym MuJoCo v2 environments should be equivalent to the gym MuJoCo v1 environments.
+
+    Also note that our `ddpg_continuous_action.py` seems to perform worse than the reference implementation on Walker2d and Hopper. This is likely due to :material-github: [openai/baselines#938](https://github.com/openai/baselines/issues/938). We would have a hard time reproducing the gym MuJoCo v1 environments because they have long been deprecated.
+
+Learning curves:
+