Docs/overhaul #999

Merged
63 commits merged on Dec 6, 2023

Commits (63)
6b6ce0f
Add notebooks from ./docs/tutorials/get_started.rst to ./notebooks
carlocagnetta Oct 17, 2023
efaadec
Removed notebook outputs
carlocagnetta Oct 17, 2023
6a60396
Add jupyter pkg dependency to poetry env
carlocagnetta Oct 17, 2023
acc671b
Update gitignore
carlocagnetta Oct 20, 2023
7ed0703
Update gitignore
carlocagnetta Oct 20, 2023
de3a021
Setup jupyter-book
carlocagnetta Oct 26, 2023
148e11a
Update notebooks to current version
carlocagnetta Oct 26, 2023
96704a8
Update jupyter-book _config.yml to comply with template
carlocagnetta Oct 27, 2023
04c374c
add git-action to compile notebooks
carlocagnetta Nov 2, 2023
0e9837f
Fix workflow action compiling notebooks
carlocagnetta Nov 2, 2023
bbe5da4
Add compiled books for documentation
carlocagnetta Nov 2, 2023
c175f76
Fix notebooks.yml
carlocagnetta Nov 2, 2023
23eadb1
Change books publish dir
carlocagnetta Nov 2, 2023
1170ad3
Change gh-pages publish dir
carlocagnetta Nov 2, 2023
60f54fd
Change gh-pages publish dir
carlocagnetta Nov 2, 2023
0108726
fix build books action command
carlocagnetta Nov 2, 2023
30fc884
Publish notebooks only on master
carlocagnetta Nov 2, 2023
624998d
fix notebook action
carlocagnetta Nov 2, 2023
b74ea40
Add git-action to compile notebooks
carlocagnetta Nov 2, 2023
8489fde
Publish compiled Jupyter books to github-pages
carlocagnetta Nov 2, 2023
3a87b33
Publish notebooks only from master
carlocagnetta Nov 2, 2023
6df5616
Move notebooks to doc and resolve spellcheck
carlocagnetta Nov 9, 2023
8f0c62a
Documentation update: jupyter-book running on ReadTheDocs including t…
carlocagnetta Nov 10, 2023
102045c
Fix .readthedocs.yaml
carlocagnetta Nov 10, 2023
ca2e4a1
Fix docs index
carlocagnetta Nov 10, 2023
fb97091
Fix action gh-pages
carlocagnetta Nov 10, 2023
9aad2c9
Fix readthedocs to install poetry
carlocagnetta Nov 10, 2023
64af97b
Fix RTD and gh-pages docu auto-generation
carlocagnetta Nov 10, 2023
b1b7f24
Fix docs/requirements.txt
carlocagnetta Nov 10, 2023
08f1770
Fix docs/requirements.txt
carlocagnetta Nov 10, 2023
573d53d
Fix docs/requirements.txt
carlocagnetta Nov 10, 2023
6509a20
Add autogenerated api to gitignore
carlocagnetta Nov 10, 2023
396f20b
Fix docs/requirements.txt
carlocagnetta Nov 10, 2023
4693b0b
Remove autogenerated docs/api/highlevel
carlocagnetta Nov 10, 2023
06d2703
Fix docs/requirements.txt
carlocagnetta Nov 10, 2023
9ab5d35
Fix docs/requirements.txt
carlocagnetta Nov 10, 2023
42d9599
Fix docs/requirements.txt
carlocagnetta Nov 10, 2023
89d8cf3
Removed action for gh-pages
carlocagnetta Nov 10, 2023
6f739cc
update docs/.gitignore
carlocagnetta Nov 15, 2023
6fa536f
Update Documentation building
carlocagnetta Nov 15, 2023
a8bceff
Moved all docs images in docs/_static
carlocagnetta Nov 15, 2023
cf3e94a
Update .readthedocs.yaml
carlocagnetta Nov 15, 2023
f5041f4
Replaced .png images with .svg where possible
carlocagnetta Nov 17, 2023
a12b157
Add launch button for notebooks in colab
carlocagnetta Nov 17, 2023
fa55217
Remove get_started.rst page with links to outdated notebooks
carlocagnetta Nov 17, 2023
830969d
Update .readthe
carlocagnetta Nov 17, 2023
1515ff9
Compressed .png and .jpg images
carlocagnetta Nov 17, 2023
5d6abfa
revert .readthedocs.yaml
carlocagnetta Nov 19, 2023
d4b6d9b
WIP - restructure doc files
Nov 17, 2023
006577d
WIP - restructure doc files
Nov 23, 2023
a568561
Docs: generate all api docs automatically
Dec 4, 2023
4cfefcf
Docs: removed conflicting sphinx stuff from a docstring
Dec 4, 2023
5af2947
Docs: removed capitalization
Dec 4, 2023
b129836
Docs: added sorting order for autogenerated toc
Dec 4, 2023
28fda00
Docs: added links to source code, readded some ruff ignore rules
Dec 4, 2023
2e39a25
Docstring: minor changes to let ruff pass
Dec 4, 2023
a846b52
Typing: fixed multiple typing issues
Dec 5, 2023
0b67447
Docs: fixing spelling, re-adding spellcheck to pipeline
Dec 5, 2023
19e129d
Fix rtd build
Dec 5, 2023
c50e74f
Fix rtd build, improvements in task running
Dec 5, 2023
9d14407
Deal with .jupyter_cache
Dec 5, 2023
5f4a02c
Docs: improve API landing page
Dec 5, 2023
4c24dc6
Formatting
Dec 5, 2023
3 changes: 3 additions & 0 deletions .gitignore
@@ -153,6 +153,9 @@ videos/
# might be needed for IDE plugins that can't read ruff config
.flake8

docs/notebooks/_build/
docs/conf.py

# temporary scripts (for ad-hoc testing), temp folder
/temp
/temp*.py
4 changes: 2 additions & 2 deletions .pre-commit-config.yaml
@@ -28,8 +28,8 @@ repos:
pass_filenames: false
- id: poetry-lock-check
name: poetry lock check
entry: poetry lock
args: [--check]
entry: poetry check
args: [--lock]
language: system
pass_filenames: false
- id: mypy
23 changes: 11 additions & 12 deletions .readthedocs.yaml
@@ -10,15 +10,14 @@ build:
os: ubuntu-22.04
tools:
python: "3.11"
jobs:
pre_build:
- pip install .

# Build documentation in the docs/ directory with Sphinx
sphinx:
configuration: docs/conf.py
# We recommend specifying your dependencies to enable reproducible builds:
# https://docs.readthedocs.io/en/stable/guides/reproducible-builds.html
python:
install:
- requirements: docs/requirements.txt
commands:
- mkdir -p $READTHEDOCS_OUTPUT/html
- curl -sSL https://install.python-poetry.org | python -
# - ~/.local/bin/poetry config virtualenvs.create false
- ~/.local/bin/poetry install --with dev
## Same as the poe tasks, but unfortunately poe doesn't work when poetry doesn't create virtualenvs
- ~/.local/bin/poetry run python docs/autogen_rst.py
- ~/.local/bin/poetry run which jupyter-book
- ~/.local/bin/poetry run python docs/create_toc.py
- ~/.local/bin/poetry run jupyter-book config sphinx docs/
- ~/.local/bin/poetry run sphinx-build -W -b html docs $READTHEDOCS_OUTPUT/html
6 changes: 4 additions & 2 deletions docs/.gitignore
@@ -1,2 +1,4 @@
# auto-generated content
/api/tianshou.highlevel
/03_api/*
jupyter_execute
_toc.yml
.jupyter_cache
2 changes: 1 addition & 1 deletion docs/tutorials/dqn.rst → docs/01_tutorials/00_dqn.rst
@@ -308,7 +308,7 @@ Tianshou supports user-defined training code. Here is the code snippet:
# train policy with a sampled batch data from buffer
losses = policy.update(64, train_collector.buffer)
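To give a fuller picture of such a user-defined loop, here is a minimal sketch (illustrative only, not taken from the tutorial; the step counts, evaluation interval and reward threshold are arbitrary assumptions), reusing the ``policy``, ``train_collector`` and ``test_collector`` objects built earlier in the tutorial:

    def custom_train(policy, train_collector, test_collector, reward_threshold=195.0):
        # Illustrative training loop; all hyperparameters are placeholders.
        losses = None
        for step in range(10_000):
            # gather fresh experience with the exploration policy
            train_collector.collect(n_step=10)
            # periodically evaluate and stop once the task is considered solved
            if step % 1_000 == 0:
                result = test_collector.collect(n_episode=10)
                if result["rews"].mean() >= reward_threshold:
                    break
            # train the policy with a batch sampled from the replay buffer
            losses = policy.update(64, train_collector.buffer)
        return losses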

For further usage, you can refer to the :doc:`/tutorials/cheatsheet`.
For further usage, you can refer to the :doc:`/01_tutorials/07_cheatsheet`.

.. rubric:: References

Changes to another documentation file (filename not shown in this view):
@@ -339,7 +339,7 @@ Thus, we need a time-related interface for calculating the 2-step return. :meth:

This code does not consider the done flag, so it may not work very well. It shows two ways to get :math:`s_{t + 2}` from the replay buffer easily in :meth:`~tianshou.policy.BasePolicy.process_fn`.

For other methods, you can check out :doc:`/api/tianshou.policy`. We give a high-level explanation of the usage of the policy class in :ref:`pseudocode`.
For other methods, you can check out :doc:`/03_api/policy/index`. We give a high-level explanation of the usage of the policy class in :ref:`pseudocode`.
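To make this concrete, here is a minimal sketch of a policy whose ``process_fn`` assembles such a 2-step return (illustrative only, not taken from the Tianshou source; it ignores the done flag exactly as cautioned above, and details such as the ``_gamma`` attribute and the wrap-around indexing are assumptions):

    from tianshou.policy import DQNPolicy

    class TwoStepDQNPolicy(DQNPolicy):
        def process_fn(self, batch, buffer, indices):
            n = len(buffer)
            batch_t1 = buffer[(indices + 1) % n]  # transition at t+1
            batch_t2 = buffer[(indices + 2) % n]  # transition at t+2, so batch_t2.obs is s_{t+2}
            # bootstrap with max_a Q(s_{t+2}, a) from the current network
            q_next = self(batch_t2).logits.max(dim=1)[0].detach().cpu().numpy()
            # 2-step target: r_t + gamma * r_{t+1} + gamma^2 * max_a Q(s_{t+2}, a)
            batch.returns = batch.rew + self._gamma * batch_t1.rew + self._gamma ** 2 * q_next
            return batch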


Collector
@@ -382,7 +382,7 @@ Trainer

Once you have a collector and a policy, you can start writing the training method for your RL agent. The Trainer is essentially a simple wrapper: it saves you the effort of writing the training loop yourself. You can also construct your own trainer: :ref:`customized_trainer`.

Tianshou has three types of trainer: :func:`~tianshou.trainer.onpolicy_trainer` for on-policy algorithms such as Policy Gradient, :func:`~tianshou.trainer.offpolicy_trainer` for off-policy algorithms such as DQN, and :func:`~tianshou.trainer.offline_trainer` for offline algorithms such as BCQ. Please check out :doc:`/api/tianshou.trainer` for the usage.
Tianshou has three types of trainer: :func:`~tianshou.trainer.onpolicy_trainer` for on-policy algorithms such as Policy Gradient, :func:`~tianshou.trainer.offpolicy_trainer` for off-policy algorithms such as DQN, and :func:`~tianshou.trainer.offline_trainer` for offline algorithms such as BCQ. Please check out :doc:`/03_api/trainer/index` for the usage.

We also provide the corresponding iterator-based trainer classes :class:`~tianshou.trainer.OnpolicyTrainer`, :class:`~tianshou.trainer.OffpolicyTrainer`, :class:`~tianshou.trainer.OfflineTrainer` to help users write more flexible training logic:
::
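The example that follows the ``::`` marker above is collapsed in this view. As a hedged illustration of the iterator-based usage it describes (constructor arguments copied from the notebook added in this PR; the exact values yielded per iteration are an assumption, not a confirmed API):

    trainer = OnpolicyTrainer(
        policy=policy,
        train_collector=train_collector,
        test_collector=test_collector,
        max_epoch=10,
        step_per_epoch=50000,
        repeat_per_collect=10,
        episode_per_test=10,
        batch_size=256,
        step_per_collect=2000,
    )
    for epoch, epoch_stat, info in trainer:  # one iteration per training epoch
        print(f"epoch {epoch}: {epoch_stat}")
        # custom logic between epochs goes here, e.g. checkpointing or early stopping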
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
Changes to another documentation file (filename not shown in this view):
@@ -126,7 +126,7 @@ The figure on the right gives an intuitive comparison among synchronous/asynchro
.. note::

The async simulation collector would cause some exceptions when used as
``test_collector`` in :doc:`/api/tianshou.trainer` (related to
``test_collector`` in :doc:`/03_api/trainer/index` (related to
`Issue 700 <https://github.com/thu-ml/tianshou/issues/700>`_). Please use
the sync version for ``test_collector`` instead.

@@ -478,4 +478,4 @@ By constructing a new state ``state_ = (state, agent_id, mask)``, essentially we
act = policy(state_)
next_state_, reward = env.step(act)

Following this idea, we write a tiny example of playing `Tic Tac Toe <https://en.wikipedia.org/wiki/Tic-tac-toe>`_ against a random player by using a Q-learning algorithm. The tutorial is at :doc:`/tutorials/tictactoe`.
Following this idea, we write a tiny example of playing `Tic Tac Toe <https://en.wikipedia.org/wiki/Tic-tac-toe>`_ against a random player by using a Q-learning algorithm. The tutorial is at :doc:`/01_tutorials/04_tictactoe`.
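To illustrate what such a combined state can look like, here is a small self-contained sketch (the dict keys and the toy masked policy are assumptions for illustration, not Tianshou's actual multi-agent API):

    import numpy as np

    def random_masked_policy(obs_):
        # toy stand-in for a trained policy: pick uniformly among legal actions
        legal = np.flatnonzero(obs_["mask"])
        return np.random.choice(legal)

    # the combined observation from the text: state_ = (state, agent_id, mask)
    obs_ = {
        "agent_id": "player_1",                  # whose turn it is
        "obs": np.zeros((3, 3), dtype=np.int8),  # an empty Tic-Tac-Toe board
        "mask": np.ones(9, dtype=bool),          # all 9 cells are legal at the start
    }
    act = random_masked_policy(obs_)
    print(f"agent {obs_['agent_id']} plays cell {act}")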
2 changes: 2 additions & 0 deletions docs/01_tutorials/index.rst
@@ -0,0 +1,2 @@
Tutorials
=========
4 changes: 4 additions & 0 deletions docs/02_notebooks/0_intro.md
@@ -0,0 +1,4 @@
# Notebook Tutorials

Here is a collection of executable tutorials for Tianshou. You can run them
directly in Colab, or download them and run them locally.
236 changes: 236 additions & 0 deletions docs/02_notebooks/L0_overview.ipynb
@@ -0,0 +1,236 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"editable": true,
"id": "r7aE6Rq3cAEE",
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"# Overview\n",
"In this tutorial, we use guide you step by step to show you how the most basic modules in Tianshou work and how they collaborate with each other to conduct a classic DRL experiment."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "1_mLTSEIcY2c"
},
"source": [
"## Run the code\n",
"Before we get started, we must first install Tianshou's library and Gym environment by running the commands below. Here I choose a specific version of Tianshou(0.4.8) which is the latest as of the time writing this tutorial. APIs in different versions may vary a little bit but most are the same. Feel free to use other versions in your own project."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "IcFNmCjYeIIU"
},
"source": [
"Below is a short script that use a certain DRL algorithm (PPO) to solve the classic CartPole-v1\n",
"problem in Gym. Simply run it and **don't worry** if you can't understand the code very well. That is\n",
"exactly what this tutorial is for.\n",
"\n",
"If the script ends normally, you will see the evaluation result printed out before the first\n",
"epoch is done."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"editable": true,
"is_executing": true,
"slideshow": {
"slide_type": ""
},
"tags": [
"hide-cell",
"remove-output"
]
},
"outputs": [],
"source": [
"import gymnasium as gym\n",
"import torch\n",
"\n",
"from tianshou.data import Collector, VectorReplayBuffer\n",
"from tianshou.env import DummyVectorEnv\n",
"from tianshou.policy import PPOPolicy\n",
"from tianshou.trainer import OnpolicyTrainer\n",
"from tianshou.utils.net.common import ActorCritic, Net\n",
"from tianshou.utils.net.discrete import Actor, Critic\n",
"\n",
"device = \"cuda\" if torch.cuda.is_available() else \"cpu\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"editable": true,
"is_executing": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"# environments\n",
"env = gym.make(\"CartPole-v1\")\n",
"train_envs = DummyVectorEnv([lambda: gym.make(\"CartPole-v1\") for _ in range(20)])\n",
"test_envs = DummyVectorEnv([lambda: gym.make(\"CartPole-v1\") for _ in range(10)])\n",
"\n",
"# model & optimizer\n",
"net = Net(env.observation_space.shape, hidden_sizes=[64, 64], device=device)\n",
"actor = Actor(net, env.action_space.n, device=device).to(device)\n",
"critic = Critic(net, device=device).to(device)\n",
"actor_critic = ActorCritic(actor, critic)\n",
"optim = torch.optim.Adam(actor_critic.parameters(), lr=0.0003)\n",
"\n",
"# PPO policy\n",
"dist = torch.distributions.Categorical\n",
"policy = PPOPolicy(\n",
" actor=actor,\n",
" critic=critic,\n",
" optim=optim,\n",
" dist_fn=dist,\n",
" action_space=env.action_space,\n",
" action_scaling=False,\n",
")\n",
"\n",
"\n",
"# collector\n",
"train_collector = Collector(policy, train_envs, VectorReplayBuffer(20000, len(train_envs)))\n",
"test_collector = Collector(policy, test_envs)\n",
"\n",
"# trainer\n",
"result = OnpolicyTrainer(\n",
" policy=policy,\n",
" batch_size=256,\n",
" train_collector=train_collector,\n",
" test_collector=test_collector,\n",
" max_epoch=10,\n",
" step_per_epoch=50000,\n",
" repeat_per_collect=10,\n",
" episode_per_test=10,\n",
" step_per_collect=2000,\n",
" stop_fn=lambda mean_reward: mean_reward >= 195,\n",
")\n",
"print(result)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "G9YEQptYvCgx",
"is_executing": true,
"outputId": "2a9b5b22-be50-4bb7-ae93-af7e65e7442a"
},
"outputs": [],
"source": [
"# Let's watch its performance!\n",
"policy.eval()\n",
"result = test_collector.collect(n_episode=1, render=False)\n",
"print(\"Final reward: {}, length: {}\".format(result[\"rews\"].mean(), result[\"lens\"].mean()))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "xFYlcPo8fpPU"
},
"source": [
"## Tutorial Introduction\n",
"\n",
"A common DRL experiment as is shown above may require many components to work together. The agent, the\n",
"environment (possibly parallelized ones), the replay buffer and the trainer all work together to complete a\n",
"training task.\n",
"\n",
"<div align=center>\n",
"<img src=\"https://tianshou.readthedocs.io/en/master/_images/pipeline.png\">\n",
"\n",
"</div>\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kV_uOyimj-bk"
},
"source": [
"In Tianshou, all of these main components are factored out as different building blocks, which you\n",
"can use to create your own algorithm and finish your own experiment.\n",
"\n",
"Building blocks may include:\n",
"- Batch\n",
"- Replay Buffer\n",
"- Vectorized Environment Wrapper\n",
"- Policy (the agent and the training algorithm)\n",
"- Data Collector\n",
"- Trainer\n",
"- Logger\n",
"\n",
"\n",
"Check this [webpage](https://tianshou.readthedocs.io/en/master/tutorials/dqn.html) to find jupyter-notebook-style tutorials that will guide you through all these\n",
"modules one by one. You can also read the [documentation](https://tianshou.readthedocs.io/en/master/) of Tianshou for more detailed explanation and\n",
"advanced usages."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "S0mNKwH9i6Ek"
},
"source": [
"## Further reading"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "M3NPSUnAov4L"
},
"source": [
"### What if I am not familiar with the PPO algorithm itself?\n",
"As for the DRL algorithms themselves, we will refer you to the [Spinning up documentation](https://spinningup.openai.com/en/latest/algorithms/ppo.html), where they provide\n",
"plenty of resources and guides if you want to study the DRL algorithms. In Tianshou's tutorials, we will\n",
"focus on the usages of different modules, but not the algorithms themselves."
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}