RLs: Reinforcement Learning Algorithm Based On PyTorch.


This project includes state-of-the-art and classic reinforcement learning algorithms (single- and multi-agent) for training agents that interact with Unity through ml-agents Release 18 or with gym.


The goal of this framework is to provide stable implementations of standard RL algorithms and simultaneously enable fast prototyping of new methods. It aims to fill the need for a small, easily grokked codebase in which users can freely experiment with wild ideas (speculative research).


This project supports:

  • Windows, Linux, and OSX.
  • Single- and multi-agent training.
  • Multiple types of observation sensors as input.
  • Implementing a new algorithm in only 3 steps:
    1. policy: write a .py file in the rls/algorithms/{single/multi} directory and make the policy inherit from the super-class defined in rls/algorithms/base
    2. config: write a .yaml file in the rls/configs/algorithms/ directory and specify the super config type defined in rls/configs/algorithms/general.yaml
    3. register: register the new algorithm in rls/algorithms/
  • Adapting to a new training environment in only 3 steps:
    1. wrapper: write environment wrappers in the rls/envs/{new platform} directory and make them inherit from the super-class defined in rls/envs/
    2. config: write the default configuration in rls/configs/{new platform}
    3. register: register the new environment platform in rls/envs/
  • Compatible with several environment platforms
    • Unity3D ml-agents.
    • PettingZoo
    • gym. For now only two data types are compatible: [Box, Discrete]. Parallel training with gym envs is supported; just set --copies to the number of agents you want to train in parallel.
      • environments:
      • observation -> action:
        • Discrete -> Discrete (observation type -> action type)
        • Discrete -> Box
        • Box -> Discrete
        • Box -> Box
        • Box/Discrete -> Tuple(Discrete, Discrete, Discrete)
  • Four types of replay buffer; the default is ER.
  • Noisy Net for better exploration.
  • Intrinsic Curiosity Module for almost all implemented off-policy algorithms.
  • Parallel training of multiple scenes for Gym.
  • Unified data format
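
The three steps for adding an algorithm (write a policy that inherits from a base class, add a config, register it) can be sketched roughly as follows. This is only an illustration of a register-by-name pattern, not the repository's actual code: `Policy`, `register`, `REGISTRY`, `RandomPolicy`, and `DEFAULT_CONFIG` are hypothetical names, and a plain dict stands in for the .yaml config.

```python
import random
from typing import Callable, Dict, Type

# Hypothetical registry mapping a command-line name to a policy class.
REGISTRY: Dict[str, Type["Policy"]] = {}


class Policy:
    """Hypothetical base class, standing in for the super-class in rls/algorithms/base."""

    def __init__(self, n_actions: int, **config):
        self.n_actions = n_actions
        self.config = config

    def __call__(self, obs):
        raise NotImplementedError


def register(name: str) -> Callable[[Type[Policy]], Type[Policy]]:
    """Class decorator that records a policy under `name` (step 3)."""
    def decorator(cls: Type[Policy]) -> Type[Policy]:
        if name in REGISTRY:
            raise KeyError(f"algorithm {name!r} is already registered")
        REGISTRY[name] = cls
        return cls
    return decorator


@register("random")
class RandomPolicy(Policy):
    """Step 1: a new policy that inherits from the base class."""

    def __call__(self, obs):
        # Ignore the observation and pick a uniformly random discrete action.
        return random.randrange(self.n_actions)


# Step 2: a default config (assumed values); in the repo this would be a .yaml file.
DEFAULT_CONFIG = {"random": {"lr": 1e-3}}

# Look the algorithm up by name, as a command-line option might do.
policy = REGISTRY["random"](n_actions=4, **DEFAULT_CONFIG["random"])
action = policy(obs=None)
print(type(policy).__name__, action)
```

With a pattern like this, selecting an algorithm by its command parameter reduces to a dictionary lookup, which is why registration is listed as a separate step.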


method 1:

$ git clone
$ cd RLs
$ conda create -n rls python=3.8
$ conda activate rls
# Windows
$ pip install -e .[windows]
# Linux or Mac OS
$ pip install -e .

method 2:

conda env create -f environment.yaml

If using ml-agents:

$ pip install -e .[unity]

You can download the pre-built Docker image from here:

$ docker pull keavnn/rls:latest

If you want to send a PR, please format all code files first:

$ pip install -e .[pr]
$ python -d ./

Implemented Algorithms

For now, these algorithms are available:

Algorithms Discrete Continuous Image RNN Command parameter
PG pg
AC ac
A2C a2c
NPG npg
TRPO trpo
PPO ppo
DQN dqn
Double DQN ddqn
Dueling Double DQN dddqn
Averaged DQN averaged_dqn
Bootstrapped DQN bootstrappeddqn
Soft Q-Learning sql
C51 c51
QR-DQN qrdqn
IQN iqn
Rainbow rainbow
DPG dpg
DDPG ddpg
TD3 td3
SAC(has V network) sac_v
SAC sac
TAC sac tac
MaxSQN maxsqn
OC oc
AOC aoc
PPOC ppoc
IOC ioc
PlaNet 1 planet
Dreamer 1 dreamer
DreamerV2 1 dreamerv2
VDN vdn
QMIX qmix
Qatten qatten
QPLEX qplex
QTRAN qtran
MADDPG maddpg
MASAC masac
CQL cql_dqn
BCQ bcq
MVE mve

A 1 in the RNN column means the algorithm must use an RNN, or uses one by default.

Getting started

usage: [-h] [-c COPIES] [--seed SEED] [-r]
              [-p {gym,unity,pettingzoo}]
              [-a {maddpg,masac,vdn,qmix,qatten,qtran,qplex,aoc,ppoc,oc,ioc,planet,dreamer,dreamerv2,mve,cql_dqn,bcq,pg,npg,trpo,ppo,a2c,ac,dpg,ddpg,td3,sac_v,sac,tac,dqn,ddqn,dddqn,averaged_dqn,c51,qrdqn,rainbow,iqn,maxsqn,sql,bootstrappeddqn}]
              [-i] [-l LOAD_PATH] [-m MODELS] [-n NAME]
              [--config-file CONFIG_FILE] [--store-dir STORE_DIR]
              [--episode-length EPISODE_LENGTH] [--hostname] [-e ENV_NAME]
              [-f FILE_NAME] [-s] [-d DEVICE] [-t MAX_TRAIN_STEP]

optional arguments:
  -h, --help            show this help message and exit
  -c COPIES, --copies COPIES
                        nums of environment copies that collect data in
  --seed SEED           specify the random seed of module random, numpy and
  -r, --render          whether render game interface
  -p {gym,unity,pettingzoo}, --platform {gym,unity,pettingzoo}
                        specify the platform of training environment
  -a {maddpg,masac,vdn,qmix,qatten,qtran,qplex,aoc,ppoc,oc,ioc,planet,dreamer,dreamerv2,mve,cql_dqn,bcq,pg,npg,trpo,ppo,a2c,ac,dpg,ddpg,td3,sac_v,sac,tac,dqn,ddqn,dddqn,averaged_dqn,c51,qrdqn,rainbow,iqn,maxsqn,sql,bootstrappeddqn}, --algorithm {maddpg,masac,vdn,qmix,qatten,qtran,qplex,aoc,ppoc,oc,ioc,planet,dreamer,dreamerv2,mve,cql_dqn,bcq,pg,npg,trpo,ppo,a2c,ac,dpg,ddpg,td3,sac_v,sac,tac,dqn,ddqn,dddqn,averaged_dqn,c51,qrdqn,rainbow,iqn,maxsqn,sql,bootstrappeddqn}
                        specify the training algorithm
  -i, --inference       inference the trained model, not train policies
  -l LOAD_PATH, --load-path LOAD_PATH
                        specify the name of pre-trained model that need to
  -m MODELS, --models MODELS
                        specify the number of trails that using different
                        random seeds
  -n NAME, --name NAME  specify the name of this training task
  --config-file CONFIG_FILE
                        specify the path of training configuration file
  --store-dir STORE_DIR
                        specify the directory that store model, log and
  --episode-length EPISODE_LENGTH
                        specify the maximum step per episode
  --hostname            whether concatenate hostname with the training name
  -e ENV_NAME, --env-name ENV_NAME
                        specify the environment name
  -f FILE_NAME, --file-name FILE_NAME
                        specify the path of builded training environment of
  -s, --save            specify whether save models/logs/summaries while
                        training or not
  -d DEVICE, --device DEVICE
                        specify the device that operate Torch.Tensor
  -t MAX_TRAIN_STEP, --max-train-step MAX_TRAIN_STEP
                        specify the maximum training steps
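
As a rough sketch, a subset of these options could be declared with Python's `argparse` as shown below. The flag names and the platform choices come from the help text above; the default values, the `build_parser` helper, and the shortened algorithm list are assumptions made for illustration and may differ from the repository's code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for a subset of the options listed above (defaults are assumed)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-c", "--copies", type=int, default=1,
                        help="number of environment copies")
    parser.add_argument("--seed", type=int, default=42,
                        help="random seed")
    parser.add_argument("-p", "--platform", choices=["gym", "unity", "pettingzoo"],
                        default="gym", help="platform of the training environment")
    # Only a few of the algorithm names from the help text, to keep the sketch short.
    parser.add_argument("-a", "--algorithm", choices=["dqn", "ppo", "sac", "qmix"],
                        default="ppo", help="training algorithm")
    parser.add_argument("-e", "--env-name", default=None,
                        help="environment name")
    parser.add_argument("-n", "--name", default=None,
                        help="name of this training task")
    parser.add_argument("-i", "--inference", action="store_true",
                        help="run inference instead of training")
    parser.add_argument("-s", "--save", action="store_true",
                        help="save models/logs/summaries while training")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["-p", "gym", "-a", "dqn", "-e", "CartPole-v0", "-c", "12", "-n", "dqn_cartpole"]
)
print(args.platform, args.algorithm, args.env_name, args.copies, args.name)
```

Passing a value outside a `choices` list makes `argparse` exit with a usage error, which matches the braces-delimited option lists in the help text.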


python -s    # save model and log while train
python -p gym -a dqn -e CartPole-v0 -c 12 -n dqn_cartpole
python -p unity -a ppo -n run_with_unity -c 1
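
To give a feel for what `-c 12` asks for, here is a minimal sketch of stepping several environment copies together. `ToyEnv` and `CopiedEnvs` are hypothetical stand-ins, and the copies are stepped one after another in a loop; how the repository actually runs its copies (threads, subprocesses, or otherwise) is not shown here.

```python
import random
from typing import List, Tuple


class ToyEnv:
    """Hypothetical stand-in for one environment copy; episodes last `horizon` steps."""

    def __init__(self, seed: int, horizon: int = 5):
        self.rng = random.Random(seed)
        self.horizon = horizon
        self.t = 0

    def reset(self) -> float:
        self.t = 0
        return self.rng.random()

    def step(self, action: int) -> Tuple[float, float, bool]:
        self.t += 1
        obs = self.rng.random()
        reward = float(action)
        done = self.t >= self.horizon
        return obs, reward, done


class CopiedEnvs:
    """Steps several env copies in lockstep and resets any copy whose episode ends."""

    def __init__(self, copies: int, seed: int = 0):
        self.envs = [ToyEnv(seed + i) for i in range(copies)]

    def reset(self) -> List[float]:
        return [env.reset() for env in self.envs]

    def step(self, actions: List[int]):
        obs, rewards, dones = [], [], []
        for env, action in zip(self.envs, actions):
            o, r, d = env.step(action)
            if d:
                o = env.reset()  # start a new episode in this copy
            obs.append(o)
            rewards.append(r)
            dones.append(d)
        return obs, rewards, dones


envs = CopiedEnvs(copies=12)
obs = envs.reset()
transitions = 0
for _ in range(10):
    obs, rewards, dones = envs.step([1] * len(obs))
    transitions += len(rewards)
print(transitions)  # 10 steps x 12 copies = 120 transitions
```

Each call to `step` yields one transition per copy, so more copies collect more data per step of the training loop.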

The main training loop in this repo, in pseudo-code:

agent.episode_reset()  # initialize rnn hidden state or something else
obs = env.reset()
while True:
    env_rets = env.step(agent(obs))
    agent.episode_step(obs, env_rets)  # store experience, save model, and train off-policy algorithms
    obs = env_rets['obs']
    if env_rets['done']:
        break  # leave the loop once the episode is over
agent.episode_end()  # train on-policy algorithms
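
To make the loop concrete, here is a small runnable sketch with a hypothetical `CountdownEnv` and `RandomAgent`. The hook names `episode_reset`, `episode_step`, and `episode_end` and the `'obs'` / `'done'` keys follow the pseudo-code; the classes, the `'reward'` key, the buffer, and the `break` on `done` are assumptions made for illustration.

```python
import random


class CountdownEnv:
    """Hypothetical environment: the episode ends after `horizon` steps."""

    def __init__(self, horizon: int = 5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return {"obs": self.t, "reward": float(action), "done": self.t >= self.horizon}


class RandomAgent:
    """Hypothetical agent exposing the hooks used in the pseudo-code."""

    def __init__(self, n_actions: int = 2, seed: int = 0):
        self.rng = random.Random(seed)
        self.n_actions = n_actions
        self.buffer = []
        self.episodes_ended = 0

    def episode_reset(self):
        # A recurrent agent would also reset its hidden state here.
        self.buffer.clear()

    def __call__(self, obs):
        return self.rng.randrange(self.n_actions)

    def episode_step(self, obs, env_rets):
        # Store the transition; an off-policy agent could train at this point.
        self.buffer.append((obs, env_rets["reward"], env_rets["obs"], env_rets["done"]))

    def episode_end(self):
        # An on-policy agent could train on the whole episode at this point.
        self.episodes_ended += 1


env, agent = CountdownEnv(horizon=5), RandomAgent()
agent.episode_reset()
obs = env.reset()
while True:
    env_rets = env.step(agent(obs))
    agent.episode_step(obs, env_rets)
    obs = env_rets["obs"]
    if env_rets["done"]:
        break
agent.episode_end()
print(len(agent.buffer), agent.episodes_ended)
```

The split between `episode_step` and `episode_end` is what lets one loop serve both families: off-policy methods can update after every stored transition, while on-policy methods wait for the finished episode.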

Giving credit

If using this repository for your research, please cite:

@misc{RLs,
  author = {Keavnn},
  title = {RLs: A Featureless Reinforcement Learning Repository},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{}},
}


If you have questions about this project or find errors, please let me know here.