We create a market-making RL agent in the high-frequency trading (HFT) setting for Stanford's CS234 final project. Our full paper is found here. Unlike past RL approaches to market-making in the HFT setting, we explicitly implement the limit order book (LOB) when simulating our algorithm, which means the market dynamics depend explicitly on the (sum of the) market-maker's actions.
Market-makers provide liquidity to the stock market, allowing people to buy and sell stocks.
Liquidity is provided as "limit orders": asks, which allow the market to buy stocks, and bids, which allow the market to sell stocks.
A limit order has both a price and a number of stocks, and a limit order book (LOB) contains all bids and asks provided by a market-maker.
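For concreteness, here is a minimal sketch of such a book (a hypothetical `SimpleLOB`; the repo's `OrderBook` class is more involved):

```python
from collections import defaultdict

class SimpleLOB:
    """Toy limit order book: price -> total volume at that price."""
    def __init__(self):
        self.bids = defaultdict(float)  # orders the market can sell into
        self.asks = defaultdict(float)  # orders the market can buy from

    def add_bid(self, price, volume):
        self.bids[price] += volume

    def add_ask(self, price, volume):
        self.asks[price] += volume

    def best_bid(self):
        return max(self.bids) if self.bids else None  # highest bid price

    def best_ask(self):
        return min(self.asks) if self.asks else None  # cheapest ask price
```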
As a market-maker is only one liquidity provider among many for a given stock, the midprice of a stock is decoupled from any one market and evolves over time as a Brownian motion, $dS_t = \sigma\, dW_t$, for some volatility $\sigma$.
The market only interacts with the bids and asks closest to the midprice, buying the cheapest ask and selling the highest bid.
Modeling the rate of market orders is an active research field, but the arrivals are generally modeled as a Poisson process whose rate decays exponentially with the distance $\delta$ of a quote from the midprice, $\lambda(\delta) = A e^{-k\delta}$, where $A$ and $k$ are market-dependent parameters.
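A minimal sketch of these dynamics (with $\sigma$, $A$, and $k$ as assumed, uncalibrated parameters, not the repo's values):

```python
import numpy as np

rng = np.random.default_rng(0)

def step_midprice(s, sigma, dt):
    # Brownian-motion midprice update: dS = sigma * dW
    return s + sigma * np.sqrt(dt) * rng.standard_normal()

def n_market_orders(delta, A, k, dt):
    # Number of market orders hitting a quote a distance `delta` from the
    # midprice in one interval: Poisson with exponentially decaying rate.
    lam = A * np.exp(-k * delta)
    return rng.poisson(lam * dt)
```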
We simulate market trajectories over uniformly spaced time intervals of length $\Delta t$.
At each timestep, we first observe the state of the market, then place an action by submitting limit orders, and finally evolve the market one step and observe how it has changed the LOB. We update our internal wealth $W_t$ with the proceeds of any of our orders that the market filled.
Our observation at time $t$ consists of the current state of the LOB together with our wealth $W_t$ and inventory $q_t$.
We take actions in the form of the limit orders we submit: a price and a volume for our new bid and our new ask.
Our goal is to optimize the final value of the agent at the terminal time $T$. Assuming the agent is free to liquidate all of its stocks at the current midprice (using another market), its final value is $V_T = W_T + q_T S_T$: its cash plus its inventory marked at the midprice.
We use a final reward given by the CARA utility function used by Lim & Gorse (2018), $u(V_T) = -e^{-\gamma V_T}$, where $\gamma$ is the risk-aversion parameter.
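A schematic rollout under these definitions (the `market` and `agent` interfaces and the value of $\gamma$ are hypothetical, not this repo's API):

```python
import numpy as np

def terminal_value(cash, inventory, midprice):
    # final value if all remaining stock is liquidated at the midprice
    return cash + inventory * midprice

def cara_reward(value, gamma=1e-4):
    # CARA (exponential) utility of the terminal value
    return -np.exp(-gamma * value)

def rollout(market, agent, n_steps):
    # observe -> act (submit limit orders) -> evolve, reward at the end
    obs = market.observe()
    for _ in range(n_steps):
        bid, ask = agent.act(obs)
        market.place(bid, ask)
        obs = market.step()
    return cara_reward(terminal_value(market.cash, market.inventory,
                                      market.midprice))
```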
We use the proximal policy optimization (PPO) algorithm, which learns a stochastic policy. While actions in a real-world market are deterministic, we can always back out a deterministic policy later by taking the mean of the learned stochastic policy.
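With a Gaussian policy head, for example (an assumed architecture, not necessarily the one used here), the deterministic policy is just the mean:

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mean = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                  nn.Linear(64, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # learned spread

    def act(self, obs, deterministic=False):
        mu = self.mean(obs)
        if deterministic:  # back out the deterministic policy at test time
            return mu
        return torch.distributions.Normal(mu, self.log_std.exp()).sample()
```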
`main.py` is run with arguments to train a `MarketMaker` within the directory setup, storing results in `results/`. The path of the code is as follows:
- a `Config` is created from the arguments to store all hyperparameters and flags, as well as to organize the files in `results/`
- a `MarketMaker` is created from the `Config`
- the `MarketMaker` creates a `Policy` and a `Market`, loading an existing `BasePolicy` and `BaseNetwork` state dict if necessary
- the `Market` creates an `OrderBook` upon which to simulate trajectories
- the `MarketMaker` is then either trained or plotted, according to the args (see the sketch below)
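Schematically (only the class and function names below come from this repo; the module paths are assumptions):

```python
from config import get_config        # assumed module layout
from marketmaker import MarketMaker  # assumed module layout

def main(args):
    config = get_config(args)  # gather hyperparameters, set up results/
    mm = MarketMaker(config)   # builds Policy + Market (+ OrderBook)
    if args.plot:
        mm.plot()              # just plot an existing model
    else:
        mm.train()             # train, checkpointing into results/
```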
Arguments to Control Behavior:
- `--expand` number of epochs to expand a finished run to
- `--load` filename or path of a previous model to load
- `--plot` just plot a model
- `--good-results` plot all models located in the directory `/good-results`
- `--plot-after`, `-pa` number of epochs after which to plot a full batch

Arguments to Control Hyperparameters:
- Training
  - `-nt` number of timesteps per trajectory
  - `-nb` number of trajectories to sample per epoch
  - `-ne` number of epochs to train the policy on
  - `-lr` float learning rate to pass to the Adam optimizer
  - `--update-freq` number of gradient steps to perform each epoch, default 5
  - `--uniform` use a uniform trajectory length, where all observations after the LOB dies are zero
- Policy: default is MC returns, PPO policy update, and advantages (see the PPO sketch after this argument list)
  - `--td` use TD($\lambda$) eligibility-trace returns instead of the usual Monte Carlo discounted returns
  - `--noppo`, `-np` use REINFORCE instead of the default PPO
  - `--noclip`, `-nc` when using PPO, don't clip the policy ratio
  - `--noadv`, `-na` use the returns instead of advantages
- Reward: default is an immediate $dW$ reward
  - `--no-immediate`, `-ni` don't add any intermediate rewards
  - `--always-final`, `-af` always add a final reward
  - `--add-time`, `-at` add a time reward
  - `--add-inventory`, `-ai` add an inventory reward
- Initial State
  - `--book-size`, `-bs` initial number of stocks in the order book, default 10000
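As referenced in the Policy options above, here is a minimal sketch of the standard clipped PPO surrogate objective that `--noclip` and `--noadv` modify:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    # ratio = pi_new(a|s) / pi_old(a|s); adv is the advantage estimate
    # (or the raw return when --noadv is set)
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    # --noclip corresponds to maximizing the unclipped term alone
    return -torch.min(unclipped, clipped).mean()
```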
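Example invocations (hypothetical paths and values, combining the flags above):

```bash
# train a fresh model: 1000 epochs of 100 trajectories, 5000 steps each
python main.py -ne 1000 -nb 100 -nt 5000 -lr 3e-4

# plot a previously trained model
python main.py --load results/my-run --plot
```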
Hyperparameters are passed around using a `Config` instance, which is used to initialize the `MarketMaker`, `Policy`, `Market`, and `OrderBook` instances.
- default hyperparameters are stored in the kwargs for `Config.__init__()`
- `get_config(args)` is used to initialize a `Config` from the arguments passed through to `main.py`
- as epochs are updated, `Config.set_name()` is used to rename all run files with the correct current epoch
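A toy version of this pattern (hypothetical defaults and naming scheme; the real `Config.__init__()` carries many more kwargs):

```python
class Config:
    def __init__(self, ne=1000, nb=100, lr=3e-4, plot=False, **kwargs):
        self.ne, self.nb, self.lr, self.plot = ne, nb, lr, plot
        self.__dict__.update(kwargs)  # remaining flags become attributes
        self.set_name(epoch=0)

    def set_name(self, epoch):
        # rename the run's files so they carry the current epoch count
        self.name = f"run_ne{self.ne}_epoch{epoch}"

def get_config(args):
    # build a Config from the argparse namespace passed to main.py
    return Config(**vars(args))
```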
- implement more efficient gradients with sparse CSR or CSC arrays (`MaskedMarketMaker`) for faster running
- incorporate testing of more reward functions, such as the CARA final utility from Lim & Gorse (2018) and the various reward schemes used in Guo, Lin, and Huang (2023)
- parameter-tune TD($\lambda$) eligibility traces
- implement dynamically taking in the $p$ past trajectories and using more layers to generate a more robust policy
- create a full CUDA version that uses tensor operations whenever possible, so that larger batch sizes can be used for better, more robust policies
- create a full MPS version (M1), which should just be changing the `.to(device)` line
- actually find the correct parameters for the market transaction rate instead of just eyeballing from Desmos