These notes and accompanying code were created as a presentation aid for the paper Recurrent Neural Network Grammars, Dyer et al. 2016, at the Berlin Machine Learning seminar.
The code in RNNG.py is a reimplementation of Dyer et al. using the Python bindings to DyNet, and borrows heavily from two sources:
- The original RNNG code, implemented in C++
- The Python implementation of the stack LSTM parser from Graham Neubig's NN4NLP course
Both are released under the Apache license v2.0, as is this work.
Dependencies in this project are managed with Pipenv; see the Pipenv documentation for installation directions.
Once Pipenv is installed, run the following to install the dependencies:
pipenv install
Launch the environment shell to run the subsequent steps:
pipenv shell
The data used for this notebook is the NLTK release of ~10% of the Penn Treebank (Marcus et al., 1994). To get the data, download the file from the NLTK data repo and unzip it in the data directory within this repo.
To convert the treebank data to the necessary format and divide it into train/dev/test sets, run:
python split_training_data.py
See the source code to adjust file paths, the relative sizes of the training and dev sets, etc.
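As an illustration of the kind of split described above, here is a minimal sketch. It is not the code in split_training_data.py: the function name `split_dataset`, the default fractions, and the fixed seed are all assumptions made for this example.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(
    items: Sequence[str],
    dev_frac: float = 0.1,   # assumed fraction, not taken from the repo
    test_frac: float = 0.1,  # assumed fraction, not taken from the repo
    seed: int = 0,
) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle items reproducibly and cut them into train/dev/test lists."""
    shuffled = list(items)
    # A private Random instance keeps the split reproducible without
    # touching the global random state.
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_frac)
    n_test = int(len(shuffled) * test_frac)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test


if __name__ == "__main__":
    trees = [f"(S (NP (NN tree{i})))" for i in range(100)]
    train, dev, test = split_dataset(trees)
    print(len(train), len(dev), len(test))
```

Fixing the seed means the same sentences land in the same split on every run, which keeps dev and test scores comparable across experiments.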
To generate the oracle data sets, run:
python get_oracle_gen.py data/train.ptb data/train.ptb > data/train.oracle
python get_oracle_gen.py data/train.ptb data/dev.ptb > data/dev.oracle
python get_oracle_gen.py data/train.ptb data/test.ptb > data/test.oracle
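For orientation, an oracle for the generative model in Dyer et al. turns each bracketed tree into a sequence of NT(X), GEN(word), and REDUCE actions. The sketch below shows that conversion on a single tree. It is a simplified illustration and not the output format of get_oracle_gen.py; the function name `tree_to_actions` is hypothetical, and the input is assumed to be a well-formed, single-line bracketed tree.

```python
import re
from typing import List


def tree_to_actions(tree: str) -> List[str]:
    """Convert a bracketed tree into NT / GEN / REDUCE actions (a sketch)."""
    # Tokens are "(", ")", or any run of non-space, non-parenthesis characters.
    tokens = re.findall(r"\(|\)|[^\s()]+", tree)
    parens = ("(", ")")
    actions: List[str] = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            label = tokens[i + 1]
            is_preterminal = (
                i + 3 < len(tokens)
                and tokens[i + 2] not in parens
                and tokens[i + 3] == ")"
            )
            if is_preterminal:
                # "(TAG word)": generate the word; the POS tag is dropped here.
                actions.append(f"GEN({tokens[i + 2]})")
                i += 4
            else:
                # Open a new nonterminal constituent.
                actions.append(f"NT({label})")
                i += 2
        elif tok == ")":
            # Close the most recently opened constituent.
            actions.append("REDUCE")
            i += 1
        else:
            raise ValueError(f"unexpected token {tok!r} at position {i}")
    return actions


if __name__ == "__main__":
    for action in tree_to_actions("(S (NP (DT The) (NN cat)) (VP (VBZ sleeps)))"):
        print(action)
```

Each NT opens a constituent, each GEN emits the next word, and each REDUCE closes the most recently opened constituent, so the action sequence determines both the sentence and its tree.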
To get the Brown clusters used to support word generation (generated as described in Koo et al. 2008), download them and unzip them in the data directory.
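The clusters support word generation by letting the model predict a cluster first and then a word within that cluster, instead of scoring the whole vocabulary at once. A minimal loader sketch follows. The file layout is an assumption (one entry per line: cluster bit string, word, count, separated by whitespace), and the function name `load_clusters` is hypothetical; check the downloaded file before relying on it.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def load_clusters(
    lines: Iterable[str],
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Map words to cluster ids and cluster ids to their member words."""
    word_to_cluster: Dict[str, str] = {}
    cluster_to_words: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        # Assumed column order: bit string first, then the word.
        bitstring, word = fields[0], fields[1]
        word_to_cluster[word] = bitstring
        cluster_to_words[bitstring].append(word)
    return word_to_cluster, dict(cluster_to_words)


if __name__ == "__main__":
    sample = ["0010\tcat\t12", "0010\tdog\t9", "110\tsleeps\t4"]
    word_to_cluster, cluster_to_words = load_clusters(sample)
    print(word_to_cluster["cat"], cluster_to_words["0010"])
```

With these two maps, a word's probability can be factored as p(cluster) times p(word | cluster), so each prediction only normalizes over the clusters and over the words in one cluster.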
Follow these instructions to install the Jupyter notebook kernel within your virtual environment (after calling pipenv shell):
python -m ipykernel install --user --name=<rnng-notebook-[your local environment hash]>
Then launch the notebook server from within the shell:
jupyter notebook
Within the notebook, use the Kernel > Change kernel menu to use the kernel local to your virtual environment.
This software is released under the Apache license v2.0.