A TensorFlow implementation of a BiLSTM-MaxPooling Siamese Network [1] for paraphrase detection on the Quora question pairs dataset [2]. The encoder can be pretrained with a Sequential Denoising Autoencoder (SDAE) to handle the semi-supervised setting (as in [3]).
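To give an intuition for the denoising part of SDAE pretraining, here is a minimal sketch of a typical corruption step: some tokens are dropped and some adjacent tokens are swapped, and the autoencoder is trained to reconstruct the clean question from the noisy one. The function name `corrupt` and the probabilities `p_drop` and `p_swap` are assumptions for illustration; the noise function and settings actually used in this repository may differ.

```python
import random


def corrupt(tokens, p_drop=0.1, p_swap=0.1, rng=None):
    """Return a noisy copy of `tokens` (hypothetical helper, not the repo's API)."""
    rng = rng or random.Random()
    # 1) Drop each token independently with probability p_drop.
    kept = [t for t in tokens if rng.random() >= p_drop]
    # 2) Swap each non-overlapping adjacent pair with probability p_swap.
    i = 0
    while i < len(kept) - 1:
        if rng.random() < p_swap:
            kept[i], kept[i + 1] = kept[i + 1], kept[i]
        i += 2
    return kept


question = "how do i learn python quickly".split()
noisy = corrupt(question, rng=random.Random(0))
# The autoencoder input would be `noisy`, and its target would be `question`.
```

Passing an explicit `random.Random` instance keeps the corruption reproducible across runs.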
The code is written in Python 3 with the following dependencies:
- TensorFlow (== 1.4)
- numpy
- pandas
- NLTK
- Gensim
A Dockerfile with the corresponding GPU environment and JupyterLab is provided in /docker.
cd docker
nvidia-docker build -t sqm .
nvidia-docker run -it -p 8888:8888 -v <absolute_path>/:/notebooks/ sqm
cd data
unzip data.zip -d .
./get_glove.sh
This unzips the Quora dataset and downloads the GloVe 6B embeddings.
The provided split is the standard partition from [4], in the original Quora format.
The question ids and pair ids do not match those of the original Quora release.
Supervised Siamese Network:
python training.py -m siamese
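As a rough illustration of what a BiLSTM-MaxPooling Siamese model computes once the BiLSTM has produced per-token hidden states, the sketch below pools each question into a fixed-size vector and combines the two vectors into classifier features. The combination [u, v, |u - v|, u * v] follows [1]; whether `training.py` uses exactly these features is an assumption, and the helper names are hypothetical. Plain Python lists stand in for tensors.

```python
def max_pool_over_time(hidden_states):
    """Element-wise max over time steps.

    hidden_states: list of T vectors (one per token), each of length D,
    e.g. the concatenated forward and backward LSTM states.
    """
    if not hidden_states:
        raise ValueError("need at least one time step")
    return [max(dim) for dim in zip(*hidden_states)]


def pair_features(u, v):
    """Combine two sentence vectors as [u, v, |u - v|, u * v]."""
    abs_diff = [abs(a - b) for a, b in zip(u, v)]
    product = [a * b for a, b in zip(u, v)]
    return list(u) + list(v) + abs_diff + product


# Toy encoder outputs: one row per token, D = 2 dimensions.
q1_states = [[0.1, -0.5], [0.7, 0.2], [-0.3, 0.4]]
q2_states = [[0.6, 0.0], [0.2, 0.9]]

u = max_pool_over_time(q1_states)  # [0.7, 0.4]
v = max_pool_over_time(q2_states)  # [0.6, 0.9]
features = pair_features(u, v)     # length 4 * D = 8
```

Both questions go through the same encoder weights, which is what makes the network Siamese; the pooled vectors have the same size regardless of question length.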
Semi-supervised SDAE-Siamese Network:
# -res is the size of the labeled seed of question pairs
python training.py -m hybrid -res 1000
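The labeled seed can be pictured as a small random subset of the training pairs that keeps its labels, while the remaining pairs are treated as unlabeled text. The sketch below shows one way to make such a split; `split_labeled_seed` is a hypothetical helper, and uniform random sampling with a fixed seed is an assumption, since how `training.py` selects the seed is not described here.

```python
import random


def split_labeled_seed(pairs, seed_size, rng_seed=0):
    """Split (question1, question2, is_duplicate) tuples into a labeled
    seed and an unlabeled remainder (hypothetical helper)."""
    if seed_size > len(pairs):
        raise ValueError("seed_size exceeds the number of available pairs")
    shuffled = list(pairs)
    random.Random(rng_seed).shuffle(shuffled)
    labeled = shuffled[:seed_size]
    # Labels are discarded for the remainder.
    unlabeled = [(q1, q2) for q1, q2, _ in shuffled[seed_size:]]
    return labeled, unlabeled


# Toy data standing in for the training partition.
toy_pairs = [("question a%d" % i, "question b%d" % i, i % 2) for i in range(10)]
labeled, unlabeled = split_labeled_seed(toy_pairs, seed_size=3)
```

With `-res 1000`, the analogous split would keep 1000 labeled pairs.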
[1] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, A. Bordes, Supervised Learning of Universal Sentence Representations from Natural Language Inference Data
[2] Quora question pairs dataset
[3] D. Shen, Y. Zhang, R. Henao, Q. Su, L. Carin, Deconvolutional Latent-Variable Model for Text Sequence Matching
[4] Z. Wang, W. Hamza, R. Florian, Bilateral Multi-Perspective Matching for Natural Language Sentences