Seq2Seq (Encoder-Decoder) with Attention Mechanism for Grammar Correction in Keras.
First, we need to create a parallel dataset for training the model. In the preprocess folder, the lang8 and nucle modules convert each dataset into the proper format. The Lang8 dataset is very noisy, so I apply some light preprocessing to it: I remove non-ASCII characters, collapse runs of 3 or more repeated characters into a single character (e.g. a token like !!!!!!! becomes !), and remove unnecessary punctuation (all punctuation except {',','.','-'}).
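The cleanup rules above can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the code from the preprocess folder: the helper name `clean_sentence` is hypothetical, and the order of the three steps is my own choice.

```python
import re
import string

# Punctuation to keep, taken from the description above; everything else is dropped.
KEEP_PUNCT = {",", ".", "-"}
DROP_PUNCT = "".join(ch for ch in string.punctuation if ch not in KEEP_PUNCT)
_DROP_TABLE = str.maketrans("", "", DROP_PUNCT)


def clean_sentence(text):
    """Hypothetical helper: light cleanup of a single sentence."""
    # 1. Remove non-ASCII characters.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # 2. Collapse runs of 3 or more identical characters into one.
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    # 3. Remove all punctuation except ',', '.' and '-'.
    text = text.translate(_DROP_TABLE)
    # 4. Normalize whitespace left behind by the removals.
    return " ".join(text.split())


print(clean_sentence("I am soooo happy..."))
```

Read literally, the last rule also strips apostrophes and a collapsed `!`, so a real pipeline may want to adjust the kept set.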
In the final preprocessing step, I do some data augmentation. For each (source, target) pair, in addition to the existing errors, I inject some typos and grammatical errors into the source sample. This step includes:
- Dropout token
- Modal replacement
- Misspelling tokens
- Change tense of verbs
- Change singularity/plurality of nouns
- Change preposition
According to this paper, the cases above cover most of the errors in English learners' writing.
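As a rough illustration of the augmentation list above, here is a small Python sketch of three of the corruptions (token dropout, modal/preposition replacement, misspelling). All function names and the tiny word lists are assumptions made for this example, not the repository's implementation. Changing verb tense and noun number needs a morphological tool, so those cases are left out.

```python
import random

# Tiny illustrative word lists (assumed); a real system would use larger ones.
PREPOSITIONS = ["in", "on", "at", "for", "to", "of", "with"]
MODALS = ["can", "could", "may", "might", "must", "should", "will", "would"]


def drop_token(tokens, rng):
    """Remove one randomly chosen token (no-op for very short inputs)."""
    if len(tokens) < 2:
        return list(tokens)
    i = rng.randrange(len(tokens))
    return tokens[:i] + tokens[i + 1:]


def replace_from_set(tokens, choices, rng):
    """Swap the first token found in `choices` for a different member."""
    out = list(tokens)
    for i, tok in enumerate(out):
        if tok.lower() in choices:
            alternatives = [c for c in choices if c != tok.lower()]
            out[i] = rng.choice(alternatives)
            break
    return out


def misspell(tokens, rng):
    """Swap two adjacent inner characters of one alphabetic token (length >= 4)."""
    out = list(tokens)
    candidates = [i for i, t in enumerate(out) if len(t) >= 4 and t.isalpha()]
    if not candidates:
        return out
    i = rng.choice(candidates)
    word = out[i]
    j = rng.randrange(1, len(word) - 2)  # keep first and last characters fixed
    out[i] = word[:j] + word[j + 1] + word[j] + word[j + 2:]
    return out


def inject_errors(tokens, rng, p=0.5):
    """Apply each corruption independently with probability p."""
    out = list(tokens)
    if rng.random() < p:
        out = drop_token(out, rng)
    if rng.random() < p:
        out = replace_from_set(out, MODALS, rng)
    if rng.random() < p:
        out = replace_from_set(out, PREPOSITIONS, rng)
    if rng.random() < p:
        out = misspell(out, rng)
    return out


rng = random.Random(13)  # fixed seed so runs are repeatable
print(inject_errors("She must go to school".split(), rng))
```

Only the source side is corrupted; the target sentence stays unchanged, so each noisy source still maps to the original correction.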
In the training step, I used the well-known seq2seq attention model here. The best hyper-parameters for seq2seq were explored by the team at Google in the "Massive Exploration of Neural Machine Translation Architectures" paper. I used a one-layer encoder/decoder to keep things as simple as possible. It can easily be extended to a 4-layer encoder/decoder framework (considering regularization and dropout).
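To make the attention step concrete, the sketch below computes attention weights and a context vector for a single decoder step in plain Python. It assumes dot-product scoring purely for brevity; the model used in this repository may score differently, and in Keras the same computation would be expressed with tensor operations. The names `softmax` and `attend` are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Return (weights, context) for one decoder step.

    decoder_state: list of floats, the current decoder hidden state.
    encoder_states: list of encoder hidden states, one per source token,
                    each the same length as decoder_state.
    """
    # Score each source position by its dot product with the decoder state.
    scores = [
        sum(d * h for d, h in zip(decoder_state, enc))
        for enc in encoder_states
    ]
    # Normalize the scores into a distribution over source positions.
    weights = softmax(scores)
    # The context vector is the weighted sum of the encoder states.
    dim = len(decoder_state)
    context = [
        sum(w * enc[k] for w, enc in zip(weights, encoder_states))
        for k in range(dim)
    ]
    return weights, context


weights, context = attend([2.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(weights, context)
```

The decoder would then combine `context` with its own state to predict the next output token.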
- git clone https://github.com/hadifar/GrammarCorrection.git
- cd GrammarCorrection
- virtualenv venv
- source venv/bin/activate
- pip2 install -r requirements.txt
- mkdir data
- cd data
- Download lang8 and NUCLE and put them in the data folder.
- cd ..
- cd preprocess
- sh preprocess_script.sh
- cd ..
- cd models
- Download the fastText pretrained embeddings and put them in the data/embedding folder
- sh train_script.sh
- That's all :)
- Look into riminder.ipynb in the root directory. (Do not forget to put the dataset in your Google Drive.)
- Use character ngram feature
- Use language model for checking final output
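For the character n-gram item above, one possible starting point is to extract n-grams with word-boundary markers, in the style of fastText subword features. This is only a sketch of the idea: the function name `char_ngrams` and the default n-gram range are assumptions.

```python
def char_ngrams(token, n_min=3, n_max=5):
    """Return all character n-grams of `token` for n in [n_min, n_max].

    The token is wrapped in '<' and '>' so that prefixes and suffixes
    produce n-grams distinct from word-internal ones.
    """
    marked = "<" + token + ">"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


print(char_ngrams("the", 3, 3))
```

Such features could help with misspelled tokens, since a typo still shares most of its n-grams with the intended word.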