# Concept Tagging Sequences using Transfer Learning and Named Entity Recognition Tools

Neural networks have produced remarkable results in many Natural Language Processing tasks, for example when tasked with assigning concepts to the words of a sentence. Their success is made possible by good word representations (embeddings) that a neural network can exploit. This work evaluates several recently developed pre-trained embeddings (ELMo, BERT and ConceptNet) on the task of tagging sequences from the movie domain. We then compare our measurements with previous results from the literature.

This repository contains the code for the second assignment of the [Language Understanding Systems](http://disi.unitn.it/~riccardi/page7/page13/page13.html) course at the [University of Trento](https://unitn.it), taught by [Prof. Giuseppe Riccardi](http://disi.unitn.it/~riccardi). The final report can be found [here](report/giovanni_de_toni_197814.pdf).

## Description

This repository is structured as follows:

* `concept-tagging-with-neural-networks`: the code of the original work on which this project is based. It is loaded as a submodule;
* `data`: the datasets/embeddings used for the project. They are stored with Git-LFS in a compressed format;
* `data_analysis`: utility scripts to analyze the various datasets;
* `report`: the report.

The top-level scripts are:

* `collect_results.sh`: collect the results and generate a complete results file;
* `generate_result_table.py`: generate a table from the collected results;
* `train_all_models.py`: run the experiments on an HPC cluster. It produces a series of jobs and the results are saved in the `results` directory;
* `submit_jobs.sh`: submit a single job to the cluster;
* `train_models.sh`: run exactly one run of the models.

The file `complete_results.txt` contains the results of all the experiments performed (more specifically, the max/min/mean F1 scores over all runs).

## Install

This project was written using **Python 3.7**, **bash** and **conda** for the virtual environment. Everything was tested on **Ubuntu 18.04**.

To install all the required dependencies, run the following commands:

```bash
git clone https://github.com/geektoni/concept-tagging-nn-spacy
cd concept-tagging-nn-spacy
git submodule update --init
conda env create -f environment.yml
conda activate ctnns
python -m spacy download en_core_web_sm
```

The next step is downloading the datasets. You will need **Git-LFS** to do so; please refer to the official instructions to install it (a minimal sketch for Ubuntu is given at the end of this section). Once it is installed, run the following commands:

```bash
cd concept-tagging-nn-spacy
git lfs fetch
git lfs checkout
```

If everything worked correctly, you should have a working environment in which to run the various scripts/experiments. If you encounter any errors, please feel free to open an issue on GitHub.

**IMPORTANT: Be aware that the entire repository takes at least 2-3 GB of disk space on your machine. If you do not plan to contribute actively, deleting the `.git` directory after downloading all the material noticeably reduces disk usage.**
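If Git-LFS is not yet available on your system, or you want to reclaim the disk space mentioned above, a minimal sketch for Ubuntu 18.04 might look like the following (the `apt` route is an assumption; adapt it to your distribution):

```bash
# Install Git-LFS (assumed to be available via apt on Ubuntu 18.04)
# and register its Git hooks for the current user.
sudo apt-get install git-lfs
git lfs install

# After fetching all the data, optionally reclaim disk space by removing
# the Git history. Only do this if you do not intend to push changes back.
du -sh .git    # check how much space the history takes
rm -rf .git    # irreversible: the directory stops being a Git checkout
```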
## Usage

To replicate the experiments, you can follow the instructions in the original report for the classical usage. Below are some examples of experiments with the ConceptNet, BERT and ELMo embeddings.

**Run LSTM with ConceptNet and NER+POS+CHAR features**

```bash
conda activate ctnns
cd concept-tagging-nn-spacy/concept-tagging-with-neural-networks/src
python run_model.py \
    --train ../../data/train.bz2 \
    --test ../../data/test.bz2 \
    --w2v ../../data/embeddings/conceptnet-300.bz2 \
    --model lstm \
    --epochs 15 \
    --write_results=../../results/result.txt \
    --bidirectional \
    --more-features \
    --embedder none \
    --batch 20 \
    --lr 0.001 \
    --hidden_size 200 \
    --drop 0.7 \
    --unfreeze \
    --c2v ../data/movies/c2v_20.pickle
```

**Run LSTM-CRF with ELMo (fine-tuned) and NER+POS+CHAR features**

```bash
conda activate ctnns
cd concept-tagging-nn-spacy/concept-tagging-with-neural-networks/src
python run_model.py \
    --train ../../data/train_elmo.bz2 \
    --test ../../data/test_elmo.bz2 \
    --w2v ../data/movies/w2v_trimmed.pickle \
    --model lstmcrf \
    --epochs 10 \
    --write_results=../../results/result.txt \
    --bidirectional \
    --more-features \
    --embedder elmo-combined \
    --batch 1 \
    --lr 0.001 \
    --hidden_size 200 \
    --drop 0.7 \
    --unfreeze \
    --c2v ../data/movies/c2v_20.pickle
```

**Run LSTM-CRF with BERT and NER+POS+CHAR features**

```bash
conda activate ctnns
cd concept-tagging-nn-spacy/concept-tagging-with-neural-networks/src
python run_model.py \
    --train ../../data/train_bert.bz2 \
    --test ../../data/test_bert.bz2 \
    --w2v ../data/movies/w2v_trimmed.pickle \
    --model lstmcrf \
    --epochs 10 \
    --write_results=../../results/result.txt \
    --bidirectional \
    --more-features \
    --embedder elmo-combined \
    --batch 1 \
    --lr 0.001 \
    --hidden_size 200 \
    --drop 0.7 \
    --unfreeze \
    --c2v ../data/movies/c2v_20.pickle
```

### Cluster usage

To replicate exactly the experiments we ran, you can use the `train_all_models.py` script, which generates several jobs on an HPC cluster through `qsub`. A sketch of this workflow is given at the end of this README.

## License

This software is distributed under the MIT license (see the LICENSE file).

## Authors

- Giovanni De Toni, [giovanni.detoni@studenti.unitn.it](mailto:giovanni.detoni@studenti.unitn.it)
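For reference, a possible end-to-end cluster workflow is sketched below, under the assumption that the helper scripts take no required arguments (check each script before running it; scheduler behaviour depends on your cluster):

```bash
conda activate ctnns
cd concept-tagging-nn-spacy

# Generate and submit one job per model/embedding configuration
# (assumed to need no extra arguments; it relies on qsub internally).
python train_all_models.py

# Monitor the submitted jobs on a PBS-style scheduler.
qstat -u "$USER"

# Once all jobs have finished, gather the per-run results from the
# results/ directory and build a summary table.
./collect_results.sh
python generate_result_table.py
```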