The FCM is a compositional embedding model for relation classification that combines unlexicalized linguistic context with word embeddings
FCM paper (Gormley et al. 2015): http://www.cs.cmu.edu/~mgormley/papers/gormley+yu+dredze.emnlp.2015.pdf
The main purpose of this repository is to run the FCM on the relation classification task over several corpora, using multiple word embeddings, and to compute results (micro-F1, macro-F1, weighted-F1, etc.)
This repository is made of multiple pieces, the heart being the FCM C++ implementation by Mo Yu
I have built two Python scripts around it:
1- The first (main) one runs the FCM on a chosen corpus, tuning learning rate and number of epochs, using one or many word embeddings, and finally writes the results to a file
2- The second one converts a corpus from the Semeval 2010 format to a format usable by the FCM (adding various tags, dependency path information, etc.). It is useful if you wish to use my work on another corpus that you can easily get into the Semeval 2010 format
These 2 scripts are independent: if you only want to use one of them, you do not need to install anything for the other
I already provide the Semeval 2010, Semeval 2018 and reAce 2005 corpora, with all results using several word embeddings (see the results/macro_f1 folder), so the conversion script may not be that useful
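For readers unfamiliar with the three scores mentioned above, here is a minimal pure-Python sketch of how micro-, macro- and weighted-F1 are usually defined for single-label classification. The function name f1_scores and the toy labels are hypothetical; this is not the code used in this repository, and it does not try to reproduce the conventions of the official Semeval 2010 scorer.

```python
from collections import Counter


def f1_scores(gold, pred):
    """Return micro, macro and weighted F1 for two equal-length label lists."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was expected but missed

    def f1(true_pos, false_pos, false_neg):
        denom = 2 * true_pos + false_pos + false_neg
        return 2 * true_pos / denom if denom else 0.0

    per_label = {label: f1(tp[label], fp[label], fn[label]) for label in labels}
    support = Counter(gold)  # number of gold examples per label

    # Micro: pool the counts of all labels, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: plain average of the per-label F1 scores.
    macro = sum(per_label.values()) / len(labels)
    # Weighted: average of the per-label F1 scores, weighted by gold support.
    weighted = sum(per_label[label] * support[label] for label in labels) / len(gold)
    return {"micro": micro, "macro": macro, "weighted": weighted}


if __name__ == "__main__":
    gold = ["A", "A", "B", "C"]
    pred = ["A", "B", "B", "C"]
    print(f1_scores(gold, pred))
```

Macro-F1 gives every relation class the same weight, which is why it can differ noticeably from micro-F1 on corpora with unbalanced classes.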
This repository is for Windows use. A Linux version might come in the near future, and it should be relatively easy to adapt it yourself
For the main script you need Python 3 and the following packages:
- numpy
- sklearn
- scipy
For the conversion script you need Python 3 and the following packages:
- numpy
- scipy
- spacy
- networkx
- To use the main script, you first need to compile the FCM code: open a terminal in the fcm folder and run make. Since this repo is for Windows, I recommend using MinGW (don't forget to add it to your PATH environment variable)
Example: make with MinGW
mingw32-make
- To use the conversion script, you need to compile the SST code (which is a tagger): open a terminal in the data/corpus/raw_to_formated_script/sst folder and run make (as before, I recommend MinGW and mingw32-make)
For this script to run you also need gzip installed for command-line use (more precisely gunzip, its decompression tool). You can get it here (don't forget to add it to your PATH environment variable). If gunzip is not recognized as a terminal command, please refer to my Stackoverflow answer
In conclusion, the installation might seem complicated, but for the main script you just need to "make" the FCM and install the few Python libraries listed; for the conversion script you need to "make" the SST, install its Python libraries and have gunzip available as a terminal command
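Before compiling, it can help to check that the command-line tools named above are actually reachable from your PATH. The following is a minimal sketch using only the Python standard library; the helper name missing_tools is hypothetical and is not part of this repository.

```python
import shutil


def missing_tools(tools):
    """Return the names from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    # mingw32-make is needed to compile the FCM and the SST,
    # gunzip only for the conversion script.
    needed = ["mingw32-make", "gunzip"]
    missing = missing_tools(needed)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH")
```

If a tool is reported as missing right after you installed it, open a new terminal so that the updated PATH is picked up.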
Open a terminal in the root folder and execute:
python fcm_global.py <train data> <test_data> <epochs> <learning rate> [word embeddings]
Example:
python fcm_global.py semeval2018_train semeval2018_test 30 0.005
Get the results in the results/macro_f1 folder
Notes:
- If you do not pass a word embedding argument, the script runs on every word embedding available in the data/word_emb folder
- Train data and test data files have to be in the data/corpus/formated folder
- In this repo I only provide one small word embedding (GitHub size restriction), but you can get bigger and better-performing ones on my drive
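If you want to try several epoch and learning-rate settings in a row, the command line shown above can be generated from Python. This is a minimal sketch that only assumes the argument order documented in this section; the helper build_commands and the extra values 50 and 0.01 are hypothetical, and I have not checked here how successive runs name their result files.

```python
import itertools
import sys


def build_commands(train, test, epochs_list, learning_rates, embedding=None):
    """Build one fcm_global.py command (as an argument list) per setting."""
    commands = []
    for epochs, lr in itertools.product(epochs_list, learning_rates):
        cmd = [sys.executable, "fcm_global.py", train, test, str(epochs), str(lr)]
        if embedding is not None:
            # Optional last argument: a single word embedding to use.
            cmd.append(embedding)
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    for cmd in build_commands(
        "semeval2018_train", "semeval2018_test", [30, 50], [0.005, 0.01]
    ):
        print(" ".join(cmd))
```

Each generated list can be passed to subprocess.run(cmd, check=True) from the root folder to actually launch the run.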
To convert a corpus in Semeval 2010 format to a format usable by the FCM (see the comments in data/corpus/raw_to_formated_script.py for more details):
Open a terminal in the data/corpus/raw_to_formated_script folder and execute:
python raw_to_formated.py <file to convert>
Example:
python raw_to_formated.py semeval2018_train
Get the results in the data/corpus/formated folder
Notes:
- The file to convert has to be in the data/corpus/raw folder and, of course, in the Semeval 2010 format
- This script is also available as a Jupyter notebook (in French) for better visual understanding, in the ...notebook folder
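As a reminder of what the Semeval 2010 format looks like, here is a minimal sketch that parses one example, assuming the layout of the Semeval 2010 Task 8 files: a numbered, quoted sentence with <e1> and <e2> entity tags, followed by a relation line. The function parse_example is hypothetical and is not taken from raw_to_formated.py, whose exact input expectations are described in its own comments.

```python
import re

# Matches <e1>...</e1> and <e2>...</e2>, capturing the index and the span.
ENTITY_RE = re.compile(r"<e([12])>(.*?)</e\1>")


def parse_example(sentence_line, relation_line):
    """Parse one (sentence, relation) pair in the assumed Semeval 2010 layout."""
    idx, _, text = sentence_line.partition("\t")
    text = text.strip().strip('"')
    entities = {f"e{num}": span for num, span in ENTITY_RE.findall(text)}
    plain_text = ENTITY_RE.sub(r"\2", text)  # sentence without the tags
    return {
        "id": int(idx),
        "text": plain_text,
        "e1": entities.get("e1"),
        "e2": entities.get("e2"),
        "relation": relation_line.strip(),
    }


if __name__ == "__main__":
    sentence = '1\t"The <e1>configuration</e1> of antenna <e2>elements</e2>."'
    relation = "Component-Whole(e2,e1)"
    print(parse_example(sentence, relation))
```

The conversion script does much more than this (tagging, dependency paths), so this sketch is only meant to help you check that your own corpus is laid out as expected before converting it.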
Do not hesitate to contact me if you need some help
I left the Semeval 2010 official scorer in the results folder if you ever need to use it
Valentin Macé – LinkedIn – YouTube – Twitter -valentin.mace@kedgebs.com
Distributed under the MIT license. See LICENSE for more information.