DiMo is a collection of scripts for my bachelor's thesis Comparison and Evaluation of Models for Distributional Semantics.
Take a look at the notebooks on the thesis's official website to see how these scripts can be used:
Notice! Parts of the code require Sketch Engine's internal packages manatee and wmap.
Other required packages are:
- numpy
- scipy
- gensim
- sklearn
The code runs on Python 2.7.
Unlike the original implementation, the one in this project operates directly on a co-occurrence matrix.
If you have a corpus with compiled word sketches (let's say it is called bnc2), use the wm2thes.py script to create such a matrix:
python wm2thes.py bnc2 bnc2-matrix
This creates 4 files representing a sparse word x (relation, word) matrix:
- bnc2-matrix-target2i.pickle # dictionary: words to indices
- bnc2-matrix-rows.npy # row indices
- bnc2-matrix-cols.npy # col indices
- bnc2-matrix-vals.npy # values
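If you want to inspect the matrix outside of this project, the four files can be reassembled with scipy; here is a toy sketch where synthetic in-memory arrays stand in for the .pickle and .npy files:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Synthetic stand-ins for the four saved files:
target2i = {"cat": 0, "dog": 1}   # ...-target2i.pickle (word -> row index)
rows = np.array([0, 0, 1])        # ...-rows.npy
cols = np.array([2, 5, 2])        # ...-cols.npy
vals = np.array([3.0, 1.0, 2.0])  # ...-vals.npy

# Reassemble the sparse word x (relation, word) matrix.
matrix = csr_matrix((vals, (rows, cols)))
cat_row = matrix[target2i["cat"]].toarray().ravel()
```

In the real setting you would load the arrays with numpy.load and the dictionary with pickle.load instead of defining them inline.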
Now that you have the matrix, you may decide which similarity measure to use.
from models import SkEThesSKE, SkEThesCOS
model_ske = SkEThesSKE("bnc2-matrix")
model_cos = SkEThesCOS("bnc2-matrix")
Now you can call functions like similarity, similarities and most_similar, or eval_analogy to evaluate the models on datasets of analogy queries.
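For intuition, the COS variant scores word pairs by the cosine of their matrix rows; a minimal self-contained sketch of that computation (not the project's actual code):

```python
import numpy as np
from scipy.sparse import csr_matrix

def cosine(m, i, j):
    """Cosine similarity between rows i and j of a sparse matrix."""
    a, b = m.getrow(i), m.getrow(j)
    denom = np.sqrt(a.multiply(a).sum() * b.multiply(b).sum())
    return a.multiply(b).sum() / denom if denom else 0.0

# Toy matrix: rows 0 and 1 are parallel, row 2 is orthogonal to both.
m = csr_matrix(np.array([[1.0, 2.0, 0.0],
                         [2.0, 4.0, 0.0],
                         [0.0, 0.0, 1.0]]))
```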
There is also a wrapper for the original implementation in oskethes.py, but the interface is a bit different: the original output is just a collection of several word similarities per word, so the co-occurrence matrix is gone and similarities below 0.05 are discarded.
If you have a corpus in a plain text file (one sentence per line), you may create a similar model with linear contexts (weighted symmetric context window):
python coocs.py plain-bnc.txt plain-bnc-matrix 20 5
- 20 is the minimum word frequency
- 5 is the context window size
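A weighted symmetric context window typically down-weights context words by their distance from the target. The following toy sketch counts co-occurrences that way; the 1/distance weighting is an assumption for illustration, not necessarily the scheme coocs.py uses:

```python
from collections import defaultdict

def window_cooccurrences(tokens, window=5):
    """Count symmetric-window co-occurrences, weighted 1/distance
    (the weighting scheme here is an assumption)."""
    counts = defaultdict(float)
    for i, w in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d < len(tokens):
                c = tokens[i + d]
                counts[(w, c)] += 1.0 / d  # count the pair both ways
                counts[(c, w)] += 1.0 / d
    return counts
```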
The matrix will contain raw co-occurrence counts, so you may consider using some weighting.
from models import SkEThesSKE
from weightings import ppmi
model_ske = SkEThesSKE("plain-bnc-matrix", weighting=ppmi)
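PPMI replaces each raw count with max(0, log(p(w,c) / (p(w) * p(c)))). The project's ppmi presumably works on the sparse matrix; here is a dense toy version just to illustrate the weighting itself:

```python
import numpy as np

def ppmi_dense(counts):
    """Positive PMI for a dense co-occurrence count matrix."""
    counts = counts.astype(float)
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0   # zero counts: log(0) -> -inf -> 0
    return np.maximum(pmi, 0.0)
```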
For Word2Vec models, this project wraps the gensim package. Everything that you can open with:
from gensim.models import Word2Vec
model = Word2Vec(model_name)
... you can open also with:
from models import Word2Vec
model = Word2Vec(model_name)
The interface as well as the evaluation script stays the same as in the SkEThesXXX models.
evaluation = model.eval_analogy(dataset)
The dataset is a dictionary category: list_of_queries. Each query should be a tuple like:
("paris", "france", "london", {"england", "britain", "uk"})
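A hypothetical minimal dataset in this shape (categories and queries invented for illustration):

```python
# Toy analogy dataset: category -> list of (a, b, c, accepted_answers).
dataset = {
    "capitals": [
        ("paris", "france", "london", {"england", "britain", "uk"}),
        ("berlin", "germany", "rome", {"italy"}),
    ],
}
```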
You may configure the evaluation in various ways:
from formulas import mul
my_mul = lambda a, b, aa: mul(a, b, aa, coeff=0.05)
evaluation = model.eval_analogy(dataset, topn=5, exclusion_trick=False, formula=my_mul)
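The mul formula is presumably a multiplicative combination in the spirit of 3CosMul (Levy and Goldberg): candidates similar to b and aa are boosted while similarity to a is penalized. A generic sketch, with the exact role and default of coeff an assumption:

```python
import numpy as np

def mul(a, b, aa, coeff=0.001):
    """Multiplicative analogy score over similarity vectors.

    a, b, aa: similarity of every candidate word to the three query
    words, assumed shifted into [0, 1]; coeff guards against division
    by zero (its default here is an assumption)."""
    return (b * aa) / (a + coeff)
```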
And see the results:
evaluation[category]["acc"] # 0.0--1.0
evaluation[category]["acc_top1"] # 0.0--1.0
evaluation[category]["oov"] # number of queries containing an out-of-vocabulary word
evaluation[category]["oovs"] # set of oov words
evaluation[category]["queries"] # list of queries and their candidate answers (excluding queries with oov words)
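To summarize across categories you can, for example, macro-average the per-category accuracies; a small helper (hypothetical, not part of the project):

```python
def macro_acc(evaluation, key="acc"):
    """Unweighted mean of a per-category score over all categories."""
    scores = [cat[key] for cat in evaluation.values()]
    return sum(scores) / len(scores) if scores else 0.0

# Toy evaluation dict in the shape described above.
evaluation = {
    "capitals": {"acc": 0.8, "acc_top1": 0.6},
    "currencies": {"acc": 0.4, "acc_top1": 0.2},
}
```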