Added info about code location.
minimalparts committed Apr 13, 2018
1 parent eca6cb1 commit d08c961
Showing 2 changed files with 18 additions and 4 deletions.
17 changes: 14 additions & 3 deletions README.md
@@ -10,20 +10,31 @@ A. Herbelot and M. Baroni. 2017. High-risk learning: Acquiring new word vectors

Distributional semantics models are known to struggle with small data. It is generally accepted that in order to learn 'a good vector' for a word, a model must have sufficient examples of its usage. This contradicts the fact that humans can guess the meaning of a word from a few occurrences only. In this paper, we show that a neural language model such as Word2Vec only necessitates minor modifications to its standard architecture to learn new terms from tiny data, using background knowledge from a previously learnt semantic space. We test our model on word definitions and on a nonce task involving 2-6 sentences' worth of context, showing a large increase in performance over state-of-the-art models on the definitional task.

## A note on the code
We have had queries about *where* exactly the Nonce2Vec code resides. Since it is a modification of the original gensim Word2Vec model, it is located in the gensim/models directory, confusingly still under the name *word2vec.py*. All modifications described in the paper are implemented in that file. Note that there is no C implementation of Nonce2Vec, so the program runs on plain numpy. Also, only skip-gram is implemented; the CBOW functions in the code are unchanged from the original Word2Vec.
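
For orientation, here is a rough sketch of the general workflow written against a recent *stock* gensim API rather than the modified word2vec.py in this repo. The nonce word and its context sentences are invented for illustration, plain gensim will also update the background vectors (which Nonce2Vec avoids), and the pre-trained Wikipedia model mentioned below was saved with an older gensim, so it may not load in a recent release:

```python
# Rough sketch only (not the repo's entry point): the general shape of the
# nonce-learning workflow, written against a recent *stock* gensim API.
# Plain gensim will also update the background vectors and has none of the
# Nonce2Vec-specific machinery (high initial learning rate, per-sentence
# parameter decay, nonce-only updates) that lives in the modified word2vec.py.
from gensim.models import Word2Vec
from gensim.utils import RULE_KEEP

# Background space trained on Wikipedia (see the download instructions below).
model = Word2Vec.load('models/wiki_all.sent.split.model')

# Two invented context sentences for a made-up nonce word, '___frobble'.
nonce_sentences = [
    ['a', '___frobble', 'is', 'a', 'small', 'tool', 'for', 'weeding'],
    ['she', 'loosened', 'the', 'soil', 'with', 'her', '___frobble'],
]

# Add the nonce to the existing vocabulary; the trim_rule keeps it even
# though it falls below the model's min_count.
model.build_vocab(nonce_sentences, update=True,
                  trim_rule=lambda word, count, min_count: RULE_KEEP)

# One pass over the tiny context (Nonce2Vec itself controls the learning
# rate and window decay differently; see the paper).
model.train(nonce_sentences, total_examples=len(nonce_sentences), epochs=1)

print(model.wv.most_similar('___frobble', topn=5))
```

The test scripts below perform this kind of per-item update for every entry in the test sets, with the Nonce2Vec parameters supplied on the command line.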


## Pre-requisites
You will need a pre-trained gensim model. You can go and train one yourself, using the gensim repo at [https://github.com/rare-technologies/gensim](https://github.com/rare-technologies/gensim), or simply download ours, pre-trained on Wikipedia:

`wget http://clic.cimec.unitn.it/~aurelie.herbelot/wiki_all.model.tar.gz`

If you use our tar file, the content should be unpacked into the models/ directory of the repo.
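
For example, assuming the archive unpacks directly to the model files, something like the following will put it in the right place:

`tar -xzf wiki_all.model.tar.gz -C models/`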

## Running the code

Here is an example of how to run the code on the test set of the definitional dataset, with the best parameters identified in the paper:

`python test_def_nonces.py models/wiki_all.sent.split.model data/definitions/nonce.definitions.300.test 1 10000 3 15 1 70 1.9 5`

For the chimeras dataset, you can run with:

`python test_chimeras.py models/wiki_all.sent.split.model data/chimeras/chimeras.dataset.l4.tokenised.test.txt 1 10000 3 15 1 70 1.9 5`

(changing the chimeras test set to evaluate on 2, 4 or 6 sentences of context).
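
Assuming the 2- and 6-sentence splits follow the same naming pattern as the 4-sentence file (i.e. *l2* and *l6*), all three context sizes can be evaluated in one go, for instance:

`for n in 2 4 6; do python test_chimeras.py models/wiki_all.sent.split.model data/chimeras/chimeras.dataset.l${n}.tokenised.test.txt 1 10000 3 15 1 70 1.9 5; done`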


## The data

In the data/ folder, you will find two datasets, split into training and test sets:

5 changes: 4 additions & 1 deletion gensim/models/word2vec.py
@@ -1,9 +1,12 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Original Word2Vec implementation:
# Copyright (C) 2013 Radim Rehurek <me@radimrehurek.com>
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

#
# This modification (referred to as 'Nonce2Vec'):
# Aurelie Herbelot and Marco Baroni

"""
Deep learning via word2vec's "skip-gram and CBOW models", using either
