Added info about code location.
minimalparts committed Apr 13, 2018
1 parent eca6cb1 commit d08c961
Showing 2 changed files with 18 additions and 4 deletions.
17 changes: 14 additions & 3 deletions README.md
@@ -10,20 +10,31 @@ A. Herbelot and M. Baroni. 2017. High-risk learning: Acquiring new word vectors

Distributional semantics models are known to struggle with small data. It is generally accepted that in order to learn 'a good vector' for a word, a model must have sufficient examples of its usage. This contradicts the fact that humans can guess the meaning of a word from a few occurrences only. In this paper, we show that a neural language model such as Word2Vec only necessitates minor modifications to its standard architecture to learn new terms from tiny data, using background knowledge from a previously learnt semantic space. We test our model on word definitions and on a nonce task involving 2-6 sentences' worth of context, showing a large increase in performance over state-of-the-art models on the definitional task.

## A note on the code
We have had queries about *where* exactly the Nonce2Vec code resides. Since it is a modification of the original gensim Word2Vec model, it is located in the gensim/models directory, confusingly still under the name *word2vec.py*. All modifications described in the paper are implemented in that file. Note that there is no C implementation of Nonce2Vec, so the program runs on plain numpy. Also, only skip-gram is implemented; the CBOW functions in the code are unchanged from the original Word2Vec.
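
For orientation, here is a rough sketch of the general workflow written against a recent *stock* gensim API rather than the modified word2vec.py in this repo. The nonce word and its context sentences are invented for illustration, plain gensim will also update the background vectors (which Nonce2Vec avoids), and the pre-trained Wikipedia model mentioned below was saved with an older gensim, so it may not load in a recent release:

```python
# Rough sketch only (not the repo's entry point): the general shape of the
# nonce-learning workflow, written against a recent *stock* gensim API.
# Plain gensim will also update the background vectors and has none of the
# Nonce2Vec-specific machinery (high initial learning rate, per-sentence
# parameter decay, nonce-only updates) that lives in the modified word2vec.py.
from gensim.models import Word2Vec
from gensim.utils import RULE_KEEP

# Background space trained on Wikipedia (see the download instructions below).
model = Word2Vec.load('models/wiki_all.sent.split.model')

# Two invented context sentences for a made-up nonce word, '___frobble'.
nonce_sentences = [
    ['a', '___frobble', 'is', 'a', 'small', 'tool', 'for', 'weeding'],
    ['she', 'loosened', 'the', 'soil', 'with', 'her', '___frobble'],
]

# Add the nonce to the existing vocabulary; the trim_rule keeps it even
# though it falls below the model's min_count.
model.build_vocab(nonce_sentences, update=True,
                  trim_rule=lambda word, count, min_count: RULE_KEEP)

# One pass over the tiny context (Nonce2Vec itself controls the learning
# rate and window decay differently; see the paper).
model.train(nonce_sentences, total_examples=len(nonce_sentences), epochs=1)

print(model.wv.most_similar('___frobble', topn=5))
```

The test scripts below perform this kind of per-item update for every entry in the test sets, with the Nonce2Vec parameters supplied on the command line.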


## Pre-requisites
You will need a pre-trained gensim model. You can go and train one yourself, using the gensim repo at [https://github.com/rare-technologies/gensim](https://github.com/rare-technologies/gensim), or simply download ours, pre-trained on Wikipedia:

`wget http://clic.cimec.unitn.it/~aurelie.herbelot/wiki_all.model.tar.gz`

If you use our tar file, the content should be unpacked into the models/ directory of the repo.
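
For example, assuming the archive unpacks directly to the model files, something like the following will put it in the right place:

`tar -xzf wiki_all.model.tar.gz -C models/`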

## Running the code

Here is an example of how to run the code on the test set of the definitional dataset, with the best parameters identified in the paper:

`python test_def_nonces.py models/wiki_all.sent.split.model data/definitions/nonce.definitions.300.test 1 10000 3 15 1 70 1.9 5`

For the chimeras dataset, you can run with:

`python test_chimeras.py models/wiki_all.sent.split.model data/chimeras/chimeras.dataset.l4.tokenised.test.txt 1 10000 3 15 1 70 1.9 5`

(changing the chimeras test set to evaluate on 2, 4 or 6 sentences of context).
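
Assuming the 2- and 6-sentence splits follow the same naming pattern as the 4-sentence file (i.e. *l2* and *l6*), all three context sizes can be evaluated in one go, for instance:

`for n in 2 4 6; do python test_chimeras.py models/wiki_all.sent.split.model data/chimeras/chimeras.dataset.l${n}.tokenised.test.txt 1 10000 3 15 1 70 1.9 5; done`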


## The data

In the data/ folder, you will find two datasets, split into training and test sets:

5 changes: 4 additions & 1 deletion gensim/models/word2vec.py
@@ -1,9 +1,12 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Original Word2Vec implementation:
# Copyright (C) 2013 Radim Rehurek <me@radimrehurek.com>
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

#
# This modification (referred to as 'Nonce2Vec'):
# Aurelie Herbelot and Marco Baroni

"""
Deep learning via word2vec's "skip-gram and CBOW models", using either
