-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New feature: wordrank wrapper #1066
Conversation
@@ -118,7 +118,7 @@ def readfile(fname): | |||
|
|||
python_2_6_backports = '' | |||
if sys.version_info[:2] < (2, 7): | |||
python_2_6_backports = ['argparse', 'subprocess32'] | |||
python_2_6_backports = ['argparse', 'subprocess32', 'backport_collections'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should subprocess32
be removed now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
@@ -1154,7 +1154,7 @@ def check_output(*popenargs, **kwargs): | |||
Added extra KeyboardInterrupt handling | |||
""" | |||
try: | |||
process = subprocess.Popen(stdout=subprocess.PIPE, *popenargs, **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stdout default has to be specified for other wrappers to work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you confirm that ldamallet wrapper works even without the default stdout specified? what prevents you from keeping it as default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ldamallet wrapper test passes without the default stdout
Please add the new class to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor changes
@@ -0,0 +1,286 @@ | |||
{ | |||
"cells": [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's call it "WordRank_wrapper_quickstart.ipynb"
], | ||
"source": [ | ||
"word_similarity_file = 'datasets/ws-353.txt'\n", | ||
"model.wv.evaluate_word_pairs(word_similarity_file)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
merge in latest develop to get correct output of this cell
"source": [ | ||
"# Comparison of WordRank, Word2Vec and FastText\n", | ||
"\n", | ||
"Wordrank is a fresh new approach to the word embeddings, which formulates it as a ranking problem. That is, given a word w, it aims to output an ordered list (c1, c2, · · ·) of context words such that words that co-occur with w appear at the top of the list. This formulation fits naturally to popular word embedding tasks such as word similarity/analogy since instead of the likelihood of each word, we are interested in finding the most relevant words\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
link to wordrank from here. same link as in references.
"metadata": {}, | ||
"source": [ | ||
"# Comparison of WordRank, Word2Vec and FastText\n", | ||
"\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a link to your blog. say "this ipynb accompanies a more theoretical blog post [link] "
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adding dummy url, will change if final url changes
>>> print model[word] # prints vector for given words | ||
|
||
.. [1] https://bitbucket.org/shihaoji/wordrank/ | ||
.. [2] https://arxiv.org/pdf/1506.02761v3.pdf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add
# Copyright (C) 2017 Parul Sethi <email>
# Copyright (C) 2017 Radim Rehurek <me@radimrehurek.com>
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html
copyfile(corpus_file, os.path.join(meta_dir, corpus_file.split('/')[-1])) | ||
os.chdir(meta_dir) | ||
|
||
cmd0 = ['../../glove/vocab_count', '-min-count', str(min_count), '-max-vocab', str(max_vocab_size)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please give meaningful names to variables like 'cmd_vocab_count'
cmd3 = ['cut', '-d', " ", '-f', '1', temp_vocab_file] | ||
cmds = [cmd0, cmd1, cmd2, cmd3] | ||
logger.info("Preparing training data using glove code '%s'", cmds) | ||
o0 = smart_open(temp_vocab_file, 'w') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
meaningful names here too please
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's safer to open files in binary mode (wb
), and explicitly encode all strings written there.
Has this been tested on unicode (non-ASCII) data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tried on hindi characters, it works
self.wv.vocab[word].count = counts[word] | ||
|
||
def ensemble_embedding(self, word_embedding, context_embedding): | ||
"""Addition of two embeddings.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better docstring Replace syn0 with the sum of context and word embeddings
glove2word2vec(context_embedding, context_embedding+'.w2vformat') | ||
w_emb = Word2Vec.load_word2vec_format('%s.w2vformat' % word_embedding) | ||
c_emb = Word2Vec.load_word2vec_format('%s.w2vformat' % context_embedding) | ||
assert Counter(w_emb.wv.index2word) == Counter(c_emb.wv.index2word), 'Vocabs are not same for both embeddings' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a vocab comparison would do. no need for Counter
outputs = [o0, o1, o2, o3] | ||
inputs = [i0, i1, i2, i3] | ||
prepare_train_data = [utils.check_output(cmd, stdin=inp, stdout=out) for cmd, inp, out in zip(cmds, inputs, outputs)] | ||
o0.close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Best practice is to use context managers for opening files for writing (with smart_open() as input0: ...
).
This whole code section would read better if rewritten as a loop (for command, input_fname, output_fname in zip(commands, input_fnames, output_fnames): with smart_open(...): ...
).
glove2word2vec(context_embedding, context_embedding+'.w2vformat') | ||
w_emb = Word2Vec.load_word2vec_format('%s.w2vformat' % word_embedding) | ||
c_emb = Word2Vec.load_word2vec_format('%s.w2vformat' % context_embedding) | ||
assert Counter(w_emb.wv.index2word) == Counter(c_emb.wv.index2word), 'Vocabs are not same for both embeddings' | ||
assert set(w_emb.wv.index2word) == set(c_emb.wv.index2word), 'Vocabs are not same for both embeddings' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it possible to compare wv.vocab
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yes, similarly using set(wv.vocab)
correcting in next commit
@@ -0,0 +1,353 @@ | |||
love sex 6.77 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This dataset is already in Gensim in https://github.com/parulsethi/gensim/blob/develop/gensim/test/test_data/wordsim353.tsv Please use it there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -0,0 +1,999 @@ | |||
old new 1.58 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please move to test/test_data
so it can be used in other code. Similar to https://github.com/parulsethi/gensim/blob/develop/gensim/test/test_data/wordsim353.tsv
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
"# WordRank wrapper tutorial on Lee Corpus\n", | ||
"\n", | ||
"WordRank is a new word embedding algorithm which captures the semantic similarities in a text data well. See this [notebook](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Wordrank_comparisons.ipynb) for it's comparisons to other popular embedding models. This tutorial will serve as a guide to use the WordRank wrapper in gensim. You need to install [WordRank](https://bitbucket.org/shihaoji/wordrank) before proceeding with this tutorial.\n", | ||
"\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
@@ -234,7 +234,7 @@ def show_topics(self, num_topics=10, num_words=10, log=False, formatted=True): | |||
if formatted: | |||
topic = self.print_topic(i, topn=num_words) | |||
else: | |||
topic = self.show_topic(i, topn=num_words) | |||
topic = self.show_topic(i, num_words=num_words) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this shouldn't be in this pr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All show_topics
functions now have they same API as in https://github.com/RaRe-Technologies/gensim/blob/c2bfa48e71f851bc2f79fcbe51f531d6813a4503/gensim/models/basemodel.py#L12
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree that num_words is correct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(just to mention in review thread)
show_topic()
doesn't have topn keyword argument but num_words, this fixes it
@@ -1146,15 +1146,16 @@ def keep_vocab_item(word, count, min_count, trim_rule=None): | |||
else: | |||
return default_res | |||
|
|||
def check_output(*popenargs, **kwargs): | |||
def check_output(*popenargs, stdout=subprocess.PIPE, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will keep the previous default stdout=subprocess.PIPE
for other wrappers, and a different stdout
can be defined for wordrank wrapper
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm bit doubtful about a workaround for this, it gave a syntax error for python2 in above check.
If I specify stdout=subprocess.PIPE
as first argument for making py2 compatible, it won't work for other wrappers as they have cmd
(basic command) as their first argument
@@ -103,7 +103,7 @@ def train(cls, wr_path, corpus_file, out_path, size=100, window=15, symmetric=1, | |||
for command, input_fname, output_fname in zip(commands, input_fnames, output_fnames): | |||
with smart_open(input_fname, 'rb') as r: | |||
with smart_open(output_fname, 'wb') as w: | |||
utils.check_output(command, stdin=r, stdout=w) | |||
utils.check_output(command, stdout=w, stdin=r) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed as per current utils.check_output
update in this PR
@@ -1146,15 +1146,15 @@ def keep_vocab_item(word, count, min_count, trim_rule=None): | |||
else: | |||
return default_res | |||
|
|||
def check_output(*popenargs, stdout=subprocess.PIPE, **kwargs): | |||
def check_output(stdout=subprocess.PIPE, *popenargs, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this keeps default stdout=subprocess.PIPE
. and check_output calls in all wrappers are updated to specify cmd as keyword argument rather than positional to work with this change
This PR adds python wrapper for Wordrank.
utils.check_output
is modified here to pass an open file as stdout which is required in wrapper's train method.Todo: