added Word2vec to Tensorflow 2D tensor file #1051
Requested minor changes like comments and logging.
# | ||
# Copyright (C) 2016 Loreto Parisi <loretoparisi@gmail.com> | ||
# Copyright (C) 2016 Silvio Ogliastri <silvio.olivastri@gmail.com> | ||
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html |
May I ask for joint copyright as in https://github.com/RaRe-Technologies/gensim/wiki/Developer-page#legal ?
The script will create two TSV files. A 2d tensor format file, and a Word Embedding metadata file. Both files will | ||
use the --output file name as prefix | ||
This script is used to convert the word2vec format to Tensorflow 2D tensor and metadata formats for Embedding Visualization | ||
For more information about TensorBoard format see: https://www.tensorflow.org/versions/master/how_tos/embedding_viz/ |
Please add instructions on how to see the viz in tensorboard, i.e. "Launch `tensorboard --logdir=dir_with_tsv`"
|
||
logger = logging.getLogger(__name__) | ||
|
||
''' |
Please move the docstring to after the `def word2vec2tensor` line.
for word in model.index2word: | ||
file_metadata.write(word.encode('utf-8') + '\n') | ||
vector_row = '\t'.join(map(str, model[word])) | ||
file_vector.write(vector_row + '\n') |
Please log the location and name of the written files.
Thanks, I have added further instructions and logging.
Done thanks.
I had an error occurring at line 54. TypeError: can't concat str to bytes
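For context, this is the error Python 3 raises when a `bytes` object is concatenated with a `str`, which is what `word.encode('utf-8') + '\n'` does. Below is a minimal sketch of one way to avoid it, assuming the output file is opened in binary mode. The helper name `write_metadata` and the sample words are hypothetical and only for illustration; this is not the PR's code.

```python
import io


def write_metadata(words, fileobj):
    """Write one UTF-8 encoded word per line to a binary file object."""
    for word in words:
        # Append the newline before encoding so that only bytes are written.
        fileobj.write((word + '\n').encode('utf-8'))


# io.BytesIO is a stand-in for a file opened with mode 'wb'.
buffer = io.BytesIO()
write_metadata(['king', 'café'], buffer)
metadata_bytes = buffer.getvalue()
```

The alternative is to open the file in text mode with an explicit encoding and write `str` values throughout; either way, the two operands of `+` need to be the same type.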
@anmol01gulati @parulsethi may I ask you to alpha test this PR? |
@word2vec_model_path word2vec model | ||
@tensor_filename tensor filename prefix | ||
''' | ||
model = gensim.models.Word2Vec.load_word2vec_format(word2vec_model_path, binary=True) |
Better to use text format here, or at least make it optional.
You can make the format an optional argument, like
parser.add_argument("-b", "--binary", required=False, help="If word2vec model in binary format, set True, else False")
and pass the argument to the word2vec2tensor function
def word2vec2tensor(word2vec_model_path,tensor_filename, binary=False):
    model = gensim.models.Word2Vec.load_word2vec_format(word2vec_model_path, binary=binary)
keeping text format as the default.
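One caveat with the suggestion above: an argparse option without a `type` or `action` is parsed as a string, so `--binary False` would yield the non-empty string `"False"`, which is truthy. A minimal sketch of a boolean flag using `action="store_true"` follows; the `build_parser` helper and its description text are hypothetical, and whether the PR adopted this form is not stated in the thread.

```python
import argparse


def build_parser():
    """Build a parser with an optional boolean --binary flag (illustrative only)."""
    parser = argparse.ArgumentParser(description="word2vec to TSV converter (sketch)")
    parser.add_argument(
        "-b", "--binary", action="store_true",
        help="Set this flag if the word2vec model is in binary format")
    return parser


# Without the flag the value defaults to False (text format).
args_text = build_parser().parse_args([])
# With the flag the value is True (binary format).
args_binary = build_parser().parse_args(["--binary"])
```

The parsed value can then be passed straight through as `binary=args.binary`, keeping text format as the default.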
All done thanks.
Sorry, forgot this earlier: add a space before `tensor_filename` in `def word2vec2tensor()`.
|
||
with open(outfiletsv, 'w+') as file_vector: | ||
with open(outfiletsvmeta, 'w+') as file_metadata: | ||
for word in model.index2word: |
use model.wv.index2word
fixed.
Did this change get overwritten?
Refactoring of docstrings in python style, changed index2word api.
Fixed docs, added optional binary model format argument
Made all requested changes for logging, comments, and the optional argument for binary mode.
@tmylk does it need a separate test file?
@parulsethi The problem with testing this is that we can't load it into TensorBoard on Travis. So no tests needed.
@loretoparisi Thanks a lot for the PR! Visualisation is the top priority on our roadmap.
@tmylk @loretoparisi sorry, I only got to review now -- a few minor changes needed. Thanks for the cool new feature!
@@ -0,0 +1,86 @@ | |||
#!/usr/bin/env python | |||
# -*- coding: utf-8 -*- | |||
# |
Missing license (LGPL, like the rest of gensim). @tmylk @loretoparisi
outfiletsv = tensor_filename + '_tensor.tsv' | ||
outfiletsvmeta = tensor_filename + '_metadata.tsv' | ||
|
||
with open(outfiletsv, 'w+') as file_vector: |
Use `smart_open` instead.
outfiletsvmeta = tensor_filename + '_metadata.tsv' | ||
|
||
with open(outfiletsv, 'w+') as file_vector: | ||
with open(outfiletsvmeta, 'w+') as file_metadata: |
Ditto.
vector_row = '\t'.join(map(str, model[word])) | ||
file_vector.write(vector_row + '\n') | ||
|
||
logger.info("2D tensor file saved to %s" % outfiletsv) |
@tmylk little nitpick, but for the future, prefer `logger.xyz("log %s", something)`, not `logger.xyz("log %s" % something)` (use lazy argument formatting).
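To illustrate the difference: with lazy formatting the logging module only builds the message if a handler will actually emit it, whereas `%` formats the string before the call regardless of level. A minimal sketch, where the `CountingPath` helper and the logger name are hypothetical and exist only to make the effect observable:

```python
import logging

logger = logging.getLogger("word2vec2tensor_sketch")
logger.setLevel(logging.WARNING)  # INFO messages are filtered out


class CountingPath(object):
    """Hypothetical helper that counts how often it is rendered to text."""

    def __init__(self, path):
        self.path = path
        self.renders = 0

    def __str__(self):
        self.renders += 1
        return self.path


outfiletsv = CountingPath("model_tensor.tsv")

# Lazy: the argument is passed separately and never formatted,
# because the INFO level is disabled for this logger.
logger.info("2D tensor file saved to %s", outfiletsv)
lazy_renders = outfiletsv.renders

# Eager: the % operator formats the string before logger.info is even called.
logger.info("2D tensor file saved to %s" % outfiletsv)
eager_renders = outfiletsv.renders
```

Here `lazy_renders` stays at zero while `eager_renders` becomes one, even though neither message is emitted.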
parser.add_argument( | ||
"-o", "--output", required=True, | ||
help="Output tensor file name prefix") | ||
parser.add_argument( "-b", "--binary", |
No vertical indent in gensim -- use normal hanging indent.
FYI, an alternative way to achieve this goal with TensorFlow: https://gist.github.com/lampts/026a4d6400b1efac9a13a3296f16e655
This script is used to convert the word2vec format to Tensorflow 2D tensor and metadata formats for Embedding Projector Visualization
For more information about TensorBoard format see: https://www.tensorflow.org/versions/master/how_tos/embedding_viz/
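As a rough illustration of the two output files discussed in this PR, here is a minimal sketch that writes a tab-separated vector file and a one-word-per-line metadata file using the `_tensor.tsv` and `_metadata.tsv` suffixes from the diff. It assumes a plain list of `(word, vector)` pairs in place of a loaded gensim model; the function name `write_projector_files` and the sample data are hypothetical, and this is not the merged script.

```python
import os
import tempfile


def write_projector_files(vectors, prefix):
    """Write <prefix>_tensor.tsv and <prefix>_metadata.tsv from (word, vector) pairs."""
    tensor_path = prefix + '_tensor.tsv'
    metadata_path = prefix + '_metadata.tsv'
    # Text mode with an explicit encoding keeps every write a str.
    with open(tensor_path, 'w', encoding='utf-8') as file_vector, \
            open(metadata_path, 'w', encoding='utf-8') as file_metadata:
        for word, vector in vectors:
            # One label per line; row order must match the tensor file.
            file_metadata.write(word + '\n')
            # One vector per line, components separated by tabs.
            file_vector.write('\t'.join(str(x) for x in vector) + '\n')
    return tensor_path, metadata_path


# Usage with made-up two-dimensional vectors in a temporary directory.
sample = [("king", [0.1, 0.2]), ("queen", [0.3, 0.4])]
prefix = os.path.join(tempfile.mkdtemp(), "sample")
tensor_path, metadata_path = write_projector_files(sample, prefix)
```

Row `i` of the tensor file corresponds to line `i` of the metadata file, which is how the labels are matched to points when both files are loaded into the Embedding Projector.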