-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
error due to mallet 2.0.7 solved #664
Conversation
@@ -36,6 +36,8 @@ | |||
|
|||
import numpy | |||
|
|||
from bs4 import BeautifulSoup |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using Beautiful soup for parsing correct XML is an overkill. Please use from xml.etree import ElementTree
@@ -245,11 +247,20 @@ def show_topic(self, topicid, topn=10): | |||
def print_topic(self, topicid, topn=10): | |||
return ' + '.join(['%.3f*%s' % v for v in self.show_topic(topicid, topn)]) | |||
|
|||
######### function to return the version of mallet ###### |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use Gensim style of doc-strings (PEP257).
doc = [] | ||
while pointer < len(parts): | ||
if float(parts[pointer]) == int(parts[pointer]): | ||
if float(parts[pointer+1]) > eps: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the while
condition, this [pointer + 1]
access can result in out-of-bounds IndexError
, no?
Also, the logic around these if
s deserves some comments, it's not clear what's happening and why.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont think so. I am adding comments for this if logic. Except when there are no topics for a document, it will work fine. Neverthless I wiill run some more test cases to validate it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I read this branch correctly, then if the last number on a line is an int, this if will raise an IndexError
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added the boundary condition as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it will. But consider the following:
There are only 2 formats possible namely 20.2345 or 2 0.2345, right
in the former case the while block will go to the else condition and in latter case it will go to if section. The format type produced by mallet is a kind of security for this error.
@cscorley Please review the fix in this PR. I have also sent you an email about the same with other details. |
""" | ||
Check version of mallet via jar file | ||
""" | ||
archive = zipfile.ZipFile(direc_path, 'r') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to use smart_open
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
smart_open
doesn't handle zip files (=multiple files in a single archive). It only handles file compression, not multi-file archives like zip, tar, rar etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, according to the following eg in the smart_open repo's read me:
for line in smart_open.smart_open('./foo.txt.gz'):
... print line
We can only unzip a file (say a txt file).
I think zipfile is OK. If there not then I will look for another library to handle jar file.
Testing this with Mallet 2.0.8RC3 it seems that when topic_threshold is set the output reverts back to topic/proportion format from the dense format. |
That seems to be correct. Thanks for pointing out @syting . I have taken care of that. |
purposing a solution for : gensim==0.12.4 mallet wrapper is not compatible with mallet-2.0.7 #599