
Implementation of Montemurro and Zanette's entropy based keyword extraction algorithm #1738

Merged (19 commits, Dec 1, 2017)
Conversation

PeteBleackley
Contributor

Implemented as per #665

Contributor

@menshikh-iv menshikh-iv left a comment


import scipy

def mz_keywords(text,blocksize=1024,scores=False,split=False,weighted=True,threshold=0.0):
"""Extract keywords from text using the Montemurro and Zanette entropy algorithm.
Contributor


Please use numpy-style format for docstring (here and everywhere)

Contributor Author


Given that the return type varies according to the scores and split arguments, what should I put in the Returns section?

Contributor


Enumerate all of them, and add a description of the condition under which each type is returned.
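For a function whose return type depends on its arguments, a numpy-style `Returns` section can list each type with its condition. A hypothetical sketch (parameter docs abbreviated, body omitted):

```python
def mz_keywords(text, blocksize=1024, scores=False, split=False,
                weighted=True, threshold=0.0):
    """Extract keywords using the Montemurro and Zanette entropy algorithm.

    Returns
    -------
    str
        Newline-joined keywords, if ``split=False`` and ``scores=False``.
    list of str
        Keywords as a list, if ``split=True`` and ``scores=False``.
    list of (str, float)
        Keywords with their entropy scores, if ``scores=True``.
    """
    raise NotImplementedError  # docstring illustration only
```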

Contributor Author


Can you point me to an example in the existing docs?

Contributor

@menshikh-iv menshikh-iv Nov 24, 2017


@PeteBleackley sorry, but I have no example for this case.

'auto' calculates the threshold as
nblocks / (nblocks + 1.0).
Use 'auto' with weighted=False"""
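The 'auto' threshold described in the docstring is a one-line formula. A minimal sketch (the helper name `auto_threshold` is hypothetical, for illustration only):

```python
def auto_threshold(nblocks):
    # Threshold used when threshold='auto': nblocks / (nblocks + 1.0),
    # where nblocks is the number of blocks the document splits into.
    return nblocks / (nblocks + 1.0)
```

For a document that splits into 329 blocks, this gives 329/330, which is the value the PR's test later asserts against.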
text=to_unicode(text)
Contributor


Many PEP8 issues (on practically every line); look at the checker log.

logp=numpy.log2(p)
H=numpy.nan_to_num((p*logp),0.0).sum(axis=0)

def log_combinations(n,m):
Contributor


Better to move the definitions of log_combinations, marginal_prob, and marginal outside this function, and start their names with _.

Contributor Author


I'm refactoring the core functionality into a private class to address this issue.

Contributor


Yes, that's a possible option.

text=to_unicode(text)
words=[word for word in _tokenize_by_word(text)]
vocab=sorted(set(words))
wordcounts=numpy.array([[words[i:i+blocksize].count(word) for word in vocab]
Contributor


Use hanging indents (instead of vertical), here and everywhere.
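For reference, PEP 8's hanging-indent style puts nothing after the opening bracket and indents continuation lines one extra level, instead of aligning them vertically under the opening bracket. A small self-contained illustration (the sample data is made up):

```python
words = "the cat sat on the mat the cat".split()
vocab = sorted(set(words))
blocksize = 4

# Hanging indent: the opening bracket ends its line, continuation
# lines get one extra indentation level, and the closing bracket
# sits on its own line.
wordcounts = [
    [words[i:i + blocksize].count(word) for word in vocab]
    for i in range(0, len(words), blocksize)
]
```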

@menshikh-iv
Contributor

ping @PeteBleackley

@PeteBleackley
Contributor Author

I have fixed the style issues, refactored the inner functions and written a test. I just need to update the tutorial now.

@menshikh-iv
Contributor

menshikh-iv commented Nov 27, 2017

Don't forget to push your commits @PeteBleackley :) (sorry, misclick)

@menshikh-iv menshikh-iv reopened this Nov 27, 2017
@PeteBleackley
Contributor Author

Bother, I've got a nasty merge conflict in the tutorial now.

@menshikh-iv
Contributor

Oh, resolving merge conflicts in notebooks is painful; I advise you to take the versions from gensim and make your changes again.

@PeteBleackley
Contributor Author

More or less fixed now, except that all the newlines are double-escaped.

@PeteBleackley
Contributor Author

Pushed changes.

Contributor

@menshikh-iv menshikh-iv left a comment


Please continue the improvements, and also look at https://travis-ci.org/RaRe-Technologies/gensim/jobs/307965705#L833 (something has gone wrong with your algorithm).

@@ -0,0 +1,512 @@
{
Contributor


Incorrect file, please remove it.

"cell_type": "markdown",
"metadata": {},
"source": [
"By default, the algorithm weights the entropy by the overall frequency of the word in the document. We can remove this weighting by setting weighted=False"
Contributor


Where is the output of your code? Descriptions are needed too, of course, but users want to see the result (how it works).

@@ -147,6 +147,22 @@ def test_keywords_runs(self):

kwds_lst = keywords(text, split=True)
self.assertTrue(len(kwds_lst))

def test_mz_keywords(self):
Contributor


You need to add more concrete tests (not only a "sanity check"); I mean tests with concrete expected words.
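A concrete test asserts on specific words rather than mere non-emptiness. A hypothetical sketch (the helper name, the keyword list, and the expected words are all made up for illustration):

```python
def check_contains_keywords(kwds, expected):
    # A concrete test pins down specific words that must appear in the
    # output, instead of only asserting that the output is non-empty.
    missing = [word for word in expected if word not in kwds]
    assert not missing, 'missing expected keywords: %s' % missing

# Hypothetical output of mz_keywords(text, split=True):
kwds = ['lincoln', 'war', 'union', 'grant']
check_contains_keywords(kwds, ['lincoln', 'war'])
```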

text = to_unicode(text)
words = [word for word in _tokenize_by_word(text)]
vocab = sorted(set(words))
wordcounts = numpy.array([[words[i:i+blocksize].count(word)
Contributor


This line isn't very long, so no line break is needed here (here and below, for range and weights).

@@ -0,0 +1,118 @@
#!/usr/bin/env python
Contributor


The coding style is better now, but it still contains problems; please have a look at the flake8 log: https://travis-ci.org/RaRe-Technologies/gensim/jobs/307965704#L517

@PeteBleackley
Contributor Author

I've fixed the last set of requested changes, but there are some issues with a set of tests that I didn't write.

@PeteBleackley
Contributor Author

All Flake8 issues fixed now, but the tests are timing out. What do you suggest?

self.assertTrue(len(kwds_lst))
kwds_auto = mz_keywords(text, scores=True, weighted=False,
threshold='auto')
self.assertTrue(kwds_auto[-1][1] > 329.0 / 330.0)
Contributor


What does this "magic" 329.0 / 330.0 mean?

Contributor Author


The document I'm using for testing divides into 329 blocks, when the default block size of 1024 words is used. When threshold='auto', the threshold is calculated as nblocks/(nblocks+1). This is in the docstring for mz_entropy, but I'll add a comment to the test.

@@ -148,6 +148,26 @@ def test_keywords_runs(self):
kwds_lst = keywords(text, split=True)
self.assertTrue(len(kwds_lst))

def test_mz_keywords(self):
Contributor


Travis hangs because your implementation is really slow (the test doesn't finish within 10 minutes). You can make several changes:

  1. Try to optimize mz_keywords
  2. Use a really small corpus (for example, a micro-sample of the current dataset)
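The micro-sample option can be as simple as truncating the test document to a fixed number of words. A sketch, assuming the corpus is held as a plain string (the helper name `first_n_words` is hypothetical):

```python
def first_n_words(text, nwords=10240):
    # Truncate the corpus to its first `nwords` whitespace-separated
    # tokens so the keyword-extraction test fits in the CI time limit.
    return ' '.join(text.split()[:nwords])
```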

Contributor Author


I'll try using the first 10240 words.

@PeteBleackley
Contributor Author

All tests passing now.

@menshikh-iv
Contributor

Thanks @PeteBleackley, nice first contribution 🔥 👍 !

@menshikh-iv menshikh-iv merged commit c462bd0 into piskvorky:develop Dec 1, 2017