-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Test and refactor WikiCorpus #1821
Conversation
…the WikiCorpus class. * Used the same raw data found for other sources (9 articles). * Added Various wiki markup to test the parsing regural expressions
f9ace21
to
4125647
Compare
* Following the same inheritance schema as in the source TestWikiCorpus > TestTextCorpus > CorpusTestCase. * Testing methods are overriden where necessary to reflect logic changes. * All existing functionality is tested (account for markup handling, minimum article length etc)
4125647
to
836c3c2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good start @steremma, please continue your work 👍
gensim/test/test_corpora.py
Outdated
@@ -400,6 +402,73 @@ def test_indexing(self): | |||
pass | |||
|
|||
|
|||
class TestWikiCorpus(TestTextCorpus): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of your tests "stuck" in Travis, please check, what's a problem and fix https://travis-ci.org/RaRe-Technologies/gensim/jobs/322641221#L828
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, question: why you need TextTextCorpus
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am inheriting so that we make sure the WikiCorpus
class passes not only its own tests but also the one's defined for the TextCorpus
class (since WikiCorpus
inherits from TextCorpus
). Else we can inherit from CorpusTestCase
or even from unittest.Testcase
gensim/test/test_corpora.py
Outdated
self.assertEqual(len(docs), 9) | ||
|
||
def test_empty_input(self): | ||
tmpf = get_tmpfile('emptycorpus.xml.bz2') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better use with temporary_file('emptycorpus.xml.bz2'): ...
from gensim.test.utils
here (because we have auto-cleanup in this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to follow the same style as in the base test classes but you are right, context managers are safer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a lot of "dirty" code in the codebase, cleanup slow but go forward.
gensim/test/test_corpora.py
Outdated
def test_empty_input(self): | ||
tmpf = get_tmpfile('emptycorpus.xml.bz2') | ||
content = bz2.compress(b'') # Explicit string to byte conversion needed in python 3 | ||
fh = open(tmpf, "wb") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of open
/ close
please use with open(...)
gensim/test/test_corpora.py
Outdated
all_articles = corpus.get_texts() | ||
assert (len(list(all_articles)) == expected_articles) | ||
|
||
test_with_limit(0, 9) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better to make it in "cycle" (instead of defining test in test)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I will change that
…thin the test_corpora.py file. * Adapted all old tests to the new class * Current Test class schema ensures that WikiCorpus also passes tests defined in parents * Deleted test_wikicorpus.py since it is now redundant
Ping @steremma, how is going with "stuck" bug? Are you already launched tests with a debugger? |
Hello Ivan. I’m getting back from vacation tomorrow and I will keep working on it. I will let you know on gitter if the problem persists |
wc = WikiCorpus(datapath(FILENAME), processes=1, token_max_len=16, lemmatize=False) | ||
self.assertTrue(u'collectivization' in next(wc.get_texts())) | ||
|
||
def test_custom_tokenizer(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test completely missing, please restore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding it now
# self.assertEqual(type(first), list) | ||
# self.assertTrue(isinstance(first[0], bytes) or isinstance(first[0], str)) | ||
|
||
def test_first_element(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test missing too, please restore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right adding it now
gensim/test/test_corpora.py
Outdated
corpus = self.corpus_class(self.enwiki, processes=1, token_max_len=16, lemmatize=False) | ||
self.assertTrue(u'collectivization' in next(corpus.get_texts())) | ||
|
||
# TODO: sporadic failure to be investigated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's a problem here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one exists commented on test_wikicorpus, I don't know why. So I decided to keep it in case we want to look into it in the future. The comment says that the test some times fails unpredictably and indeed it fails on my machine. Do we keep it as comment or completely delete it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least - leave in place.
Best variant - investigate & fix.
# self.assertEqual(type(first), list) | ||
# self.assertTrue(isinstance(first[0], bytes) or isinstance(first[0], str)) | ||
|
||
def test_sample_text(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably better to skip this test (not silently pass)., what do you think @steremma?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that this test overrides the one defined in TestTextCorpus
. If we just skip it the parent definition will be called and it will fail because we plain text is not legit XML (nor is it compressed). In that sense passing it silently practically ignores it. The same idea is followed in the tests:
def test_save(self):
pass
def test_serialize(self):
pass
def test_serialize_compressed(self):
pass
def test_indexing(self):
pass
of TestTextCorpus
for these 4 tests defined in the parent class CorpusTestCase
. So I think its better to keep it as it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Best variant - change class hierarchy and interfaces, we should'nt give "useless" methods from parent class (change from TestWikiCorpus -> TestTextCorpus
to
TestTextCorpus -> BaseTestTextCorpus <- TestWikiCorpus
), but right now this isn't really needed, stay current variant with "pass".
@steremma LGTM, tell me, should I merge know, or you'll try to investigate a problem with #1821 (comment) first and merge after the fix? |
I propose that we merge it now for 2 reasons:
|
Great start @steremma, congratz with first contribution 🥇 |
Thanks a lot, it wouldn't be possible without your guidance. Looking forward to contribute more :) |
* Add wordnet mammal train file for Poincare notebook (piskvorky#1781) * Adds wordnet mammal train file * Adds link to data file in notebook * Fix tox.ini/setup.cfg configuration (piskvorky#1815) * update according to new pytest_benchmark version * update wheel-storage url * use only twine * Fix docstrings for `gensim.utils` (piskvorky#1797) * Add docstrings in numpy-style fromat * fix PEP8 * remove outdated "hack" (smart_open is core dependency right now) * fix docstrings[1] * remove unused internal class * fix docstrings[2] * fix docstrings[3] * fix docstrings[4] * fix docstrings[5] * fix docstrings[6] * fix docstrings[7] * fix docstrings[8] * add missing `pattern` to doc dependencies * fix docstrings[9] * fix docstrings[10] * Fix docstrings for `gensim.models.rpmodel` (piskvorky#1802) * first attempt to convert few lines into numpy-style doc * added parameters in documentation * more documentation * few corrections * show inheritance and undoc members * show special members * example is executable now * link to the paper added, named parameters * fixed doc * fixed doc * fixed whitespaces * fix docstrings & PEP8 * fix docstrings * fix typo * Fix docstrings for `gensim.models.translation_matrix` (piskvorky#1806) * convert Space class doc to numpy style * fix docstrings[1] * fix docstrings[2] * remove useless load * fix docstrings[3] * add missing import * fix docstrings[4] * Add CircleCI for build documentation. Fix piskvorky#1807 (piskvorky#1822) * init config for circle * change * rm cache, install tox distinctly * fix indentation & command * update venv * add pip-cache * add apt packages for latex * rename * enable latex rendering * remove doc building from Travis * store new doc version * Refactor API reference `gensim.topic_coherence`. Fix piskvorky#1669 (piskvorky#1714) * Refactored aggregation * Micro-Fix for aggregation.py, partially refactored direct_confirmation.py * Partially refactored indirect_confirmation_measure * Some additions * Math attempts * add math extension for sphinx * Minor refactoring * Some refactoring for probability_estimation * Beta-strings * Different additions * Minor changes * text_analysis left * Added example for ContextVectorComputer class * probability_estimation 0.9 * beta_version * Added some examples for text_analysis * text_analysis: corrected example for class UsesDictionary * Final additions for text_analysis.py * fix cross-reference problem * fix pep8 * fix aggregation * fix direct_confirmation_measure * fix types in direct_confirmation_measure * partial fix indirect_confirmation_measure * HotFix for probability_estimation and segmentation * Refactoring for probability_estimation * Changes for indirect_confirmation_measure * Fixed segmentation, partly fixed text_analysis * Add Notes for text_analysis * fix di/ind * fix doc examples in probability_estimation * fix probability_estimation * fix segmentation * fix docstring in probability_estimation * partial fix test_analysis * add latex stuff for docs build * doc fix[1] * doc fix[2] * remove apt install from travis (now doc build in circle) * Fix docstrings for `gensim.models.normmodel` (piskvorky#1805) * First edits * changed bow * Added examples * Final commit of the night * Still struggling with docs * Removed examples but still struggling with documentation * fix docstring * fix docstring[2] * Fix docstrings for `gensim.models.logentropy_model` (piskvorky#1803) * improve and correct documentation of models/logentropy_model * include fixes according to comments * implement fixes suggested * associate methods with examples. * fix minor typos * doc fix * Fix docstrings for `gensim.matutils` (piskvorky#1804) * numpy style documentation on matutils.py * doc fix[1] * doc fix[2] * doc fix[3] * doc fix[4] * doc fix[5] * doc fix[6] * Fix formula in `gensim.summarization.bm25`. Fix piskvorky#1828 (piskvorky#1833) * bm25 scoring function updated * Fixes piskvorky#1828 * Fixes piskvorky#1828 * Fixes piskvorky#1828 * Fixes piskvorky#1828 * Fixes piskvorky#1828 * Fixes piskvorky#1828 , Tests added * Fixes piskvorky#1828 , Tests added * Fixes piskvorky#1828 , Tests Added * Fixes piskvorky#1828 , Tests Added * Fixes piskvorky#1828 , Tests Added * Fixes piskvorky#1828 , Tests Added * Fixes piskvorky#1828 * Refactor tests for `gensim.corpora.WikiCorpus`(piskvorky#1821) * minor style refactoring and comment fixes in accordance to PEP8 * Created test data in legitimate compressed XML format (.xml.bz2) for the WikiCorpus class. * Used the same raw data found for other sources (9 articles). * Added Various wiki markup to test the parsing regural expressions * Added test class for the WikiCorpus source. * Following the same inheritance schema as in the source TestWikiCorpus > TestTextCorpus > CorpusTestCase. * Testing methods are overriden where necessary to reflect logic changes. * All existing functionality is tested (account for markup handling, minimum article length etc) * Fix python 3 compatibility for generator next method * code review corrections * Moved WikiCorpus tests from test/test_wikicorpus.py into its class within the test_corpora.py file. * Adapted all old tests to the new class * Current Test class schema ensures that WikiCorpus also passes tests defined in parents * Deleted test_wikicorpus.py since it is now redundant * Discarded the empty input test for the WikiCorpus since an empty file is not legitimate XML * Added 2 more tests * Fix parameter setting for `FastText.train`. Fix piskvorky#1818 (piskvorky#1837) * bm25 scoring function updated * Fixes piskvorky#1828 * Fixes piskvorky#1828 * Fixes piskvorky#1828 * Fixes piskvorky#1828 * Fixes piskvorky#1828 * Fixes piskvorky#1828 , Tests added * Fixes piskvorky#1828 , Tests added * Fixes piskvorky#1828 , Tests Added * Fixes piskvorky#1828 , Tests Added * Fixes piskvorky#1828 , Tests Added * Fixes piskvorky#1828 , Tests Added * Fixes piskvorky#1828 * Function Parameters corrected , Fixes piskvorky#1818 * add missing params + add supercall * Fix positional params used for `gensim.models.CoherenceModel` in `gensim.models.callbacks` (piskvorky#1823) * add keyword params for call to gensim.models.CoherenceModel as positional arguments for coherence and topn were incorrect due to skipping param for keyed_vectors * Fix PEP8
…into smart_open * 'develop' of https://github.com/RaRe-Technologies/gensim: Fix positional params used for `gensim.models.CoherenceModel` in `gensim.models.callbacks` (piskvorky#1823) Fix parameter setting for `FastText.train`. Fix piskvorky#1818 (piskvorky#1837) Refactor tests for `gensim.corpora.WikiCorpus`(piskvorky#1821) Fix formula in `gensim.summarization.bm25`. Fix piskvorky#1828 (piskvorky#1833) Fix docstrings for `gensim.matutils` (piskvorky#1804) Fix docstrings for `gensim.models.logentropy_model` (piskvorky#1803) Fix docstrings for `gensim.models.normmodel` (piskvorky#1805) Refactor API reference `gensim.topic_coherence`. Fix piskvorky#1669 (piskvorky#1714) Add CircleCI for build documentation. Fix piskvorky#1807 (piskvorky#1822) Fix docstrings for `gensim.models.translation_matrix` (piskvorky#1806) Fix docstrings for `gensim.models.rpmodel` (piskvorky#1802) Fix docstrings for `gensim.utils` (piskvorky#1797) Fix tox.ini/setup.cfg configuration (piskvorky#1815) Add wordnet mammal train file for Poincare notebook (piskvorky#1781)
@steremma - Thanks for the refactor. However, because of 574134e the behavior of passing an empty Dict when constructing a WikiCorpus has changed. Calling:
used to prevent a full pass over the corpus, returning instantly. Now it triggers The old behavior was useful when you only wanted to use I think it should be changed back for compatibility. |
Agreed, looks like an unintended change (bug). |
Thanks for pointing this out. I am currently out of base but I will check
it when I get the time (and hopefully fix)
…On Wed 9. May 2018 at 01:27, Radim Řehůřek ***@***.***> wrote:
Agreed, looks like an unintended change (bug).
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1821 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AI2m49NZbSCWylYYalz1COM6A5LhK2p7ks5twhvYgaJpZM4ROdLj>
.
|
Please check #2042, I believe it fixes the issue |
I don't have time to clone build and test right now... but it looks right to me. Thanks. |
Is this functionality common enough to get its own test? I would say it is, in which case it would be useful if you could provide the code that triggered the mistaken call to |
Sure. The unit test is easy.
spy on `get_texts`
construct a new WikiCorpus with `WikiCorpus(dumpfile, dictionary={})`
assert spy was not called
You could extend that to make sure the spy is called if `dictionary=None`,
though that may already be tested.
Sent while mobile. Please excuse brevity and typos.
On May 9, 2018 16:32, "Stergiadis Manos" <notifications@github.com> wrote:
Is this functionality common enough to get its own test? I would say it is,
in which case it would be useful if you could provide the code that
triggered the mistaken call to get_texts(). I can then wrap it into a unit
test
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#1821 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AEPgEQprjALyMOui8SbOalBEm4nNpPesks5tw1JYgaJpZM4ROdLj>
.
|
In this PR, I try to improve the test coverage of the corpora package, specifically by adding tests for the WikiCorpus class. This is necessary as I soon plan to submit additional non trivial functionality for the same class and some tests are needed to ensure that nothing is broken. I also performed minor style changes in accordance to PEP8.