Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test and refactor WikiCorpus #1821

Merged
merged 9 commits into from
Jan 11, 2018

Conversation

steremma
Copy link
Contributor

In this PR, I try to improve the test coverage of the corpora package, specifically by adding tests for the WikiCorpus class. This is necessary as I soon plan to submit additional non trivial functionality for the same class and some tests are needed to ensure that nothing is broken. I also performed minor style changes in accordance to PEP8.

…the WikiCorpus class.

* Used the same raw data found for other sources (9 articles).

* Added Various wiki markup to test the parsing regural expressions
* Following the same inheritance schema as in the source TestWikiCorpus > TestTextCorpus > CorpusTestCase.

* Testing methods are overriden where necessary to reflect logic changes.

* All existing functionality is tested (account for markup handling, minimum article length etc)
Copy link
Contributor

@menshikh-iv menshikh-iv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good start @steremma, please continue your work 👍

@@ -400,6 +402,73 @@ def test_indexing(self):
pass


class TestWikiCorpus(TestTextCorpus):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of your tests "stuck" in Travis, please check, what's a problem and fix https://travis-ci.org/RaRe-Technologies/gensim/jobs/322641221#L828

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, question: why you need TextTextCorpus here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am inheriting so that we make sure the WikiCorpus class passes not only its own tests but also the one's defined for the TextCorpus class (since WikiCorpus inherits from TextCorpus). Else we can inherit from CorpusTestCase or even from unittest.Testcase

self.assertEqual(len(docs), 9)

def test_empty_input(self):
tmpf = get_tmpfile('emptycorpus.xml.bz2')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better use with temporary_file('emptycorpus.xml.bz2'): ... from gensim.test.utils here (because we have auto-cleanup in this function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to follow the same style as in the base test classes but you are right, context managers are safer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a lot of "dirty" code in the codebase, cleanup slow but go forward.

def test_empty_input(self):
tmpf = get_tmpfile('emptycorpus.xml.bz2')
content = bz2.compress(b'') # Explicit string to byte conversion needed in python 3
fh = open(tmpf, "wb")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of open / close please use with open(...)

all_articles = corpus.get_texts()
assert (len(list(all_articles)) == expected_articles)

test_with_limit(0, 9)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to make it in "cycle" (instead of defining test in test)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I will change that

…thin the test_corpora.py file.

* Adapted all old tests to the new class

* Current Test class schema ensures that WikiCorpus also passes tests defined in parents

* Deleted test_wikicorpus.py since it is now redundant
@menshikh-iv
Copy link
Contributor

Ping @steremma, how is going with "stuck" bug? Are you already launched tests with a debugger?

@steremma
Copy link
Contributor Author

steremma commented Jan 9, 2018

Hello Ivan. I’m getting back from vacation tomorrow and I will keep working on it. I will let you know on gitter if the problem persists

wc = WikiCorpus(datapath(FILENAME), processes=1, token_max_len=16, lemmatize=False)
self.assertTrue(u'collectivization' in next(wc.get_texts()))

def test_custom_tokenizer(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test completely missing, please restore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding it now

# self.assertEqual(type(first), list)
# self.assertTrue(isinstance(first[0], bytes) or isinstance(first[0], str))

def test_first_element(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test missing too, please restore

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right adding it now

corpus = self.corpus_class(self.enwiki, processes=1, token_max_len=16, lemmatize=False)
self.assertTrue(u'collectivization' in next(corpus.get_texts()))

# TODO: sporadic failure to be investigated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's a problem here?

Copy link
Contributor Author

@steremma steremma Jan 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one exists commented on test_wikicorpus, I don't know why. So I decided to keep it in case we want to look into it in the future. The comment says that the test some times fails unpredictably and indeed it fails on my machine. Do we keep it as comment or completely delete it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least - leave in place.
Best variant - investigate & fix.

# self.assertEqual(type(first), list)
# self.assertTrue(isinstance(first[0], bytes) or isinstance(first[0], str))

def test_sample_text(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably better to skip this test (not silently pass)., what do you think @steremma?

Copy link
Contributor Author

@steremma steremma Jan 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that this test overrides the one defined in TestTextCorpus. If we just skip it the parent definition will be called and it will fail because we plain text is not legit XML (nor is it compressed). In that sense passing it silently practically ignores it. The same idea is followed in the tests:

    def test_save(self):
        pass

    def test_serialize(self):
        pass

    def test_serialize_compressed(self):
        pass

    def test_indexing(self):
        pass

of TestTextCorpus for these 4 tests defined in the parent class CorpusTestCase. So I think its better to keep it as it is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Best variant - change class hierarchy and interfaces, we should'nt give "useless" methods from parent class (change from TestWikiCorpus -> TestTextCorpus to
TestTextCorpus -> BaseTestTextCorpus <- TestWikiCorpus), but right now this isn't really needed, stay current variant with "pass".

@menshikh-iv
Copy link
Contributor

@steremma LGTM, tell me, should I merge know, or you'll try to investigate a problem with #1821 (comment) first and merge after the fix?

@steremma
Copy link
Contributor Author

I propose that we merge it now for 2 reasons:

  1. I think the faulty test is not really Pythonic since it tries to enforce a type using isinstance (at least thats how I interpret "ask forgiveness not permission"). Since the behavior is correct (all tests pass) I don't think we should care about the exact types.

  2. I would prefer to work on a new feature for now (interlinks) since the tests are at a good point. If in the future we really want to enforce type checks we can come back to it. For this reason lets keep it commented like you said.

@menshikh-iv
Copy link
Contributor

Great start @steremma, congratz with first contribution 🥇

@menshikh-iv menshikh-iv merged commit 1648ac2 into piskvorky:develop Jan 11, 2018
@steremma
Copy link
Contributor Author

Thanks a lot, it wouldn't be possible without your guidance. Looking forward to contribute more :)

@steremma steremma deleted the test_refactor_wiki branch January 13, 2018 10:38
sharanry added a commit to sharanry/gensim that referenced this pull request Jan 15, 2018
* Add wordnet mammal train file for Poincare notebook (piskvorky#1781)

* Adds wordnet mammal train file

* Adds link to data file in notebook

* Fix tox.ini/setup.cfg configuration (piskvorky#1815)

* update according to new pytest_benchmark version

* update wheel-storage url

* use only twine

* Fix docstrings for `gensim.utils` (piskvorky#1797)

* Add docstrings in numpy-style fromat

* fix PEP8

* remove outdated "hack" (smart_open is core dependency right now)

* fix docstrings[1]

* remove unused internal class

* fix docstrings[2]

* fix docstrings[3]

* fix docstrings[4]

* fix docstrings[5]

* fix docstrings[6]

* fix docstrings[7]

* fix docstrings[8]

* add missing `pattern` to doc dependencies

* fix docstrings[9]

* fix docstrings[10]

* Fix docstrings for `gensim.models.rpmodel` (piskvorky#1802)

* first attempt to convert few lines into numpy-style doc

* added parameters in documentation

* more documentation

* few corrections

* show inheritance and undoc members

* show special members

* example is executable now

* link to the paper added, named parameters

* fixed doc

* fixed doc

* fixed whitespaces

* fix docstrings & PEP8

* fix docstrings

* fix typo

* Fix docstrings for `gensim.models.translation_matrix` (piskvorky#1806)

* convert Space class doc to numpy style

* fix docstrings[1]

* fix docstrings[2]

* remove useless load

* fix docstrings[3]

* add missing import

* fix docstrings[4]

* Add CircleCI for build documentation. Fix piskvorky#1807 (piskvorky#1822)

* init config for circle

* change

* rm cache, install tox distinctly

* fix indentation & command

* update venv

* add pip-cache

* add apt packages for latex

* rename

* enable latex rendering

* remove doc building from Travis

* store new doc version

* Refactor API reference `gensim.topic_coherence`. Fix piskvorky#1669 (piskvorky#1714)

* Refactored aggregation

* Micro-Fix for aggregation.py, partially refactored direct_confirmation.py

* Partially refactored indirect_confirmation_measure

* Some additions

* Math attempts

* add math extension for sphinx

* Minor refactoring

* Some refactoring for probability_estimation

* Beta-strings

* Different additions

* Minor changes

* text_analysis left

* Added example for ContextVectorComputer class

* probability_estimation 0.9

* beta_version

* Added some examples for text_analysis

* text_analysis: corrected example for class UsesDictionary

* Final additions for text_analysis.py

* fix cross-reference problem

* fix pep8

* fix aggregation

* fix direct_confirmation_measure

* fix types in direct_confirmation_measure

* partial fix indirect_confirmation_measure

* HotFix for probability_estimation and segmentation

* Refactoring for probability_estimation

* Changes for indirect_confirmation_measure

* Fixed segmentation, partly fixed text_analysis

* Add Notes for text_analysis

* fix di/ind

* fix doc examples in probability_estimation

* fix probability_estimation

* fix segmentation

* fix docstring in probability_estimation

* partial fix test_analysis

* add latex stuff for docs build

* doc fix[1]

* doc fix[2]

* remove apt install from travis (now doc build in circle)

* Fix docstrings for `gensim.models.normmodel` (piskvorky#1805)

* First edits

* changed bow

* Added examples

* Final commit of the night

* Still struggling with docs

* Removed examples but still struggling with documentation

* fix docstring

* fix docstring[2]

* Fix docstrings for `gensim.models.logentropy_model` (piskvorky#1803)

* improve and correct documentation of models/logentropy_model

* include fixes according to comments

* implement fixes suggested

* associate methods with examples.

* fix minor typos

* doc fix

* Fix docstrings for `gensim.matutils` (piskvorky#1804)

* numpy style documentation on matutils.py

* doc fix[1]

* doc fix[2]

* doc fix[3]

* doc fix[4]

* doc fix[5]

* doc fix[6]

* Fix formula in `gensim.summarization.bm25`. Fix piskvorky#1828 (piskvorky#1833)

* bm25 scoring function updated

* Fixes piskvorky#1828

* Fixes piskvorky#1828

* Fixes piskvorky#1828

* Fixes piskvorky#1828

* Fixes piskvorky#1828

* Fixes piskvorky#1828 , Tests added

* Fixes piskvorky#1828 , Tests added

* Fixes piskvorky#1828 , Tests Added

* Fixes piskvorky#1828 , Tests Added

* Fixes piskvorky#1828 , Tests Added

* Fixes piskvorky#1828 , Tests Added

* Fixes piskvorky#1828

* Refactor tests for `gensim.corpora.WikiCorpus`(piskvorky#1821)

* minor style refactoring and comment fixes in accordance to PEP8

* Created test data in legitimate compressed XML format (.xml.bz2) for the WikiCorpus class.

* Used the same raw data found for other sources (9 articles).

* Added Various wiki markup to test the parsing regural expressions

* Added test class for the WikiCorpus source.

* Following the same inheritance schema as in the source TestWikiCorpus > TestTextCorpus > CorpusTestCase.

* Testing methods are overriden where necessary to reflect logic changes.

* All existing functionality is tested (account for markup handling, minimum article length etc)

* Fix python 3 compatibility for generator next method

* code review corrections

* Moved WikiCorpus tests from test/test_wikicorpus.py into its class within the test_corpora.py file.

* Adapted all old tests to the new class

* Current Test class schema ensures that WikiCorpus also passes tests defined in parents

* Deleted test_wikicorpus.py since it is now redundant

* Discarded the empty input test for the WikiCorpus since an empty file is not legitimate XML

* Added 2 more tests

* Fix parameter setting for `FastText.train`. Fix piskvorky#1818 (piskvorky#1837)

* bm25 scoring function updated

* Fixes piskvorky#1828

* Fixes piskvorky#1828

* Fixes piskvorky#1828

* Fixes piskvorky#1828

* Fixes piskvorky#1828

* Fixes piskvorky#1828 , Tests added

* Fixes piskvorky#1828 , Tests added

* Fixes piskvorky#1828 , Tests Added

* Fixes piskvorky#1828 , Tests Added

* Fixes piskvorky#1828 , Tests Added

* Fixes piskvorky#1828 , Tests Added

* Fixes piskvorky#1828

* Function Parameters corrected , Fixes piskvorky#1818

* add missing params + add supercall

* Fix positional params used for `gensim.models.CoherenceModel` in `gensim.models.callbacks` (piskvorky#1823)

* add keyword params for call to gensim.models.CoherenceModel as positional arguments for coherence and topn were incorrect due to skipping param for keyed_vectors

* Fix PEP8
sharanry added a commit to sharanry/gensim that referenced this pull request Jan 15, 2018
…into smart_open

* 'develop' of https://github.com/RaRe-Technologies/gensim:
  Fix positional params used for `gensim.models.CoherenceModel` in `gensim.models.callbacks` (piskvorky#1823)
  Fix parameter setting for `FastText.train`. Fix piskvorky#1818 (piskvorky#1837)
  Refactor tests for `gensim.corpora.WikiCorpus`(piskvorky#1821)
  Fix formula in `gensim.summarization.bm25`. Fix piskvorky#1828 (piskvorky#1833)
  Fix docstrings for `gensim.matutils` (piskvorky#1804)
  Fix docstrings for `gensim.models.logentropy_model` (piskvorky#1803)
  Fix docstrings for `gensim.models.normmodel` (piskvorky#1805)
  Refactor API reference `gensim.topic_coherence`. Fix piskvorky#1669 (piskvorky#1714)
  Add CircleCI for build documentation. Fix piskvorky#1807 (piskvorky#1822)
  Fix docstrings for `gensim.models.translation_matrix` (piskvorky#1806)
  Fix docstrings for `gensim.models.rpmodel` (piskvorky#1802)
  Fix docstrings for `gensim.utils` (piskvorky#1797)
  Fix tox.ini/setup.cfg configuration (piskvorky#1815)
  Add wordnet mammal train file for Poincare notebook (piskvorky#1781)
@mebrunet
Copy link

mebrunet commented May 8, 2018

@steremma - Thanks for the refactor. However, because of 574134e the behavior of passing an empty Dict when constructing a WikiCorpus has changed. Calling:

wiki = WikiCorpus(wikidumpfile, dictionary={})

used to prevent a full pass over the corpus, returning instantly. Now it triggers Dictionary(self.get_texts()).

The old behavior was useful when you only wanted to use get_texts without building a Dictionary, e.g. for training Word2Vec. There are also many tutorials online that have code snippets that assume the old behavior.

I think it should be changed back for compatibility.

@piskvorky
Copy link
Owner

Agreed, looks like an unintended change (bug).

@steremma
Copy link
Contributor Author

steremma commented May 9, 2018 via email

@steremma
Copy link
Contributor Author

steremma commented May 9, 2018

Please check #2042, I believe it fixes the issue

@mebrunet
Copy link

mebrunet commented May 9, 2018

I don't have time to clone build and test right now... but it looks right to me. Thanks.

@steremma
Copy link
Contributor Author

steremma commented May 9, 2018

Is this functionality common enough to get its own test? I would say it is, in which case it would be useful if you could provide the code that triggered the mistaken call to get_texts(). I can then wrap it into a unit test

@mebrunet
Copy link

mebrunet commented May 9, 2018 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants