In response to issue #19, we should add unit tests that run on large strings, as well as on a large corpus of strings. This should help us catch speed inefficiencies down the road.
I think this is a good idea. One problem is where to store the test dataset, as I would prefer not to ship a very large text file in the package that is only used for testing.
I would come up with a way to download the dataset only for running the tests and remove it afterwards; a sketch of that idea is below. @jcbrockschmidt Can you help benchmark and improve the current algorithm for long texts?
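For example, a session-scoped pytest fixture could fetch the corpus before the tests run and delete it once the session ends, so the file never has to live in the package. This is only a rough sketch: the URL is a placeholder, not a real download link, and the fixture name is made up.

```python
# Sketch: download the test corpus for the test session, delete it afterwards.
import os
import tempfile
import urllib.request

import pytest

DATASET_URL = "https://example.com/benchmark_corpus.txt"  # placeholder URL

@pytest.fixture(scope="session")
def benchmark_corpus():
    # Fetch the corpus into a temporary file.
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    urllib.request.urlretrieve(DATASET_URL, path)
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    yield lines
    # Clean up once the whole test session is done.
    os.remove(path)
```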
I am definitely willing to help. For the large unit tests, I was thinking it may be enough to include a dozen or so paragraphs (and their censored counterparts) and test each of them repeatedly, on the order of 10 or 100 times. As long as the set of repeated paragraphs includes an even mix of paragraphs with 1) no censored words, 2) some censored words, and 3) a lot of censored words, it should be enough to catch large slow-downs. This dataset shouldn't take up more than a few MBs.
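Something like the following is what I have in mind, assuming the package keeps exposing a `profanity.censor()` entry point as it does today. The two sample paragraphs, the expected outputs, and the time limit are all stand-ins; the real fixtures would hold the full dozen paragraphs with outputs taken from the library itself.

```python
# Sketch of the repeated-paragraph speed test described above.
import time
import unittest

from better_profanity import profanity

# Stand-in fixtures: (input text, expected censored text).
PARAGRAPHS = [
    ("A clean paragraph with no profanity at all.",
     "A clean paragraph with no profanity at all."),
    ("A paragraph with one shitty word.",
     "A paragraph with one **** word."),
]

REPEATS = 100
TIME_LIMIT_SECS = 5.0  # arbitrary ceiling; would be tuned against a baseline run

class TestLargeInput(unittest.TestCase):
    def test_repeated_paragraphs(self):
        start = time.monotonic()
        for _ in range(REPEATS):
            for text, expected in PARAGRAPHS:
                # Correctness check doubles as the timed workload.
                self.assertEqual(profanity.censor(text), expected)
        elapsed = time.monotonic() - start
        self.assertLess(elapsed, TIME_LIMIT_SECS)

if __name__ == "__main__":
    unittest.main()
```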
It may, however, be a good idea to have a more extensive benchmarking script, separate from these new unit tests. For that script, yes: I think downloading the dataset would be wiser. I have a rough benchmarking script already written. The biggest challenge will be finding a reliable download link for our dataset; the dataset I'm currently using is hosted on a lot of different websites of questionable reliability, so I'd need to track down its origin.
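For reference, my script boils down to something like this simplified stand-in (again assuming the `profanity.censor()` API; the corpus path is whatever the download step produced):

```python
# Sketch: time profanity.censor over a downloaded corpus and report throughput.
import sys
import time

from better_profanity import profanity

def benchmark(path):
    with open(path, encoding="utf-8") as f:
        texts = f.read().splitlines()
    total_chars = sum(len(t) for t in texts)
    start = time.perf_counter()
    for text in texts:
        profanity.censor(text)
    elapsed = time.perf_counter() - start
    print(f"{len(texts)} strings, {total_chars} chars in {elapsed:.2f}s "
          f"({total_chars / elapsed / 1e6:.2f} MB/s)")

if __name__ == "__main__":
    benchmark(sys.argv[1])
```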
This link might be reliable enough for the Amazon reviews dataset I was looking at. We probably want to throw some extra datasets into the mix, though, such as very long documents (e.g., short stories or books) with some profanity included.