In response to issue #19, we should add unit tests that run on large strings, as well as on a large corpus of strings. This should help us catch speed inefficiencies down the road.
I think this is a good idea. One problem is where to store the test dataset, as I would prefer not to ship a very large text file in the package that is only used for testing.
I would come up with a way to download the dataset only for running the tests and remove it afterwards; a sketch of that idea is below. @jcbrockschmidt Can you help benchmark and improve the current algorithm for long texts?
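For example, a session-scoped pytest fixture could fetch the corpus before the tests run and delete it once the session ends, so the file never has to live in the package. This is only a rough sketch: the URL is a placeholder, not a real download link, and the fixture name is made up.

```python
# Sketch: download the test corpus for the test session, delete it afterwards.
import os
import tempfile
import urllib.request

import pytest

DATASET_URL = "https://example.com/benchmark_corpus.txt"  # placeholder URL

@pytest.fixture(scope="session")
def benchmark_corpus():
    # Fetch the corpus into a temporary file.
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    urllib.request.urlretrieve(DATASET_URL, path)
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    yield lines
    # Clean up once the whole test session is done.
    os.remove(path)
```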
I am definitely willing to help. For the large unit tests, I was thinking it may be enough to include a dozen or so paragraphs (and their censored counterparts) and test each of them repeatedly, on the order of 10 or 100 times. As long as the set of repeated paragraphs includes an even mix of paragraphs with 1) no censored words, 2) some censored words, and 3) a lot of censored words, it should be enough to catch large slow-downs. This dataset shouldn't take up more than a few MBs.
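Something like the following is what I have in mind, assuming the package keeps exposing a `profanity.censor()` entry point as it does today. The two sample paragraphs, the expected outputs, and the time limit are all stand-ins; the real fixtures would hold the full dozen paragraphs with outputs taken from the library itself.

```python
# Sketch of the repeated-paragraph speed test described above.
import time
import unittest

from better_profanity import profanity

# Stand-in fixtures: (input text, expected censored text).
PARAGRAPHS = [
    ("A clean paragraph with no profanity at all.",
     "A clean paragraph with no profanity at all."),
    ("A paragraph with one shitty word.",
     "A paragraph with one **** word."),
]

REPEATS = 100
TIME_LIMIT_SECS = 5.0  # arbitrary ceiling; would be tuned against a baseline run

class TestLargeInput(unittest.TestCase):
    def test_repeated_paragraphs(self):
        start = time.monotonic()
        for _ in range(REPEATS):
            for text, expected in PARAGRAPHS:
                # Correctness check doubles as the timed workload.
                self.assertEqual(profanity.censor(text), expected)
        elapsed = time.monotonic() - start
        self.assertLess(elapsed, TIME_LIMIT_SECS)

if __name__ == "__main__":
    unittest.main()
```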
It may, however, be a good idea to have a more extensive benchmarking script, separate from these new unit tests. For that script, yes: I think downloading the dataset would be wiser. I have a rough benchmarking script already written. The biggest challenge will be finding a reliable download link for our dataset; the dataset I'm currently using is hosted on a lot of different websites of questionable reliability, so I'd need to track down its origin.
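For reference, my script boils down to something like this simplified stand-in (again assuming the `profanity.censor()` API; the corpus path is whatever the download step produced):

```python
# Sketch: time profanity.censor over a downloaded corpus and report throughput.
import sys
import time

from better_profanity import profanity

def benchmark(path):
    with open(path, encoding="utf-8") as f:
        texts = f.read().splitlines()
    total_chars = sum(len(t) for t in texts)
    start = time.perf_counter()
    for text in texts:
        profanity.censor(text)
    elapsed = time.perf_counter() - start
    print(f"{len(texts)} strings, {total_chars} chars in {elapsed:.2f}s "
          f"({total_chars / elapsed / 1e6:.2f} MB/s)")

if __name__ == "__main__":
    benchmark(sys.argv[1])
```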
This link might be reliable enough for the Amazon reviews dataset I was looking at. We probably want to throw some extra datasets into the mix, though, such as very long documents (e.g., short stories or books) with some profanity included.