This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

[nlp_data] Add BookCorpus #1406

Open
sxjscience opened this issue Oct 26, 2020 · 9 comments
Labels
enhancement New feature or request

Comments

@sxjscience
Member

Description

BookCorpus now has a reliable, stable download link: https://the-eye.eu/public/AI/pile_preliminary_components/books1.tar.gz. There are also more datasets under https://the-eye.eu/public/AI/pile_preliminary_components/ that are worth including in nlp_data. We may try to download from their links and provide the corresponding licenses.

sxjscience added the enhancement label on Oct 26, 2020
@szha
Member

szha commented Oct 26, 2020

The data source "smashwords" has terms of service that prohibit redistribution. Neither the links above nor soskek/bookcorpus#27 mention getting approval from smashwords or from the authors. We should clarify the legal risks before proceeding.

@shawwn

shawwn commented Oct 28, 2020

There is no legal risk in linking to the dataset. All risk is taken on by The Eye.

The sole reason not to merge it would be that someone doesn't like the idea of using the dataset. Which is fine. But anyone who says there is legal risk is mistaken.

@shawwn

shawwn commented Oct 28, 2020

(In other words, don't host the data yourself. Rely on the URL from The Eye. So, for example, all dataset preparation scripts should download from https://the-eye.eu/public/AI/pile_preliminary_components/books1.tar.gz, and books1.tar.gz itself should not be hosted anywhere else. By following this pattern, all risk is transferred to The Eye.)
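For concreteness, a minimal sketch of what a preparation script following this "link, don't host" pattern could look like. The directory layout and function name are illustrative only, not part of any existing GluonNLP script:

```python
import os
import tarfile
import urllib.request

BOOKS1_URL = "https://the-eye.eu/public/AI/pile_preliminary_components/books1.tar.gz"

def download_and_extract(save_dir="bookcorpus"):
    """Fetch books1.tar.gz straight from The Eye and unpack it locally."""
    os.makedirs(save_dir, exist_ok=True)
    archive_path = os.path.join(save_dir, "books1.tar.gz")
    if not os.path.exists(archive_path):
        # Stream from the upstream URL; the archive is never rehosted,
        # only fetched into the user's own working directory.
        urllib.request.urlretrieve(BOOKS1_URL, archive_path)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(save_dir)
```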

@szha
Member

szha commented Oct 28, 2020

> There is no legal risk linking to the dataset

In the US there is recognition of secondary infringement liability. One can be found liable for affirmative encouragement or inducement of known copyright violations.

@shawwn

shawwn commented Nov 1, 2020

The datasets are hosted by The Eye, which fully respects the DMCA: http://the-eye.eu/dmca

If anyone were to file a DMCA notice against books1 or books3, The Eye would extract the tarball, remove the infringing content, then re-upload the modified tarball.

There is no risk linking to The Eye.

@leezu
Contributor

leezu commented Nov 2, 2020

> re-upload the modified tarball

In GluonNLP we store a hash of the tarball in the source tree to ensure reproducibility. Linking to a source that may periodically change the contents of the file may not be optimal.
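A sketch of the concern, assuming a SHA-256 digest pinned in source (the digest placeholder and helper name below are illustrative): if The Eye ever re-uploads a modified tarball, a check like this would start failing for every user.

```python
import hashlib

# Placeholder: the real value would be pinned in source once computed.
EXPECTED_SHA256 = "<pinned digest of books1.tar.gz>"

def verify_tarball(path, expected=EXPECTED_SHA256):
    """Return True only if the local archive matches the digest pinned in source."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected
```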

@sxjscience
Member Author

We may try to add it first and later figure out whether we can host a snapshot of BookCorpus ourselves. What do you think?

@shawwn

shawwn commented Nov 17, 2020

Happy to announce that bookcorpus was just merged into huggingface's Datasets library as bookcorpusnew, thanks to @vblagoje: huggingface/datasets#856

So, huggingface is officially supporting this dataset now. The Eye also seems to be a trustworthy steward; I mentioned that "the tarball might change due to DMCA" as more of a theoretical concern than a practical reality. I doubt this tarball is going to change.
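For reference, loading the newly merged dataset through the datasets library would look roughly like the sketch below, assuming the id `bookcorpusnew` mentioned above (the exact id and record fields may differ in later releases of the library):

```python
from datasets import load_dataset

# Dataset id taken from the comment above; it may be named differently
# in current releases of the datasets library.
books = load_dataset("bookcorpusnew", split="train")
print(len(books), "records")
print(books[0])  # inspect the fields of the first record
```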

@sxjscience
Member Author

@shawwn Really appreciate the information! I've tried out huggingface/datasets and found that it's quite good. In fact, we can add the dataset even if the tarball changes; that is the same strategy we used for the Wikipedia corpus: https://github.com/dmlc/gluon-nlp/blob/master/scripts/datasets/pretrain_corpus/prepare_wikipedia.py. Part of the purpose of nlp_data is to help users download and prepare large pretraining corpora for trying out NLP pretraining.
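As a sketch of that strategy, a hypothetical prepare_bookcorpus.py entry point could mirror the CLI shape of prepare_wikipedia.py. The flag name and the `download_and_extract` helper are illustrative only (the helper is the one sketched earlier in this thread), not existing nlp_data options:

```python
import argparse

def main():
    parser = argparse.ArgumentParser(
        description="Download and prepare BookCorpus for pretraining.")
    parser.add_argument("--save-dir", default="bookcorpus",
                        help="directory for the downloaded and extracted corpus")
    args = parser.parse_args()
    # download_and_extract is the helper sketched earlier in this thread.
    download_and_extract(args.save_dir)

if __name__ == "__main__":
    main()
```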
