This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

[nlp_data] Add BookCorpus #1406

Open
sxjscience opened this issue Oct 26, 2020 · 9 comments
Labels
enhancement New feature or request

Comments

@sxjscience
Member

Description

BookCorpus now has a reliable, stable download link: https://the-eye.eu/public/AI/pile_preliminary_components/books1.tar.gz. There are also more datasets under https://the-eye.eu/public/AI/pile_preliminary_components/ that are worth including in nlp_data. We may try to download from their links and provide the corresponding licenses.

sxjscience added the enhancement label on Oct 26, 2020
@szha
Member

szha commented Oct 26, 2020

The data source "smashwords" has terms of service that prohibit redistribution. Neither the links above nor soskek/bookcorpus#27 mention getting approval from smashwords or from the authors. We should clarify the legal risks before proceeding.

@shawwn

shawwn commented Oct 28, 2020

There is no legal risk in linking to the dataset. All risk is taken on by The Eye.

The sole reason not to merge it would be that someone doesn't like the idea of using the dataset. Which is fine. But anyone who says there is legal risk is mistaken.

@shawwn

shawwn commented Oct 28, 2020

(In other words, don't host the data yourself. Rely on the URL from The Eye. So, for example, all dataset preparation scripts should download from https://the-eye.eu/public/AI/pile_preliminary_components/books1.tar.gz, and books1.tar.gz itself should not be hosted anywhere else. By following this pattern, all risk is transferred to The Eye.)
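For concreteness, a minimal sketch of what a preparation script following this "link, don't host" pattern could look like. The directory layout and function name are illustrative only, not part of any existing GluonNLP script:

```python
import os
import tarfile
import urllib.request

BOOKS1_URL = "https://the-eye.eu/public/AI/pile_preliminary_components/books1.tar.gz"

def download_and_extract(save_dir="bookcorpus"):
    """Fetch books1.tar.gz straight from The Eye and unpack it locally."""
    os.makedirs(save_dir, exist_ok=True)
    archive_path = os.path.join(save_dir, "books1.tar.gz")
    if not os.path.exists(archive_path):
        # Stream from the upstream URL; the archive is never rehosted,
        # only fetched into the user's own working directory.
        urllib.request.urlretrieve(BOOKS1_URL, archive_path)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(save_dir)
```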

@szha
Member

szha commented Oct 28, 2020

> There is no legal risk linking to the dataset

In the US there is recognition of secondary infringement liability. One can be found liable for affirmative encouragement or inducement of known copyright violations.

@shawwn

shawwn commented Nov 1, 2020

The datasets are hosted by The Eye, which fully respects the DMCA: http://the-eye.eu/dmca

If anyone were to file a DMCA notice against books1 or books3, The Eye would extract the tarball, remove the infringing content, then re-upload the modified tarball.

There is no risk linking to The Eye.

@leezu
Contributor

leezu commented Nov 2, 2020

> re-upload the modified tarball

In GluonNLP we store a hash of the tarball in the source tree to ensure reproducibility. Linking to a source that may periodically change the contents of the file may not be optimal.
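A sketch of the concern, assuming a SHA-256 digest pinned in source (the digest placeholder and helper name below are illustrative): if The Eye ever re-uploads a modified tarball, a check like this would start failing for every user.

```python
import hashlib

# Placeholder: the real value would be pinned in source once computed.
EXPECTED_SHA256 = "<pinned digest of books1.tar.gz>"

def verify_tarball(path, expected=EXPECTED_SHA256):
    """Return True only if the local archive matches the digest pinned in source."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected
```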

@sxjscience
Member Author

We may try to add it first and later figure out whether we can host a snapshot of BookCorpus ourselves. What do you think?

@shawwn

shawwn commented Nov 17, 2020

Happy to announce that bookcorpus was just merged into huggingface's Datasets library as bookcorpusnew, thanks to @vblagoje: huggingface/datasets#856

So, huggingface is officially supporting this dataset now. The Eye also seems to be a trustworthy steward; I mentioned that "the tarball might change due to DMCA" as more of a theoretical concern than a practical reality. I doubt this tarball is going to change.
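For reference, loading the newly merged dataset through the datasets library would look roughly like the sketch below, assuming the id `bookcorpusnew` mentioned above (the exact id and record fields may differ in later releases of the library):

```python
from datasets import load_dataset

# Dataset id taken from the comment above; it may be named differently
# in current releases of the datasets library.
books = load_dataset("bookcorpusnew", split="train")
print(len(books), "records")
print(books[0])  # inspect the fields of the first record
```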

@sxjscience
Member Author

@shawwn Really appreciate the information! I've tried out huggingface/datasets and found that it's quite good. In fact, we can add the dataset even if the tarball changes; that is the same strategy we used for the Wikipedia corpus: https://github.com/dmlc/gluon-nlp/blob/master/scripts/datasets/pretrain_corpus/prepare_wikipedia.py. Part of the purpose of nlp_data is to help users download and prepare large pretraining corpora for trying out NLP pretraining.
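As a sketch of that strategy, a hypothetical prepare_bookcorpus.py entry point could mirror the CLI shape of prepare_wikipedia.py. The flag name and the `download_and_extract` helper are illustrative only (the helper is the one sketched earlier in this thread), not existing nlp_data options:

```python
import argparse

def main():
    parser = argparse.ArgumentParser(
        description="Download and prepare BookCorpus for pretraining.")
    parser.add_argument("--save-dir", default="bookcorpus",
                        help="directory for the downloaded and extracted corpus")
    args = parser.parse_args()
    # download_and_extract is the helper sketched earlier in this thread.
    download_and_extract(args.save_dir)

if __name__ == "__main__":
    main()
```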
