provenance, copyright holders and licensing of gensim/test/test_data/? #3324

Open
pabs3 opened this issue Apr 13, 2022 · 4 comments
Labels
housekeeping internal tasks and processes

Comments

@pabs3
Contributor

pabs3 commented Apr 13, 2022

On behalf of my employer, I have packaged gensim for Debian:

https://tracker.debian.org/pkg/gensim

In the process of auditing the gensim git repository for inclusion in Debian I noticed by using web search engines that some of the files in the gensim/test/test_data/ directory seem to have been copied from the user comments on various websites such as IMDB. Presumably these comments were not owned by RaRe Technologies (or other gensim contributors) and were not licensed under the LGPL like the rest of gensim.

Other files seemed to indicate they were copied from Wikipedia, which definitely isn't LGPL. Others seemed to be statistics computed from some data and others seem to be generated files.

So I then wondered about all the files in the test data directory: where they came from, who owns them, and what license they are under. Since many of them are binary files, I also wondered how they were generated, what data they were generated from, what tools they were generated with, and what the copyright and licensing of those tools is.

Without any answers to these questions I wasn't confident that I could get gensim into Debian quickly, so consequently I removed this directory from the Debian source package and added some patches.

I don't know if it will be feasible to reconcile this difference between the gensim git repository and the Debian source package, but I wanted to bring this to your attention and start a discussion about it.

It was mentioned in another issue that gensim tests in some cases generate files at test time instead of relying on pre-generated binary files. Perhaps some of the other tests could be changed to do that too.

For the cases where data is needed at test time, perhaps each data set could be in a separate directory and have a README alongside it detailing the provenance, copyright holders and licensing of each data set.
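
A per-dataset README convention like this could be checked automatically. The following is a minimal sketch, assuming a hypothetical layout in which each data set lives in its own subdirectory of the test data directory; the function name and the accepted README file names are illustrative assumptions, not anything that exists in gensim today.

```python
from pathlib import Path

# Hypothetical set of accepted README file names (an assumption for this sketch).
README_NAMES = ("README", "README.md", "README.txt")


def datasets_missing_readme(data_root: Path) -> list:
    """Return the names of dataset subdirectories that contain no README file."""
    missing = []
    for subdir in sorted(p for p in data_root.iterdir() if p.is_dir()):
        # A dataset passes if any of the accepted README names is present.
        if not any((subdir / name).is_file() for name in README_NAMES):
            missing.append(subdir.name)
    return missing
```

Run in CI, a check like this would flag any newly added data set whose provenance and licensing have not been written down.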

Some of the test data might no longer be needed and thus could be removed.

@piskvorky
Owner

piskvorky commented Apr 13, 2022

Yes, the test data could use a clean up. There are open tickets around that such as #2967. But honestly low priority, so I have no idea when we'll get to it.

I have no capacity to hunt for licenses of the IMDB dataset (and others) unfortunately. IIRC they come from academic papers. If that's an issue for your task / employer, I'd suggest omitting them from your distribution. I don't think any of those files are necessary for Gensim to work. If I'm not mistaken, they are only there for CI testing + some of the tutorials (@mpenkov @gojomo CC).

@piskvorky piskvorky added the housekeeping internal tasks and processes label Apr 13, 2022
@gojomo
Collaborator

gojomo commented Apr 13, 2022

I definitely think the directory deserves a clean-up, given the cruft that's accumulated, & think some largely-automated approach would be best, roughly:

  • long before any urgent release, run all tests from some volume or test-harness that detects all file accesses; mark those as 'preserved'
  • grep source code for patterns in doc-comments/etc that indicate test-data accesses, extract any so-named-files, & ensure those are marked as 'preserved'
  • delete all the files not preselected as 'preserved'
  • over the next few months, across test releases or minor point releases, see whether anyone complains, & if so, consider re-adding any related files
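
The 'preserved' bookkeeping in the steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the set of files accessed during the test run is taken as an input (however the test harness collects it), the source scan is a plain substring match on file names rather than a real grep of doc-comments, and both function names are hypothetical. It reports removal candidates and deletes nothing.

```python
from pathlib import Path


def referenced_names(source_root: Path, data_names: set) -> set:
    """Return the test-data file names that appear verbatim in any .py file."""
    found = set()
    for py_file in source_root.rglob("*.py"):
        text = py_file.read_text(encoding="utf-8", errors="ignore")
        # Substring match: may over-preserve (e.g. 'a.txt' matches 'ba.txt'),
        # which is the safe direction for a cleanup.
        found.update(name for name in data_names if name in text)
    return found


def removal_candidates(data_dir: Path, source_root: Path, accessed: set) -> list:
    """List files in data_dir neither accessed during tests nor named in source."""
    data_names = {p.name for p in data_dir.iterdir() if p.is_file()}
    preserved = (accessed & data_names) | referenced_names(source_root, data_names)
    return sorted(data_names - preserved)
```

Anything the function returns would still deserve a manual look before deletion, since files can be referenced through constructed paths that a name search misses.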

Generally, my default assumption is that whoever added data to this directory, at the time, believed there to be no copyright barriers to its inclusion & its use in this way. But, I couldn't assure that for any files I didn't personally add, as there's been no rigorous review.

As such data isn't quite 'source code', nor does it include any in-file, or near-file, claim of authorship or copyright, I don't believe there is any presumption or implied assertion that such files are themselves licensed under the LGPL. They're just riding along in an unspecified licensing state that's unlikely to rise to any level of liability/concern.

The data that appears to come via IMDB – 10 lines in the alldata-id-10.txt file – seems a tiny excerpt from a 50,000 review dataset that canonically originates from https://ai.stanford.edu/~amaas/data/sentiment/ but is widely mirrored elsewhere (Kaggle, Google TensorFlow, HuggingFace, etc). Both the manner in which it was freely offered (without formal copyright or licensing declarations) for academic/research purposes, & the community practice of widespread mirroring/use, make me believe any relevant rightsholders approve. But even if they objected, 'fair use' standards that are strong in the US, with some analogues elsewhere, would suggest a use of this scale/purpose sidesteps copyright concerns.

A similar analysis applies to the simlex999.txt file.

The only data I notice that appears to have possibly originated at Wikipedia are some brief article excerpts in the 11-year-old files para2para_text1.txt & para2para_text2.txt. I'm not sure these are in use anymore - a GitHub search for [para2para_text1] shows no references in current code. If we did want to include Wikipedia excerpts and be fastidiously compliant, it might be enough to add a small para2para_texts.readme note alongside them, to the effect "These texts are excerpts from a contemporaneous Wikipedia(link) dump, and thus remain derivative works under Wikipedia's license(link)."

@piskvorky
Owner

piskvorky commented Apr 13, 2022

Thanks for the investigation @gojomo. That matches what I remember – a non-issue except for highly theoretical what-if scenarios. Which, while valid, are zero priority for me right now.

But if anyone wants to take this up, I'm willing to offer a review :)

@Pabs how badly does your employer need this resolved?

@pabs3
Contributor Author

pabs3 commented Apr 13, 2022 via email

pabs3 added a commit to pabs3/gensim that referenced this issue Mar 13, 2023
pabs3 added a commit to pabs3/gensim that referenced this issue Mar 13, 2023
pabs3 added a commit to pabs3/gensim that referenced this issue Apr 29, 2023
pabs3 added a commit to pabs3/gensim that referenced this issue May 26, 2023