provenance, copyright holders and licensing of gensim/test/test_data/? #3324
Comments
Yes, the test data could use a clean-up. There are open tickets around that, such as #2967. But honestly it's low priority, so I have no idea when we'll get to it. I have no capacity to hunt for licenses of the IMDB dataset (and others), unfortunately. IIRC they come from academic papers. If that's an issue for your task / employer, I'd suggest omitting them from your distribution. I don't think any of those files are necessary for Gensim to work. If I'm not mistaken, they are only there for CI testing + some of the tutorials (@mpenkov @gojomo CC).
I definitely think the directory deserves a clean-up, given the cruft that's accumulated, & think some largely-automated approach would be best, roughly:
Generally, my default assumption is that whoever added data to this directory, at the time, believed there to be no copyright barriers to its inclusion & its use in this way. But, I couldn't assure that for any files I didn't personally add, as there's been no rigorous review. As such data isn't quite 'source code', nor does it include any in-file, or near-file, claim of authorship or copyright, I don't believe there is any presumption or implied assertion that such files are themselves licensed under the LGPL. They're just riding along in an unspecified licensing state that's unlikely to rise to any level of liability/concern.
The data that appears to come via IMDB – 10 lines in the
A similar analysis applies to the
The only data I notice that appears to have possibly originated at Wikipedia are some brief article excerpts in the 11yo files
Thanks for the investigation @gojomo. That matches what I remember – a non-issue except for highly theoretical what-if scenarios. Which, while valid, are zero priority for me right now. But if anyone wants to take this up, I'm willing to offer a review :) @Pabs how badly does your employer need this resolved?
I agree for now there isn't really anything to be done with this issue,
but thanks for the followups, some further thoughts below.
Agreed that none of these files are likely to have any liability
concern, but my main concern here is that their license isn't
compatible with the Debian Free Software Guidelines or worse, that
they aren't redistributable at all unless redistributing without a
license and then relying on fair use to avoid liability.
For Debian the default assumption for files that have no clear
licensing attached is that they were either created and owned by the
project and are under the same license as the rest of the project, or
if there are indicators of originating elsewhere then they are probably
All Rights Reserved when they were gathered. Especially in the case of
machine learning datasets, where it seems they are usually pulled from
websites without consulting with or having a license from the end users
of the websites who added the data and are thus presumably the
copyright holders. Often the ToS of the website (which most users do
not read or really consent to) will have a clause about the website
retaining a license to redistribute, but that doesn't necessarily apply
to researchers and doesn't necessarily apply to redistributors
downstream from the researchers.
Unfortunately, Debian and probably other redistributors cannot rely on
the "fair use" concept, because it is not universal world-wide. For
example, here in Australia we have "fair dealing" instead, which is
much more restrictive and doesn't allow the sort of use that is being
suggested to be fair use. The fair use concept also probably does not
deliver all of the freedoms required under the various definitions of
libre software: the Free Software Definition, the Open Source
Definition and the Debian Free Software Guidelines (which the OSD was
based on). For example, IIRC the "commercialness" of a particular use
factors into the tests for determining if fair use applies. It is also
not a license, just a defence against infringement to be used in court.
These other files definitely look like Wikipedia extracts. They are all
compressed and UTF-16, which is probably why GitHub can't find them.
bgwiki-latest-pages-articles-shortened.xml.bz2
enwiki-latest-pages-articles1.xml-p000000010p000030302-shortened.bz2
enwiki-table-markup.xml.bz2
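
For anyone wanting to repeat this kind of check on compressed files that a plain-text code search will not match, here is a minimal sketch using only the Python standard library. The function name `bz2_contains` and the list of encodings tried are assumptions made for illustration, not anything from gensim or Debian tooling, and it reads the whole file into memory, so it only suits small test files.

```python
import bz2


def bz2_contains(path, needle, encodings=("utf-8", "utf-16")):
    """Return True if `needle` occurs in the decompressed file.

    Hypothetical helper: the file is decoded with each candidate
    encoding in turn, since the encoding is not known in advance.
    """
    # Decompress the entire file; fine for small test fixtures only.
    with bz2.open(path, "rb") as handle:
        raw = handle.read()

    for encoding in encodings:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            # Not valid in this encoding, try the next candidate.
            continue
        if needle in text:
            return True
    return False
```

Calling it with a path to one of the files above and a marker string such as `"mediawiki"` would indicate whether that marker appears in the decompressed text; which marker is meaningful depends on the file.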
…--
bye,
pabs
https://bonedaddy.net/pabs3/
On behalf of my employer, I have packaged gensim for Debian:
https://tracker.debian.org/pkg/gensim
In the process of auditing the gensim git repository for inclusion in Debian I noticed by using web search engines that some of the files in the gensim/test/test_data/ directory seem to have been copied from the user comments on various websites such as IMDB. Presumably these comments were not owned by RaRe Technologies (or other gensim contributors) and were not licensed under the LGPL like the rest of gensim.
Other files seemed to indicate they were copied from Wikipedia, which definitely isn't LGPL. Others seemed to be statistics computed from some data and others seem to be generated files.
So I then wondered about all the files in the test data directory: where they came from, who owns them and what license they are under. Since many of them are binary files, I also wondered how they were generated, what data they were generated from, what tools they were generated with and what the copyright/licensing of those tools is.
Without any answers to these questions I wasn't confident that I could get gensim into Debian quickly, so I removed this directory from the Debian source package and added some patches.
I don't know if it will be feasible to reconcile this difference between the gensim git repository and the Debian source package, but I wanted to bring this to your attention and start a discussion about it.
It was mentioned in another issue that gensim tests in some cases generate files at test time instead of relying on pre-generated binary files. Perhaps some of the other tests could be changed to do that too.
For the cases where data is needed at test time, perhaps each data set could be in a separate directory and have a README alongside it detailing the provenance, copyright holders and licensing of each data set.
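
As a rough illustration of how that layout could be checked automatically, here is a minimal sketch. It assumes one subdirectory per data set and a README under one of a few common names; the function name `datasets_missing_readme` and the accepted file names are hypothetical choices, not an existing gensim convention.

```python
from pathlib import Path

# Assumed README file names; adjust to whatever convention is chosen.
README_NAMES = ("README", "README.md", "README.txt", "README.rst")


def datasets_missing_readme(test_data_dir):
    """Return the names of data set subdirectories that lack a README."""
    root = Path(test_data_dir)
    missing = []
    for entry in sorted(root.iterdir()):
        # Only subdirectories count as data sets in this sketch.
        if not entry.is_dir():
            continue
        if not any((entry / name).is_file() for name in README_NAMES):
            missing.append(entry.name)
    return missing
```

A check like this could run in CI and fail when the returned list is non-empty. It only confirms that a README exists, not that the provenance and licensing details in it are complete or correct.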
Some of the test data might no longer be needed and thus could be removed.