
Data/Model storage #1453

Closed
menshikh-iv opened this issue Jun 27, 2017 · 19 comments
Labels
difficulty medium Medium issue: required good gensim understanding & python skills feature Issue described a new feature

Comments

@menshikh-iv
Contributor

menshikh-iv commented Jun 27, 2017

We want to store trained models and popular datasets (in raw/preprocessed format). We also want to develop a simple API for accessing this data.

This project will make our users a bit happier.

Plan:

  1. Check how other frameworks share data (sklearn, spaCy, NLTK, etc.). Describe the advantages and disadvantages of each.
  2. Look at Google's storage offer for open-source projects (or other offers for open-source projects). We want to host the data without payment (because traffic might be substantial).
  3. Implement an API for downloading these external, potentially large datasets (based on the research from step 1).
  4. Create popular models and datasets and upload them to the storage.
  5. Write a clear, beginner-friendly tutorial (Jupyter notebook) on how to use this functionality.
@macks22
Contributor

macks22 commented Jun 28, 2017

Issues #717 and #746 are closely related to this.

@souravsingh
Contributor

@menshikh-iv sklearn stores smaller datasets and models in a separate folder, and also provides fetchers for datasets that are large or require preprocessing. Datasets can be downloaded by importing the relevant dataset namespace from sklearn.datasets.

NLTK provides a downloader which can be imported to download any of the available datasets.

For storing the datasets, we can keep them in the repo if they aren't large, or write a downloader script that does the job.

@menshikh-iv
Contributor Author

menshikh-iv commented Jul 3, 2017

@souravsingh thanks for the info, let's wait for the detailed comparison from @chaitaliSaini

@chaitaliSaini
Contributor

chaitaliSaini commented Jul 5, 2017

NLTK: provides a downloader with several interfaces (an interactive installer and installation via the command line) which can be used to download corpora, models, and other data packages for use with NLTK. (https://github.com/nltk/nltk/blob/develop/nltk/downloader.py)
E.g. nltk.download() opens a new NLTK downloader window in which the user can select the packages they want to download.

sklearn: comes with a few small standard datasets that do not require downloading any files from external websites. Other datasets are stored on mldata.org, and the sklearn.datasets package can download them directly from that repository.
(https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/datasets/mldata.py)
E.g. to download the MNIST digit recognition database:

from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original', data_home=custom_data_home)

spaCy: allows models to be downloaded and loaded manually, or via spaCy's download and link commands.
(https://github.com/explosion/spaCy/blob/master/spacy/cli/download.py)
E.g. python -m spacy download en
The download command installs the model via pip, places the package in the site-packages directory and creates a shortcut link that lets the user load the model by name.
Or install directly via pip:
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_md-1.2.0/en_core_web_md-1.2.0.tar.gz
Or download it manually; to use it with spaCy, the user has to assign it a name by creating a shortcut link for the data directory:
python -m spacy link [package name or path] [shortcut] [--force]

For storage:
Google Storage: Google does not provide free cloud services to open-source projects; for free usage there are 2 options:
1. use Google Cloud free for one year;
2. use Google Cloud free indefinitely, but with restrictions on the number of monthly queries, etc.
(https://cloud.google.com/free/)

mldata.org: a public repository for datasets. It's free of charge; a dataset's file size is limited to 1 GB.
(http://mldata.org/)

@menshikh-iv
Contributor Author

menshikh-iv commented Jul 8, 2017

In my opinion, we should use a "hybrid" sklearn+spacy approach.
For the "programmatic" way we would use several methods:

  • lookup("datasets") / lookup("models") - returns a list of datasets/models with short descriptions. lookup("models/fastText"), for example, should also work.
  • fetch_data("path/to/model", output_folder="/a/b/c") / fetch_data("path/to/dataset", output_folder="/a/b/c") - downloads a dataset, stores it in the folder, and returns the path to the main file (which we then use to load a model, for example).

For example, a user wants to download the English Wikipedia and store it on their local machine:
fetch_data("dataset/wikipedia/english", output_folder="/home/username/my_storage/")

For the "console" way we would expose the same methods via a submodule:

  • python -m gensim downloader.fetch_data ...
  • python -m gensim downloader.lookup ...

What do you think @gojomo @piskvorky @chaitaliSaini?
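To make the proposal concrete, here is a minimal sketch of the lookup/fetch_data pair. The catalogue dict and its URLs are entirely hypothetical placeholders; the real index would live in the remote storage:

```python
import os
import urllib.request

# Hypothetical catalogue; the real one would be fetched from the storage.
CATALOGUE = {
    "models/fastText/wiki-en": "https://example.com/wiki-en.bin",
    "dataset/wikipedia/english": "https://example.com/enwiki.xml.bz2",
}

def lookup(prefix):
    """Return catalogue entries whose path starts with `prefix`."""
    return sorted(p for p in CATALOGUE if p.startswith(prefix.rstrip("/")))

def fetch_data(path, output_folder):
    """Download `path` into `output_folder` and return the local file path."""
    url = CATALOGUE[path]
    target = os.path.join(output_folder, os.path.basename(url))
    urllib.request.urlretrieve(url, target)
    return target
```

A console wrapper would simply parse argv and forward to these two functions.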

@souravsingh
Contributor

@menshikh-iv We can talk to Rackspace about their cloud hosting service. Many open-source projects (MacPython, scikit-learn and manylinux) use Rackspace hosting.

@menshikh-iv
Contributor Author

@souravsingh we used Rackspace too (as temporary storage for wheels); we need to investigate this question.

@menshikh-iv
Contributor Author

I investigated spaCy's approach to data storage and it is awesome! Look at the spacy-models repo: they attach models to releases on GitHub.

It's free, with no limits on cumulative file size, number of queries, etc.; the only limitation is a file size < 2 GB.
I think this is the best approach for model/dataset storage 👍 .

@gojomo
Collaborator

gojomo commented Jul 12, 2017

Another option for big, high-traffic datasets, where gensim would want to be insulated from the potential costs of download popularity, is AWS S3 "requester-pays" buckets. Arxiv uses them; see:

https://arxiv.org/help/bulk_data_s3
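With boto3, a requester-pays download differs from a normal one only in the RequestPayer argument (the caller's AWS account is billed for the transfer). A hedged sketch; the bucket and key names below are illustrative:

```python
def requester_pays_args(bucket, key):
    """Build the kwargs for a get_object call against a requester-pays bucket."""
    return {"Bucket": bucket, "Key": key, "RequestPayer": "requester"}

def download(bucket, key, filename):
    """Managed download from a requester-pays bucket (requires boto3)."""
    import boto3  # assumed available: pip install boto3
    s3 = boto3.client("s3")
    # ExtraArgs carries RequestPayer through to the underlying requests.
    s3.download_file(bucket, key, filename,
                     ExtraArgs={"RequestPayer": "requester"})
```

Without the RequestPayer parameter, such buckets reject the request with 403 Forbidden.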

@menshikh-iv menshikh-iv added feature Issue described a new feature difficulty medium Medium issue: required good gensim understanding & python skills labels Oct 2, 2017
@akutuzov
Contributor

akutuzov commented Oct 2, 2017

@menshikh-iv
Contributor Author

menshikh-iv commented Oct 11, 2017

New plan proposal

We need to implement 2 functions: load and info (these functions form the public API).

Naming convention:
for datasets: lowercase + spaces replaced with '-', something like "en-wikipedia-full" (no strict structure)
for models: <model_type>-<model_name>-<dimension-of-vector?>

def info(name=None) - information about the data, returns json with the info.
For models:
- the source (where the model comes from, with links to a detailed description OR the description itself)
- parameters
- related papers (if needed)
- the dataset used for training (if we know it)
- a link to the preprocessing code (if we have it)

For datasets:
- the source (where the dataset comes from, with links to a detailed description OR the description itself)
- related papers (if needed)

If name==None - return the full json with all data.

def load(name, return_model_path=False) - download + load the data.
For models: return the loaded model OR the path to the folder with the model.
For datasets: return the path to the folder with the dataset.

Algorithm

  1. If ~/gensim-data does not exist:
    • create the folder
  2. If name is available in ~/gensim-data:
    • return it loaded into memory / the path (depending on return_model_path and on whether it is a model or a dataset)
  3. If name is available in the github storage:
    • download the archive to a temporary directory (tempfile.mkdtemp)
    • check the archive's hash; if a problem is detected, raise an Exception and recommend re-running load
    • unpack it there + remove the original archive
    • rename('/tmp/randomtempfolder/<data_name>', '~/gensim-data/<data_name>')
    • Goto (2)

Additional requirements:

  1. Store everything (data + info) on GitHub (we will not proxy the links)
  2. No need to support aliases (because we'll have detailed descriptions for the data)
  3. CLI support (a simple main function with argparse + the needed calls)
  4. Instructions for uploading new data to GitHub
  5. On GitHub, we store the data, the info + the code for loading it into memory
  6. Tests
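The algorithm above might look roughly like this. A sketch only: the URL, hash and archive format are placeholders, and a real implementation would add logging and retries:

```python
import hashlib
import os
import shutil
import tempfile
import urllib.request

BASE_DIR = os.path.expanduser("~/gensim-data")  # step 1 of the algorithm

def sha256sum(path, chunk=1 << 20):
    """Stream a file through SHA-256 (used in step 3 to verify the archive)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def load(name, url, expected_hash, base_dir=BASE_DIR):
    """Reuse a local copy if present; otherwise download to a temp dir,
    verify the hash, unpack, and move the result into ~/gensim-data."""
    target = os.path.join(base_dir, name)
    if os.path.isdir(target):                     # step 2: already installed
        return target
    os.makedirs(base_dir, exist_ok=True)          # step 1: create the folder
    tmp = tempfile.mkdtemp()
    archive = os.path.join(tmp, name + ".tar.gz")  # placeholder format
    urllib.request.urlretrieve(url, archive)      # step 3: download
    if sha256sum(archive) != expected_hash:
        raise IOError("checksum mismatch for %r, please re-run load()" % name)
    shutil.unpack_archive(archive, os.path.join(tmp, name))
    os.remove(archive)                            # remove the original archive
    shutil.move(os.path.join(tmp, name), target)  # rename into ~/gensim-data
    return target
```

Moving the fully unpacked folder into place last means an interrupted download never leaves a half-installed dataset in ~/gensim-data.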

@piskvorky
Owner

piskvorky commented Oct 11, 2017

Some examples would be helpful. For example, if name==None - return full json with data -- does this mean the structure of the response is different (a list of what would be returned when name != None)?

Otherwise the functionality looks great in general: I especially like the idea with the "related papers / preprocessing code".

What does No need to support aliases (because we'll have detailed description for data) mean? What are "aliases"?

For dataset: return path to folder with dataset -- what is the effect of return_model_path=False? Do we always return path, or do we return an open object (data iterator)?

@menshikh-iv
Contributor Author

menshikh-iv commented Oct 12, 2017

@piskvorky

What does No need to support aliases (because we'll have a detailed description for data) mean? What are "aliases"?

Earlier I thought this would be a very useful feature (because users know datasets under different names), but now, if we add more descriptions (related papers, etc.), it is not needed.

For dataset: return path to the folder with dataset -- what is the effect of return_model_path=False? Do we always return path, or do we return an open object (data iterator)?

So, I think return_model_path should only have an effect for models (and no effect for datasets).
For datasets I propose returning the path to the folder, because many datasets are split into several files, for example.

Some examples would be helpful. For example, if name==None - return full json with data -- does this mean the structure of the response is different (a list of what would be returned if name != None)?

Agreed, let me show an example.

If name==None - it should be the full dump (a dict with information about all models and datasets):

{
	"models":
	{
		"word2vec-googlenews-300":
		{
			"description": "Pre-trained vectors trained on part of Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases.",
			"parameters": "dimension=300",
			"papers": "https://arxiv.org/abs/1310.4546, https://arxiv.org/abs/1301.3781",
			"dataset": "Google news",
			"language": "en"
		},

		"glove-twitter-50":
		{
			"description": "Pre-trained vectors, 2B tweets, 27B tokens, 1.2M vocab, uncased. https://nlp.stanford.edu/projects/glove/",
			"parameters": "dimension=50",
			"preprocessing": "Converted to w2v format with `python -m gensim.scripts glove2word2vec <fname>`",
			"papers": "https://nlp.stanford.edu/pubs/glove.pdf",
			"dataset": "Twitter"
		},
		"glove-commoncrawl-300":
		{
			"description": "Pre-trained vectors, 42B tokens, 1.9M vocab, uncased. https://nlp.stanford.edu/projects/glove/",
			"parameters": "dimension=300",
			"preprocessing": "Converted to w2v format with `python -m gensim.scripts glove2word2vec <fname>`",
			"papers": "https://nlp.stanford.edu/pubs/glove.pdf",
			"dataset": "Common Crawl"
		}
	},

	"datasets":
	{
		"text8":
		{
			"description": "Cleaned small sample from wikipedia"
		}
	}
}

If name is contained in the json, only the "leaf" is returned, i.e. if name=="glove-commoncrawl-300", the output is

{
	"description": "Pre-trained vectors, 42B tokens, 1.9M vocab, uncased. https://nlp.stanford.edu/projects/glove/",
	"parameters": "dimension=300",
	"preprocessing": "Converted to w2v format with `python -m gensim.scripts glove2word2vec <fname>`",
	"papers": "https://nlp.stanford.edu/pubs/glove.pdf",
	"dataset": "Common Crawl"
}

If name is not contained in the json, an exception will be raised.
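Sketched in code, info is a thin lookup over that json (the catalogue below is trimmed to two entries for illustration; the real file would live in the github storage):

```python
# Trimmed, illustrative catalogue in the shape shown above.
CATALOGUE = {
    "models": {
        "glove-twitter-50": {"parameters": "dimension=50", "dataset": "Twitter"},
    },
    "datasets": {
        "text8": {"description": "Cleaned small sample from wikipedia"},
    },
}

def info(name=None):
    """Return the full catalogue, or the 'leaf' dict for a single name."""
    if name is None:
        return CATALOGUE
    for section in ("models", "datasets"):
        if name in CATALOGUE[section]:
            return CATALOGUE[section][name]
    raise ValueError("Incorrect model/corpus name: %r" % name)
```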

@piskvorky
Owner

piskvorky commented Oct 12, 2017

It all sounds good to me.

In addition, I'd suggest the combination of {resource is data + return_model_path=False (better: return_path=False?)} would return an already open object. Where open = ready for whatever activity is usually done with this corpus: simple lines iterator (just path opened with smart_open), or iterator splitting into tokens (sentences), something else. Basically open the dataset with the most-common-usecase class. If the user wants to do something else with this data resource, they'd set return_path=True and then open the corpus in another way.

@gojomo @jayantj thoughts?
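One way to read that suggestion as code (a sketch; gensim would presumably use smart_open rather than the builtin open, and pick a richer reader per corpus type):

```python
def open_corpus(path, return_path=False):
    """Return the raw path, or the corpus opened for its most common use:
    here, a restartable iterator over tokenised lines."""
    if return_path:
        return path

    class LineCorpus:
        def __init__(self, fname):
            self.fname = fname

        def __iter__(self):  # restartable: re-opens the file on each scan
            with open(self.fname, encoding="utf-8") as f:
                for line in f:
                    yield line.split()

    return LineCorpus(path)
```

Wrapping the file in an iterable class (rather than returning a bare file handle) avoids the single-scan StopIteration problem raised below.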

@menshikh-iv
Contributor Author

menshikh-iv commented Oct 12, 2017

@piskvorky there is no single-valued, so I would not like to do so.
For me, if I loaded a dataset, I think that all of them in memory (no open('...') that raise me StopIteration after one scan).
On the one hand, a line-iterator is better (a simpler and more universal solution), but what about datasets with multiple files? Raise an exception?
On the other hand, a tokenized dataset is good too, but how do we resolve which "view" of the dataset is the current one? For that we would also have to store different versions: raw, tokenized, etc. I don't think this is very useful (and it is also more complicated).

I don't see a good simple solution for this problem.

@piskvorky
Owner

piskvorky commented Oct 29, 2017

@menshikh-iv I don't understand the English. What does there is no single-valued mean?

Same with if I loaded a dataset, I think that all of them in memory (no open('...') that raise me StopIteration after one scan).

My suggestion was to return a ready corpus (iterable) for return_model_path=False, and a path otherwise.

@menshikh-iv
Contributor Author

I don't understand the English. What does there is no single-valued mean?

Same with if I loaded a dataset, I think that all of them in memory (no open('...') that raise me StopIteration after one scan).

I mean that we have no universal solution, because

  1. A dataset may already be split into several files; which file should we open?
  2. A typical open allows you to read only once, which is not very convenient

@piskvorky
Owner

piskvorky commented Nov 3, 2017

OK, but even if a dataset is split across several files, we still need some code/class to access and use that dataset, right? So let's return that from load. Otherwise, what's the point of the dataset? Or can you give an example of what you mean?

Same for the second point: whatever the user typically has to do with such an opened file, we can do for them automatically. And if they cannot do anything with it, then why even include the file?

@menshikh-iv
Contributor Author

@macks22 @akutuzov @gojomo if you want to add any model/dataset - feel free to contribute to https://github.com/RaRe-Technologies/gensim-data

VaiyeBe pushed a commit to VaiyeBe/gensim that referenced this issue Nov 26, 2017
* added download and catalogue functions

* added link and info

* modified link and info functions

* Updated download function

* Added logging

* Added load function

* Removed unused imports

* added check for installed models

* updated download function

* Improved help for terminal

* load returns model path

* added jupyter notebook and merged code

* alternate names for load

* corrected formatting

* added checksum after download

* refactored code

* removed log file code

* added progressbar

* fixed pep8

* added tests

* added download for >2gb data

* add test for multipart

* fixed pep8

* remove tar.gz, use only .gz for all

* fix codestyle/docstrings[1]

* add module docstring

* add downloader to apiref

* Fix CLI + more documentation

* documentation for load

* renaming

* fix tests

* fix tests[2]

* add test for info

* reduce logging

* Add return_path=True example to docstring

* fix

* update & rename notebook

* Fix docstring + use ValueError when name is incorrect

* move list to global var

7 participants