[MRG] Data/model storage. Fix 1453 #1705

menshikh-iv · 2017-11-10T04:31:12Z

Based on #1453.
What's done:

Docstrings (fixed mistakes, added examples, etc.)
Removed --catalogue (use --info instead, as in the Python API)
Renaming / style fixes
Removed unpacking (this saves a lot of disk space)

piskvorky · 2017-11-10T23:10:24Z

gensim/downloader.py

+
+
+def load(name, return_path=False):
+    """Download (if needed) dataset/model and load it to memory (managed by return_path).


"managed by return_path" doesn't sound correct. Do you mean "unless return_path is set"?

Yes, that's true.

piskvorky · 2017-11-10T23:11:11Z

gensim/downloader.py

+    _create_base_dir()
+    file_name = _get_filename(name)
+    if file_name is None:
+        raise Exception(


Exception too generic. This looks like a ValueError.
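
A minimal sketch of the suggested change. `CATALOGUE`, `get_filename` and `resolve` are hypothetical stand-ins for illustration, not the actual helpers in gensim/downloader.py; the point is only that an unknown name is a bad argument value, so `ValueError` fits better than a bare `Exception`.

```python
# Hypothetical stand-in for the catalogue described by list.json.
CATALOGUE = {"text8": "text8.gz"}


def get_filename(name):
    """Return the stored file name for `name`, or None if the name is unknown."""
    return CATALOGUE.get(name)


def resolve(name):
    """Look up `name`, raising ValueError (not a bare Exception) when it is unknown."""
    file_name = get_filename(name)
    if file_name is None:
        raise ValueError("Incorrect model/corpus name: {!r}".format(name))
    return file_name
```

Callers can then catch `ValueError` specifically without also swallowing unrelated errors.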

piskvorky · 2017-11-10T23:12:07Z

gensim/downloader.py

+    folder_dir = os.path.join(base_dir, name)
+    path = os.path.join(folder_dir, file_name)
+    if not os.path.exists(folder_dir):
+        _download(name)


Any way to "force" a download? (fix previously broken / partial / deleted downloads etc)

This situation can't happen, because the data is moved from /tmp/.. to ~/gensim-data only after all needed files have been downloaded and their md5 checked. For this reason, we don't add this option (if something does go wrong, just run load again).
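
The "stage, verify, then move" idea described in this reply can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `fetch_verified`, `md5_of` and the `fetch` callable are hypothetical names, and the real downloader may differ in details such as multipart files.

```python
import hashlib
import os
import shutil
import tempfile


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fin:
        for chunk in iter(lambda: fin.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_verified(name, file_name, expected_md5, fetch, base_dir):
    """Fetch into a temp dir, verify the checksum, then move into base_dir.

    `fetch(dst_path)` is a hypothetical callable that writes the payload to dst_path.
    """
    final_dir = os.path.join(base_dir, name)
    tmp_dir = tempfile.mkdtemp()
    try:
        staged_dir = os.path.join(tmp_dir, name)
        os.makedirs(staged_dir)
        staged = os.path.join(staged_dir, file_name)
        fetch(staged)
        if md5_of(staged) != expected_md5:
            raise ValueError("Checksum mismatch for {!r}".format(name))
        os.makedirs(base_dir, exist_ok=True)
        # Only verified data ever reaches base_dir.
        shutil.move(staged_dir, final_dir)
    finally:
        # Clean up the staging area whether or not verification succeeded.
        shutil.rmtree(tmp_dir, ignore_errors=True)
    return os.path.join(final_dir, file_name)
```

Because the final folder only appears after verification, a check like `os.path.exists(folder_dir)` is enough to decide whether a download is needed, as long as nothing modifies the folder afterwards.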

piskvorky · 2017-11-10T23:13:33Z

gensim/downloader.py

+    else:
+        sys.path.insert(0, base_dir)
+        module = __import__(name)
+        return module.load_data()


Where do I find this load_data function? For example for Wikipedia (json-line format from segment_wiki), and for text8?

I got a bit lost in the import magic, and there are no comments.

text8 and wiki-en
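
To make the import step less magical, here is a self-contained sketch of the mechanism shown in the diff: the base directory is put on `sys.path`, the dataset folder is imported as a package, and its own `load_data()` is called. The toy package, its file names, and the extra `base_dir` / `file_name` parameters are assumptions made for illustration; the real `load` takes only `name` and `return_path`.

```python
import importlib
import os
import sys
import tempfile


def load(name, base_dir, file_name, return_path=False):
    """Return the data file path, or the result of the dataset's own load_data()."""
    path = os.path.join(base_dir, name, file_name)
    if return_path:
        return path
    # Each dataset folder is assumed to be a package exposing load_data().
    if base_dir not in sys.path:
        sys.path.insert(0, base_dir)
    importlib.invalidate_caches()
    module = importlib.import_module(name)
    return module.load_data()


# Build a toy dataset package to exercise the loader.
base_dir = tempfile.mkdtemp()
pkg_dir = os.path.join(base_dir, "toy_corpus")
os.makedirs(pkg_dir)
with open(os.path.join(pkg_dir, "toy_corpus.txt"), "w") as fout:
    fout.write("hello world\n")
with open(os.path.join(pkg_dir, "__init__.py"), "w") as fout:
    fout.write(
        "import os\n"
        "def load_data():\n"
        "    path = os.path.join(os.path.dirname(__file__), 'toy_corpus.txt')\n"
        "    with open(path) as fin:\n"
        "        return [line.split() for line in fin]\n"
    )

print(load("toy_corpus", base_dir, "toy_corpus.txt"))  # [['hello', 'world']]
```

The upshot is that the loading logic lives next to each dataset (in its `__init__.py`), not in the downloader itself.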

menshikh-iv · 2017-11-13T05:02:55Z

One last feature still needs to be added:

the ability to use a different list.json. I need this to add new items to gensim-data safely. For now I have to commit directly to master, which is OK at the moment but will be a real problem after the release, because it can produce a "race condition".

The process should look like this:

add the model to the release page
create a PR with an updated list.json (new name, hashes and so on)
check that everything downloads and loads correctly
merge the PR -> the new model is available to all users
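
One possible shape for such an override, sketched under assumptions: the environment variable name, the `read_catalogue` helper and the JSON fields below are all hypothetical, since the discussion does not specify how the alternative list.json would be selected or what it contains.

```python
import json
import os
import tempfile

# Hypothetical variable name; the PR discussion does not specify one.
CATALOGUE_ENV = "DEMO_DATA_LIST_JSON"


def read_catalogue(default_path):
    """Read the catalogue from an override file if one is configured, else the default."""
    path = os.environ.get(CATALOGUE_ENV, default_path)
    with open(path) as fin:
        return json.load(fin)


# Demo: a released catalogue and a candidate catalogue from an open PR.
work_dir = tempfile.mkdtemp()
default_path = os.path.join(work_dir, "list.json")
candidate_path = os.path.join(work_dir, "candidate_list.json")
with open(default_path, "w") as fout:
    json.dump({"text8": {"file_name": "text8.gz"}}, fout)
with open(candidate_path, "w") as fout:
    json.dump(
        {"text8": {"file_name": "text8.gz"}, "new-model": {"file_name": "new-model.gz"}},
        fout,
    )

os.environ.pop(CATALOGUE_ENV, None)
released = read_catalogue(default_path)    # what all users see
os.environ[CATALOGUE_ENV] = candidate_path
candidate = read_catalogue(default_path)   # what the PR author tests against
os.environ.pop(CATALOGUE_ENV, None)
```

With something like this, the "check that everything downloads and loads correctly" step can run against the candidate file before the PR is merged, without touching what users see.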

chaitaliSaini and others added 30 commits July 30, 2017 04:59

added download and catalogue functions

ec8c016

added link and info

636bfff

modeified link and info functions

fffe203

Updated download function

f567dee

Added logging

61ba3d6

Added load function

d8257a3

Removed unused imports

5571469

added check for installed models

cabf173

updated download function

5d509fc

Improved help for terminal

551f54e

load returns model path

ff5509f

added jupyter notebook and merged code

e654070

alternate names for load

b0d1110

corrected formatting

498b32b

added checksum after download

03649b0

refactored code

7fbf228

removed log file code

d0311d1

added progressbar

7e00e2d

fixed pep8

f38670d

added tests

4cadfa2

added download for >2gb data

e844e01

add test for multipart

580a93a

fixed pep8

e899f88

remove tar.gz, use only .gz for all

f0fb2ef

fix codestyle/docstrings[1]

e1daae8

add module docstring

fc440e4

Merge branch 'develop' into datamodel

5f87a39

add downloader to apiref

1db8e09

Fix CLI + more documentation

80a2c69

documentation for load

999e5d1

renaming

b1e89e8

menshikh-iv mentioned this pull request Nov 10, 2017

[WIP] Data/model storage. Fix 1453 #1632

Closed

menshikh-iv added 7 commits November 10, 2017 10:41

fix tests

7b3429e

fix tests[2]

3c2cf55

add test for info

ba49946

reduce logging

ae6798b

Add return_path=True example to docstring

b267b12

fix

d58bebe

update & rename notebook

511cc55

piskvorky requested changes Nov 10, 2017

Fix docstring + use ValueError when name is incorrect

e3f64ab

move list to global var

1251322

menshikh-iv merged commit f99612d into develop Nov 14, 2017

menshikh-iv deleted the datamodel branch November 14, 2017 08:31

This was referenced Nov 14, 2017

Link to common datasets #746

Closed

Getting started datasets #717

Closed
