Add dat mirror for the released dataset #1506

millette · 2017-06-15T19:03:50Z

~~Add dat://49bd045de3beb9abcb7272967e2fb16e07b96c06e15cd814f703e8581d4561e5 to https://libraries.io/data as a mirror.~~

Note that instead of a single zip, the dat holds each file individually, and the CSV files were gzipped to save space. ~~The content is available: https://datproject.org/view?query=49bd045de3beb9abcb7272967e2fb16e07b96c06e15cd814f703e8581d4561e5~~

Since there's no more zip, it's probably a good idea to provide md5 checksums for all files (both gzipped and not).

EDIT: Updated https://datproject.org link above.

EDIT#2: Dat removed on my end.


joehand · 2017-06-15T21:01:23Z

Nice! Let us know when this is published as a mirror and we can host it on another peer.

Previewing massive csv files is in our TODOs for datproject.org. This would be a cool dataset to demo =).

> Since there's no more zip, it's probably a good idea to provide md5 checksums for all files (both gzipped and not).

Dat verifies content integrity when downloading, so this won't be strictly necessary for users downloading via dat. It may still be useful for anyone who downloads via HTTP.

millette · 2017-06-15T21:13:42Z

In this case, I'm not the original publisher of the dataset, but I could mess with the content of the dat I created (not that I will). Having the right md5 at libraries.io lets users know they got the right stuff.
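As a rough sketch of what such a checksum list could be built from, the Python below computes two md5 digests for a gzipped file: one for the `.gz` file as stored, and one for the decompressed content. The function names and the chunk size are illustrative assumptions, not anything libraries.io or dat provides.

```python
import gzip
import hashlib


def md5_hex(stream, chunk_size=1 << 20):
    """Return the MD5 hex digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def checksums_for_gz(path):
    """Return (md5 of the .gz file itself, md5 of its decompressed content)."""
    with open(path, "rb") as raw:
        compressed = md5_hex(raw)
    with gzip.open(path, "rb") as plain:
        uncompressed = md5_hex(plain)
    return compressed, uncompressed
```

Reading in chunks keeps memory use flat, which matters for multi-gigabyte CSV files. Publishing both digests lets someone check a mirror's `.gz` file directly, or check the data after unpacking it.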

rmg · 2017-07-09T00:11:02Z

Went to download the dataset and when I saw it was 5.5G I nearly opened a duplicate of this issue before finding this one. It seems like a perfect fit!

@andrew is there anything anyone could do to help out with this? Seems to mostly come down to you running `dat share` to create the canonical version, if I understand correctly.

edit: Hilariously, after downloading the dump, I don't actually have enough free disk space to unpack it 😊

rmg · 2017-07-10T15:24:23Z

I spun up a small droplet on DigitalOcean (with a dedicated volume for this, since the dataset is too large for the on-device disk) and created, shared, and published 3 separate dats to choose from, in case it helps get things rolling:

Uncompressed csv files:

~~rmg/librariesio-csv~~
dat://9cc39cf0aa559c02c34133cf2ad22e89b61b51c94b0cfbc8b0c608573620ef7b
31G

gzip compressed csv files:

~~rmg/librariesio-csv-gz~~
dat://b39e6298484f71686da3086233aca9e5d68dd7ad7d8c184b36b28c24c9ca03c3
5.6G

xz compressed csv files:

~~rmg/librariesio-csv-xz~~
dat://2659dde37819d8e6b27c3e1312b407a99a905ad31f228659e1a69afe3ac731d0
3.5G

@joehand I couldn't find any sort of compression built in to dat. Is there a preferred/recommended way of dealing with large dats? I imagine most CSVs compress pretty well given the nature of the format.
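For reference, the sizes listed above work out to roughly 18% of the uncompressed size for gzip (5.6G of 31G) and roughly 11% for xz (3.5G of 31G). A minimal sketch for trying the same comparison on a small sample follows; the CSV rows are made up for illustration, so the ratios it prints say nothing about the real dataset.

```python
import gzip
import lzma

# Synthetic CSV with made-up columns; real ratios depend on the actual data.
rows = ["ID,Platform,Name,Version"]
rows += [f"{i},NPM,package-{i % 50},1.{i % 10}.0" for i in range(20000)]
raw = "\n".join(rows).encode("utf-8")

gz = gzip.compress(raw)
xz = lzma.compress(raw)  # lzma.compress writes the .xz container by default

for label, blob in (("raw", raw), ("gzip", gz), ("xz", xz)):
    print(f"{label:>4}: {len(blob):>8} bytes ({len(blob) / len(raw):.1%} of raw)")
```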

@andrew if you want to `dat clone` any of the above and verify the files then it might save you some legwork/bandwidth/time to just bless one/all of these.

@millette did you stop sharing your dat? I haven't been able to clone it.

edit: I've deleted the above-mentioned droplet, so the dats no longer have any backing.

millette · 2017-07-10T16:02:16Z

Oh, the machine hosting my dat was rebooted, I just restarted sharing it. Sorry.

rmg · 2017-07-10T17:40:28Z

@millette thanks. I was able to clone and verify that our gz files have the same shasums. I honestly expected there to be a difference due to timestamps or something, but I guess anything like that was either preserved in the original zip file or not part of the default gz header. If nothing else, we now know each other to be equivalently nefarious?
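One plausible explanation, sketched below: the gzip header has a modification-time field (MTIME in RFC 1952), and the `gzip` command line tool fills it from the source file's timestamp by default. If the zip preserved the original timestamps, two people gzipping the extracted files would write the same header. The Python here only demonstrates that effect in memory; the sample row and the timestamps are made up.

```python
import gzip
import hashlib
import io


def gzip_bytes(data, mtime):
    """Gzip `data` in memory, writing `mtime` into the header's MTIME field."""
    buffer = io.BytesIO()
    with gzip.GzipFile(fileobj=buffer, mode="wb", mtime=mtime) as gz:
        gz.write(data)
    return buffer.getvalue()


csv_data = b"ID,Platform,Name\n1,NPM,left-pad\n"  # made-up sample row

same_a = gzip_bytes(csv_data, mtime=1497553430)
same_b = gzip_bytes(csv_data, mtime=1497553430)
different = gzip_bytes(csv_data, mtime=1499707463)

# Same input and same header timestamp: the digests match.
print(hashlib.sha1(same_a).hexdigest() == hashlib.sha1(same_b).hexdigest())
# A different header timestamp changes the file's digest...
print(hashlib.sha1(same_a).hexdigest() == hashlib.sha1(different).hexdigest())
# ...even though the decompressed content is unchanged.
print(gzip.decompress(different) == csv_data)
```

This is also why a digest of the decompressed content is the more robust thing to compare between mirrors: it does not depend on how or when each copy was compressed.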

joehand · 2017-07-10T20:10:08Z

@rmg no recommended way yet. We're planning on adding automatic compression for transport, but that isn't implemented yet.

Depending on the use case, it may be useful to keep the files uncompressed. But until we add automatic compression, transferring them uncompressed may be too slow.

millette · 2017-08-27T05:09:53Z

I removed the dat I was hosting and updated the OP. Feel free to recreate it of course.

andrew · 2017-10-09T16:35:47Z

Moving this to the Backlog as we'd still like to implement it but can't see that happening in the near future.

andrew self-assigned this Jun 15, 2017

andrew added enhancement open-data labels Jun 15, 2017

andrew added the roadmap label Oct 9, 2017

andrew closed this as completed Oct 9, 2017

andrew removed their assignment Aug 18, 2018
