Add dat mirror for the released dataset #1506

Closed
millette opened this issue Jun 15, 2017 · 9 comments

Comments

@millette

millette commented Jun 15, 2017

Add dat://49bd045de3beb9abcb7272967e2fb16e07b96c06e15cd814f703e8581d4561e5 to https://libraries.io/data as a mirror.

Note that instead of a zip, the dat holds each file individually, and the csv files are gzipped to save space. The content can be browsed at: https://datproject.org/view?query=49bd045de3beb9abcb7272967e2fb16e07b96c06e15cd814f703e8581d4561e5

Since there's no more zip, it's probably a good idea to provide md5 for all files (both gzipped and not).
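
As a rough illustration of what "md5 for both gzipped and not" could look like, here is a minimal Python sketch that computes both digests for one `.csv.gz` file without unpacking it to disk. The function names are made up for this example and are not part of dat or libraries.io.

```python
import gzip
import hashlib


def md5_of_stream(stream, chunk_size=1 << 20):
    """Hash a binary stream in chunks so large files never sit fully in memory."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def md5_pair(gz_path):
    """Return (md5 of the .gz file itself, md5 of its decompressed contents)."""
    with open(gz_path, "rb") as raw:
        compressed = md5_of_stream(raw)
    with gzip.open(gz_path, "rb") as unpacked:
        uncompressed = md5_of_stream(unpacked)
    return compressed, uncompressed
```

Calling `md5_pair("some-file.csv.gz")` (a placeholder path) would give the two values to publish next to each file.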

EDIT: Updated https://datproject.org link above.

EDIT#2: Dat removed on my end.

@joehand

joehand commented Jun 15, 2017

Nice! Let us know when this is published as a mirror and we can host it on another peer.

Previewing massive csv files is in our TODOs for datproject.org. This would be a cool dataset to demo =).

> Since there's no more zip, it's probably a good idea to provide md5 for all files (both gzipped and not).

Dat will verify the content integrity when downloading, so this won't be strictly necessary for users downloading via dat. But it may still be useful for anyone downloading via HTTP.

@millette
Author

millette commented Jun 15, 2017

In this case, I'm not the original publisher of the dataset, but I could mess with the content of the dat I created (not that I will). Having the right md5 at libraries.io lets users know they got the right stuff.
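
To make that concrete, here is a minimal sketch of how a user could check a cloned mirror against checksums published by the original source. It assumes the checksums come as `md5sum`-style lines (`<digest>  <file name>`); that manifest format and the function names are assumptions for illustration, not something libraries.io currently publishes.

```python
import hashlib
from pathlib import Path


def parse_manifest(text):
    """Parse md5sum-style lines ("<hex digest>  <file name>") into {name: digest}."""
    expected = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        digest, name = line.split(maxsplit=1)
        # md5sum marks binary mode with a leading "*" on the file name.
        expected[name.strip().lstrip("*")] = digest.lower()
    return expected


def file_md5(path, chunk_size=1 << 20):
    """Compute the md5 of a file in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory, manifest_text):
    """Return the names of files that are missing or whose md5 does not match."""
    bad = []
    for name, digest in parse_manifest(manifest_text).items():
        path = Path(directory) / name
        if not path.is_file() or file_md5(path) != digest:
            bad.append(name)
    return bad
```

An empty list from `verify(...)` means every listed file matched the trusted digests, regardless of who hosts the mirror.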

@rmg

rmg commented Jul 9, 2017

Went to download the dataset and when I saw it was 5.5G I nearly opened a duplicate of this issue before finding this one. It seems like a perfect fit!

@andrew is there anything anyone could do to help out with this? Seems to mostly come down to you running `dat share` to create the canonical version, if I understand correctly.

edit: Hilariously, after downloading the dump, I don't actually have enough free disk space to unpack it 😊

@rmg

rmg commented Jul 10, 2017

I spun up a little droplet on digitalocean (with a dedicated volume for this, since it's too large for the on-device disk) and created, shared, and published 3 separate dats to choose from in case it helps get things rolling:

Uncompressed csv files:

  • rmg/librariesio-csv
  • dat://9cc39cf0aa559c02c34133cf2ad22e89b61b51c94b0cfbc8b0c608573620ef7b
  • 31G

gzip compressed csv files:

  • rmg/librariesio-csv-gz
  • dat://b39e6298484f71686da3086233aca9e5d68dd7ad7d8c184b36b28c24c9ca03c3
  • 5.6G

xz compressed csv files:

  • rmg/librariesio-csv-xz
  • dat://2659dde37819d8e6b27c3e1312b407a99a905ad31f228659e1a69afe3ac731d0
  • 3.5G

@joehand I couldn't find any sort of compression built in to dat. Is there a preferred/recommended way of dealing with large dats? I imagine most CSVs compress pretty well given the nature of the format.
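
For a feel of the gzip vs. xz difference on CSV-shaped data, here is a small Python sketch using only the standard library. The sample rows and column names are synthetic and invented for this example, so the ratios it prints will not match the real dataset's 31G / 5.6G / 3.5G figures.

```python
import gzip
import lzma


def compressed_sizes(data):
    """Return the byte size of the data raw, gzip-compressed, and xz-compressed."""
    return {
        "raw": len(data),
        "gzip": len(gzip.compress(data)),
        "xz": len(lzma.compress(data)),  # lzma defaults to the .xz container format
    }


# Synthetic, repetitive CSV rows standing in for a real export.
sample = (
    "name,platform,version\n"
    + "".join(f"pkg{i},npm,1.{i % 10}.0\n" for i in range(50_000))
).encode()

print(compressed_sizes(sample))
```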

@andrew if you want to `dat clone` any of the above and verify the files, it might save you some legwork/bandwidth/time to just bless one or all of these.

@millette did you stop sharing your dat? I haven't been able to clone it.

edit: I've deleted the above-mentioned droplet and the dats no longer have any backing.

@millette
Author

Oh, the machine hosting my dat was rebooted, I just restarted sharing it. Sorry.

@rmg

rmg commented Jul 10, 2017

@millette thanks. I was able to clone and verify that our gz files have the same shasums. I honestly expected there to be a difference due to timestamps or something, but I guess anything like that was either preserved in the original zip file or not part of the default gz header. If nothing else, we now each know each other to be equivalently nefarious?
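
For anyone curious about the timestamp question: the gzip header carries a modification-time field, so two archives of identical content only hash the same when that field (and any stored file name) match. A minimal Python sketch of the effect, using the standard library rather than the `gzip` command line, with an illustrative helper name:

```python
import gzip
import hashlib
import io


def gzip_bytes(data, mtime):
    """Gzip data in memory with an explicit header timestamp and no stored file name."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb", mtime=mtime) as gz:
        gz.write(data)
    return buf.getvalue()


data = b"id,name\n1,example\n"

same_a = gzip_bytes(data, mtime=0)
same_b = gzip_bytes(data, mtime=0)
other = gzip_bytes(data, mtime=1500000000)

# Identical header timestamp: byte-identical output, so identical checksums.
print(hashlib.sha1(same_a).hexdigest() == hashlib.sha1(same_b).hexdigest())
# Different header timestamp: same content inside, different archive checksum.
print(hashlib.sha1(same_a).hexdigest() == hashlib.sha1(other).hexdigest())
```

This is also why hashing the decompressed contents is the more robust comparison between independently created mirrors.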

@joehand

joehand commented Jul 10, 2017

@rmg no recommended way yet. We're planning on adding automatic compression for transport, but that isn't implemented yet.

Depending on the use case, it may be useful to keep the files uncompressed. But until we add automatic compression, that may make transfers too slow.

@millette
Author

I removed the dat I was hosting and updated the OP. Feel free to recreate it of course.

andrew added the roadmap label Oct 9, 2017
@andrew
Contributor

andrew commented Oct 9, 2017

Moving this to the Backlog as we'd still like to implement it but can't see that happening in the near future.

andrew closed this as completed Oct 9, 2017
andrew removed their assignment Aug 18, 2018