Add dat mirror for the released dataset #1506
Comments
Nice! Let us know when this is published as a mirror and we can host it on another peer. Previewing massive csv files is in our TODOs for datproject.org. This would be a cool dataset to demo =).
Dat will verify the content integrity when downloading, so this won't be strictly necessary for users downloading via dat. It may still be useful if you want to download via HTTP.
In this case, I'm not the original publisher of the dataset, but I could mess with the content of the dat I created (not that I will). Having the right md5 at libraries.io lets users know they got the right stuff.
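For anyone who wants to check a download against published checksums, here is a minimal Python sketch of building an md5 manifest for a directory of files. It is only an illustration: the function names and the `projects.csv.gz` file name mentioned below are hypothetical, and nothing here is part of dat or libraries.io. For each `.gz` file it records the md5 of both the compressed file and its decompressed contents, so either form can be compared.

```python
import gzip
import hashlib
from pathlib import Path


def md5_of_stream(stream, chunk_size=1 << 20):
    """Return the hex MD5 of a binary file-like object, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def md5_manifest(directory):
    """Map each file name to its MD5; .gz files also get the MD5 of their contents."""
    manifest = {}
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        # MD5 of the file exactly as stored on disk.
        with path.open("rb") as fh:
            manifest[path.name] = md5_of_stream(fh)
        if path.suffix == ".gz":
            # Hash the decompressed bytes too, keyed by the name without ".gz".
            with gzip.open(path, "rb") as fh:
                manifest[path.stem] = md5_of_stream(fh)
    return manifest
```

Called on a directory holding a hypothetical `projects.csv.gz`, this returns one entry for `projects.csv.gz` and one for `projects.csv`. Reading in chunks keeps memory use flat even for multi-gigabyte CSVs.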
Went to download the dataset and when I saw it was 5.5G I nearly opened a duplicate of this issue before finding this one. It seems like a perfect fit! @andrew is there anything anyone could do to help out with this? Seems to mostly come down to you running
edit: Hilariously, after downloading the dump, I don't actually have enough free disk space to unpack it 😊
I spun up a little droplet on DigitalOcean (with a dedicated volume for this, since it's too large for the on-device disk) and created, shared, and published 3 separate dats to choose from in case it helps get things rolling:
Uncompressed csv files:
gzip compressed csv files:
xz compressed csv files:
@joehand I couldn't find any sort of compression built in to dat. Is there a preferred/recommended way of dealing with large dats? I imagine most CSVs compress pretty well given the nature of the format.
@andrew if you want to
@millette did you stop sharing your dat? I haven't been able to clone it.
edit: I've deleted the above-mentioned droplet and the dats no longer have any backing.
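On the guess that CSVs compress well, a rough way to compare gzip and xz is to run both over the same bytes with Python's standard `gzip` and `lzma` modules. The sketch below uses synthetic, highly repetitive rows that are made up for illustration, so the ratios it prints say nothing about the actual libraries.io dump; the helper name is also hypothetical.

```python
import gzip
import lzma


def compressed_sizes(data):
    """Return the raw, gzip, and xz sizes in bytes for one blob of data."""
    return {
        "raw": len(data),
        "gzip": len(gzip.compress(data)),
        "xz": len(lzma.compress(data, format=lzma.FORMAT_XZ)),
    }


# Synthetic, repetitive CSV rows; NOT the libraries.io data.
rows = ["id,platform,name,license"]
rows += [f"{i},npm,package-{i},MIT" for i in range(50_000)]
sample = "\n".join(rows).encode("utf-8")

for label, size in compressed_sizes(sample).items():
    print(f"{label:>4}: {size:>9} bytes ({size / len(sample):.1%} of raw)")
```

To get numbers that mean something, replace `sample` with a slice read from one of the real CSV files.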
Oh, the machine hosting my dat was rebooted, I just restarted sharing it. Sorry.
@millette thanks. I was able to clone and verify that our gz files have the same shasums. I honestly expected there to be a difference due to timestamps or something, but I guess anything like that was either preserved in the original zip file or not part of the default gz header. If nothing else, we now each know each other to be equivalently nefarious?
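For context on the timestamp question: the gzip header carries a modification-time field, and the `gzip` command-line tool by default stores the original file's name and timestamp there. Two people gzipping files extracted from the same zip with timestamps preserved could therefore plausibly end up with byte-identical output. The Python sketch below illustrates only the header effect in isolation, using made-up sample bytes and hypothetical helper names; it is not a reconstruction of how either dat was built.

```python
import gzip
import hashlib
import io


def gzip_bytes(data, mtime):
    """Gzip data in memory with an explicit header timestamp and no file name."""
    buffer = io.BytesIO()
    with gzip.GzipFile(fileobj=buffer, mode="wb", mtime=mtime) as gz:
        gz.write(data)
    return buffer.getvalue()


def sha256_hex(data):
    """Return the hex SHA-256 digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


csv_bytes = b"id,name\n1,dat\n2,libraries.io\n"  # made-up sample content

same_a = gzip_bytes(csv_bytes, mtime=1500000000)
same_b = gzip_bytes(csv_bytes, mtime=1500000000)
other = gzip_bytes(csv_bytes, mtime=1500000001)

print(sha256_hex(same_a) == sha256_hex(same_b))  # same input, same timestamp
print(sha256_hex(same_a) == sha256_hex(other))   # only the header timestamp differs
print(gzip.decompress(other) == csv_bytes)       # contents decompress identically
```

If the compressed checksums ever do differ, comparing checksums of the decompressed contents shows whether the data itself still matches.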
@rmg no recommended way yet. We're planning on adding automatic compression for transport, but that isn't implemented yet. Depending on the use case, it may be useful for them to be uncompressed. But until we add automatic compression that may make it too slow.
I removed the dat I was hosting and updated the OP. Feel free to recreate it of course.
Moving this to the Backlog as we'd still like to implement it but can't see that happening in the near future.
Add dat://49bd045de3beb9abcb7272967e2fb16e07b96c06e15cd814f703e8581d4561e5 to https://libraries.io/data as a mirror.
Note that instead of a zip, the dat holds each file, and the csv files were gzipped to save space.
The content is available: https://datproject.org/view?query=49bd045de3beb9abcb7272967e2fb16e07b96c06e15cd814f703e8581d4561e5
Since there's no more zip, it's probably a good idea to provide an md5 for every file (both gzipped and not).
EDIT: Updated https://datproject.org link above.
EDIT#2: Dat removed on my end.