Hash verification/regeneration APIs #5867

Closed
donsizemore opened this issue May 21, 2019 · 8 comments

@donsizemore
Contributor

In comparing our Postgres checksumvalues to the checksums contained in our iRODS preservation instances, I found that the majority of our file metadata, having been imported into ~4.6 back in 2016, have empty checksumvalue fields.

I could write some Python to manually calculate and correct the blank entries, but I'd love a checksum counterpart to, say, the datafile integrity or dataset integrity endpoints: one to populate missing checksumvalues, and another to verify existing checksumvalues against the filesystem.
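For illustration, the "populate only the blanks" half could look roughly like the sketch below. It is not Dataverse code: `records` is a hypothetical list of dicts standing in for rows exported from Postgres, `storage_root` is an assumed directory holding the files, and MD5 is assumed as the algorithm.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="md5", chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so large files are never fully loaded."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def populate_missing(records, storage_root, algorithm="md5"):
    """Fill in only blank checksums; existing values are left untouched.

    `records` is a hypothetical list of dicts with "path" and "checksumvalue"
    keys (an assumption for this sketch, not the real table layout).
    Returns the paths whose checksum was filled in.
    """
    updated = []
    for record in records:
        if record.get("checksumvalue"):  # already populated: do not regenerate
            continue
        full_path = Path(storage_root) / record["path"]
        record["checksumvalue"] = file_checksum(full_path, algorithm)
        updated.append(record["path"])
    return updated
```

Skipping populated rows is the point: a value recorded at upload time stays as the reference, and only NULL or empty entries get a freshly computed checksum.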

@donsizemore
Contributor Author

donsizemore commented May 21, 2019

Akio found updateHashValues in the code, but this will regenerate ALL hash values. An API endpoint to verify existing checksums and populate empty checksum fields would be a home run. Red Sox land seems like a good place to ask for those ;)

@pdurbin
Member

pdurbin commented May 21, 2019

Ah, I see what you mean. updateHashValues is documented at http://guides.dataverse.org/en/4.14/installation/config.html#filefixitychecksumalgorithm and it looks like it was added by @qqmyers in pull request #5035

Over at #4131 (comment) I see that @landreev wrote "Recalculated and added the missing MD5s."

@donsizemore it sounds like you want a little more control over the process, maybe updating one file at a time or all the files that don't have a checksum. And, like you said, a read-only "tell me if the checksum still matches" API endpoint.

@donsizemore
Contributor Author

@pdurbin not so much about control as integrity: checksums are generated at upload, and to test for, say, bitrot on our Dataverse storage, we wouldn't want to regenerate existing checksums.
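A read-only bitrot check along these lines might be sketched as follows. The function name and the three status strings are made up for this example, and MD5 is assumed as the default algorithm; the stored value is only compared, never rewritten.

```python
import hashlib


def verify_checksum(path, stored_checksum, algorithm="md5"):
    """Return "missing", "match", or "mismatch" without modifying anything."""
    if not stored_checksum:
        return "missing"  # nothing recorded at upload, so nothing to compare
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks to keep memory use flat for large files.
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    if digest.hexdigest() == stored_checksum.lower():
        return "match"
    return "mismatch"
```

A "mismatch" result means the bytes on disk no longer hash to the value recorded at upload, which is the signal a bitrot check is looking for.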

@qqmyers
Member

qqmyers commented May 21, 2019

FWIW: #5035 didn't expose a separate verify endpoint, but it does verify the existing hash before calculating one with a different algorithm - it should be possible to reuse that code in a verifyHashes endpoint...
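The verify-before-rehash idea can be sketched generically as below. This is not the code from #5035, just an assumed illustration of the pattern; `rehash_if_valid` and its parameters are hypothetical names.

```python
import hashlib


def rehash_if_valid(path, stored_checksum, old_algorithm, new_algorithm):
    """Compute a new-algorithm hash only if the stored hash still matches.

    Both digests are computed in a single pass over the file.
    Returns the new hex digest, or None when the stored hash does not match.
    """
    old_digest = hashlib.new(old_algorithm)
    new_digest = hashlib.new(new_algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            old_digest.update(chunk)
            new_digest.update(chunk)
    if old_digest.hexdigest() != stored_checksum.lower():
        return None  # file no longer matches its record: do not overwrite it
    return new_digest.hexdigest()
```

The verification step on its own is what a separate verify endpoint would need: it is the same comparison, minus the second digest.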

@djbrooke
Contributor

djbrooke commented Aug 7, 2019

@donsizemore @pdurbin and @qqmyers, thanks for the discussion here.

Just so I understand, a deliverable here would be an API endpoint that could be passed a specific file in order to verify the hash and another API endpoint that could be passed a specific file for which to regenerate a hash?

@donsizemore
Contributor Author

@djbrooke that would be my optimal scenario: one endpoint to verify a match or report a mismatch, another endpoint to regenerate and/or populate a NULL

@djbrooke djbrooke changed the title from "best way to correct empty checksumvalues" to "Hash verification/regeneration APIs" Aug 7, 2019
@djbrooke
Contributor

djbrooke commented Aug 7, 2019

Thanks, I retitled this. I'm OK if we split this into two or if we deliver these both together.

@djbrooke djbrooke self-assigned this Aug 14, 2019
@djbrooke djbrooke removed their assignment Aug 14, 2019
@sekmiller sekmiller self-assigned this Sep 23, 2019
sekmiller added a commit that referenced this issue Sep 30, 2019
@pdurbin pdurbin added this to the 4.17 milestone Oct 12, 2019
@pdurbin
Member

pdurbin commented Oct 17, 2019

This was delivered in pull request #6228 as part of Dataverse 4.17 and is documented at http://guides.dataverse.org/en/4.17/api/native-api.html#id15

I suspect it says "id15" instead of a normal anchor because of a conflict with this older anchor: http://guides.dataverse.org/en/4.17/api/native-api.html#datafile-integrity

I guess I'll close this but it would be nice to fix up that anchor at some point.

@pdurbin pdurbin closed this as completed Oct 17, 2019