ZMV parts_count doesn't match zip_parts row count for fr360kt0172 #1194
Comments
I think we need to find the extraneous zip_parts row and delete it to clear this error (and verify that there should, in fact, be only 1 .zip file for this druid-version).
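A rough Rails-console sketch of that lookup. The model and association names here (PreservedObject, ZippedMoabVersion, ZipPart, zip_endpoint) are inferred from the audit error message, not verified against the actual schema, and parts_count may live on the ZMV or on each ZipPart row:

```ruby
# Hypothetical console sketch -- names are guesses from the audit message.
po = PreservedObject.find_by(druid: 'fr360kt0172')

po.zipped_moab_versions.each do |zmv|
  rows   = zmv.zip_parts.count
  stated = zmv.zip_parts.pluck(:parts_count).uniq # or zmv.parts_count, per the real schema
  puts "v#{zmv.version} on #{zmv.zip_endpoint.endpoint_name}: stated parts_count=#{stated.inspect}, zip_parts rows=#{rows}"
end
```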
4 versions, each is < 10G, so we'd expect 1 zip part per version.
There are 4 files for this druid on AWS west 2, one per version, as expected.
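That S3-side check could look roughly like this with the aws-sdk-s3 gem; the bucket name and druid-tree key prefix are placeholders, not the app's actual configuration:

```ruby
require 'aws-sdk-s3'

# List whatever zips exist for this druid on the AWS us-west-2 endpoint.
s3   = Aws::S3::Client.new(region: 'us-west-2')
resp = s3.list_objects_v2(bucket: 'example-preservation-bucket',
                          prefix: 'fr/360/kt/0172/')
resp.contents.each { |obj| puts "#{obj.key}  #{obj.size} bytes" }
```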
There are four ZMVs for this druid, as expected, on endpoint 1 (AWS-west).
Stepping through each of those ZMVs to find the ZipPart with a problem eventually leads to this:
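Something like the following, continuing the console sleuthing above; the attribute names (status, size, suffix) are assumptions based on the discussion:

```ruby
po  = PreservedObject.find_by(druid: 'fr360kt0172')

# Pull the suspect version-1 ZMV on the AWS west endpoint and look at its parts.
zmv = po.zipped_moab_versions
        .joins(:zip_endpoint)
        .find_by(version: 1, zip_endpoints: { endpoint_name: 'aws_s3_west_2' })

zmv.zip_parts.each do |part|
  puts "zip_part id=#{part.id} suffix=#{part.suffix} status=#{part.status} size=#{part.size}"
end
```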
One is OK, and one is un-replicated. There should only be one record, but which one is the right one? You'd think it was the one with the status of "OK", but check out the file size. We know from looking at the filesystem that version 1 of this druid should be around 1.6GB, not 840MB. What did we upload?
Huh. We uploaded 840MB. What does our IBM copy look like?
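A sketch of comparing the object size on AWS against the IBM (S3-compatible) copy; bucket names, the IBM endpoint URL, and the key layout are placeholders, not real configuration:

```ruby
require 'aws-sdk-s3'

key = 'fr/360/kt/0172/fr360kt0172.v0001.zip' # assumed key layout

aws = Aws::S3::Client.new(region: 'us-west-2')
ibm = Aws::S3::Client.new(region: 'us-south',
                          endpoint: 'https://example-ibm-cos-endpoint')

puts "AWS size: #{aws.head_object(bucket: 'example-aws-bucket', key: key).content_length}"
puts "IBM size: #{ibm.head_object(bucket: 'example-ibm-bucket', key: key).content_length}"
```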
IBM is correct. So the fix here is to delete the object from AWS, delete the version 1 ZMV for AWS, and re-process. But the bug here is: how the heck did Zipmaker manage to create a partial zip and think it was done? And then upload it? Did the zip binary get killed partway through zip creation, and then plexer and delivery just ran with it?
Get the ZMV we need to kill; we already know the ID from our investigations above.
Kill it with fire.
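"Kill it with fire", sketched: destroy the bad version-1 ZMV (and its zip_parts rows) and remove the partial zip from AWS. The record id, bucket, key, and dependent-destroy behavior are all assumptions to double-check before running anything like this:

```ruby
require 'aws-sdk-s3'

bad_zmv = ZippedMoabVersion.find(1234) # placeholder id, noted during the sleuthing above

bad_zmv.zip_parts.destroy_all
bad_zmv.destroy

# Remove the partial zip so the re-created one can replace it cleanly.
Aws::S3::Client.new(region: 'us-west-2')
               .delete_object(bucket: 'example-aws-bucket',
                              key: 'fr/360/kt/0172/fr360kt0172.v0001.zip')
```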
Now re-process the druid to re-create missing ZMVs.
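Re-queueing might look something like this; the actual entry point (job class and arguments, or a higher-level backfill method) should be confirmed against the jobs README linked later in this thread, so treat this signature as a guess:

```ruby
# Hypothetical: re-create and re-deliver the zip for version 1 of this druid.
ZipmakerJob.perform_later('fr360kt0172', 1)
```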
And check our work. The new ZMV is there:
Verify that zipmaker made a new zip of the correct size and that the upload completed, then delete the old (incorrect) metadata so it'll get refreshed.
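A verification sketch along those lines, confirming the re-created ZMV has an OK zip_part whose recorded size matches the object now on S3 (roughly 1.6GB expected); same assumed names, bucket, and key as above:

```ruby
require 'aws-sdk-s3'

po  = PreservedObject.find_by(druid: 'fr360kt0172')
zmv = po.zipped_moab_versions
        .joins(:zip_endpoint)
        .find_by(version: 1, zip_endpoints: { endpoint_name: 'aws_s3_west_2' })

part    = zmv.zip_parts.first
s3_size = Aws::S3::Client.new(region: 'us-west-2')
                         .head_object(bucket: 'example-aws-bucket',
                                      key: 'fr/360/kt/0172/fr360kt0172.v0001.zip')
                         .content_length

puts "db: status=#{part.status} size=#{part.size}; s3: #{s3_size} bytes (expect ~1.6GB)"
```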
And we're done! I hope.
Heads-up ☝️, upcoming first-responders @peetucket @jermnelson
that seems plausible. or zip hiccuped, produced a bad zip, and still gave a good exit code. presumably, if you run zip enough, you'll get some bad ones. two thoughts for possible checks:
i believe both those changes would live in zipmaker. though i could see audits that looked at the size on disk and compared it to the reported zip size in the cloud.
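One way that kind of size audit could look, sketched only (this is not the app's actual audit code; names and parameters are illustrative):

```ruby
require 'aws-sdk-s3'

# Returns true when the local zip and the cloud copy report the same byte count.
def sizes_match?(local_zip_path, bucket:, key:, s3_client:)
  local  = File.size(local_zip_path)
  remote = s3_client.head_object(bucket: bucket, key: key).content_length
  warn "size mismatch for #{key}: local=#{local} remote=#{remote}" unless local == remote
  local == remote
end
```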
Diagram of the process for any new developers joining this ticket: https://github.com/sul-dlss/preservation_catalog/blob/master/app/jobs/README.md
looked into the HB alerts @mjgiarlo had just started investigating, pairing w/ @peetucket on friday. in the course of that, i started to suspect that those alerts indicated a different issue from what's described in this ticket. filed #1197 for that, including the details i turned up working with peter on friday.
@jmartin-sul @julianmorley I'm not quite sure what to do about this one, both in terms of reproducing it and in terms of remediating this going forward. Are there any tendrils of the busted object lying around? AFAICT, if the zip command failed partway through, an alert should have been raised around here:
Do we want to add a check below that line that validates each zip file (e.g., using …)?
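The specific tool was cut off in this copy of the comment, but a post-creation validation of the kind being proposed could be sketched with the standard `unzip -t` integrity test (just one option):

```ruby
# Returns true when every entry in the zip passes its CRC check.
def zip_intact?(zip_path)
  # `unzip -tq` tests the archive quietly and exits nonzero on any error.
  system('unzip', '-tq', zip_path)
end

# e.g., right after zipmaker writes the file:
# raise "zipmaker produced a bad zip: #{zip_path}" unless zip_intact?(zip_path)
```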
i filed #1302 as an immediately actionable sanity check that should allow us to catch incorrectly created zip files for moab versions. i'm leaving this ticket open for now, since there may be further investigation to do, and since there's another possible check described in #1194 (comment) which might be worth implementing (random verification of generated zip files). if @julianmorley thinks this can be closed (e.g. because the problem occurs rarely and because #1302 might catch what occurrences we do see), i'd be fine with closing it.
i just put this in the ready column on the zenhub board for the Q2 2020 maintenance WC. first thing to do would be to confirm that there is still a problem with the state of what's replicated, by checking to see whether all expected zip parts are present in the database and on the cloud endpoints. some helpful docs for that sort of sleuthing:
Verified that all versions show as replicated in the database and that the files exist on the cloud endpoints. Closing.
https://app.honeybadger.io/projects/54415/faults/39235659
PartReplicationAuditJob(fr360kt0172, services-disk16) 1 on aws_s3_west_2: ZippedMoabVersion stated parts count (1) doesn't match actual number of zip parts rows (2)
EDIT: The bug here is that PresCat apparently started to create the zip file for version 1 of this druid, but failed, leaving a zip file that was only 800MB in size instead of 1.6GB. That didn't stop subsequent processes from uploading the incomplete version zip to AWS and recording it as 'done'.