sb430hm9241, bk258yz9519, gx410cs0527: not all ZippedMoabVersion parts are replicated yet #1197
Comments
The 'good' news is that sb430hm9241 is on AWS:
And the size of that zip on AWS matches what PresCat thinks it should be:
So it's there, and has been there since May. Looks like maybe recorder didn't correctly record success?
hmm... so maybe there's some status updating to be done by audit, to make objects go into the ok state if they exist in the cloud as expected? that might require some storytime-ish eng design, because i think we want to do the sort of size check (at least) that you did manually here (as opposed to just seeing it out there at all, and updating to ok).
Yup, it definitely looks like there's an audit opportunity there. Is it there? Does the size reported by S3 match what's in the database? Does the parts_count value match what's in the database? Are there actually that many parts on S3? etc.
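A rough sketch of what those checks could look like with the aws-sdk-s3 gem is below; the `s3_key`, `size`, and `parts_count` accessors on the zip part record, the bucket name, and the key prefix are assumptions for illustration, not PresCat's confirmed schema:

```ruby
require 'aws-sdk-s3'

# Sketch only: `zip_part` is assumed to respond to #s3_key, #size, and #parts_count;
# those names are illustrative, not verified preservation_catalog attributes.
def audit_zip_part(zip_part, bucket:, s3: Aws::S3::Client.new)
  head = s3.head_object(bucket: bucket, key: zip_part.s3_key)
  {
    exists_on_s3: true,
    size_matches: head.content_length == zip_part.size # size reported by S3 vs. what's in the DB
  }
rescue Aws::S3::Errors::NotFound
  { exists_on_s3: false, size_matches: false }
end

# Are there actually that many parts on S3? Assumes all parts of one zip share a
# common key prefix (e.g. ".../sb430hm9241.v0001.zip" with .zip/.z01/... suffixes).
def parts_count_matches?(prefix, expected_count, bucket:, s3: Aws::S3::Client.new)
  s3.list_objects_v2(bucket: bucket, prefix: prefix).contents.size == expected_count
end
```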
See also https://app.honeybadger.io/projects/54415/faults/52770216 for rm853tx9183
the first thing to do is to see what we have noted in pres cat's database, and what we have archived in the cloud at the moment (and thus, whether this is still an issue). here are some instructions to get started when looking at replication problems: https://github.com/sul-dlss/preservation_catalog/wiki/Investigating-a-druid-with-replication-errors

if the moabs are fully replicated to all cloud endpoints, we may just have some database cleanup to do. if everything is replicated and the database looks good, we should be able to close this without further action.

druids to check, from the description:
Verified parts were replicated and match size in database, updated db for status.
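For reference, that kind of status cleanup might look roughly like the Rails console sketch below; the model names, associations, and status values are assumptions based on the discussion (only the 'unreplicated' and 'ok' statuses are mentioned above), not a verified interface:

```ruby
# Rails console sketch with assumed model/association names; only flip a part to 'ok'
# after its presence and size on the endpoint have actually been verified, as above.
%w[sb430hm9241 bk258yz9519 gx410cs0527].each do |druid|
  PreservedObject.find_by(druid: druid)
                 .zipped_moab_versions
                 .flat_map(&:zip_parts)
                 .select { |part| part.status == 'unreplicated' }
                 .each { |part| part.update!(status: 'ok') }
end
```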
e.g.:

- sb430hm9241
- bk258yz9519
- gx410cs0527

this error has come up repeatedly, but so far we haven't really looked into it, afaik (assuming that this was a result of audit running before upload had finished, and that things would naturally catch up?).
i began looking into this on friday, with @peetucket, picking up investigation @mjgiarlo started earlier in the week as first responder. @mjgiarlo referenced #1194, which i took to mean him thinking that the error in this issue's title was a manifestation of the problem in #1194. reading back over the slack chat, i'm not 100% sure if that's what he was saying, or if he was just pointing out that 1194 shows "that a lack of failed jobs in resque should not reassure us" that upload was successful. regardless, i think this is a different issue, and a problem with a zip part being entirely unreplicated, as opposed to a zip part sending a partial/corrupt zip to an endpoint (as was seen in #1194).
for example, when looking in preservation catalog's DB for info about sb430hm9241, i noticed:

note above that the listed size is the same for both zip parts, that the parts are both pretty small, and that there's one listed for each endpoint, as expected. but... the AWS one is listed as unreplicated. i couldn't easily find credentials and info on friday for querying AWS in production, so i'd be interested to pair w/ @julianmorley on that. my next instinct would be to see what we actually have up in amazon.

queries for the other druids listed at the top of the issue returned results of a similar character to the details above.
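(a rough sketch of the kind of console query meant here, with the actual output omitted above; the model and association names are assumptions, not a verified API:)

```ruby
# rails console sketch: what prescat's DB records for one druid's zip parts, per endpoint.
# PreservedObject / ZippedMoabVersion / ZipPart and their associations are assumed names here.
po = PreservedObject.find_by(druid: 'sb430hm9241')
po.zipped_moab_versions.each do |zmv|
  zmv.zip_parts.each do |part|
    puts [zmv.zip_endpoint.endpoint_name, zmv.version,
          part.suffix, part.size, part.parts_count, part.status].join("\t")
  end
end
```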
some other notes:
TL;DR: