sb430hm9241, bk258yz9519, gx410cs0527: not all ZippedMoabVersion parts are replicated yet #1197

Closed
jmartin-sul opened this issue Aug 19, 2019 · 6 comments
Labels: replication_failure (failure to replicate specific object(s), whether due to cloud provider hiccup or bug in our code)

@jmartin-sul (Member)

e.g.

PartReplicationAuditJob(sb430hm9241, services-disk18) 1 on aws_s3_west_2: not all ZippedMoabVersion parts are replicated yet: [#<ZipPart id: 13496405, size: 177992078, zipped_moab_version_id: 13378840, created_at: "2019-05-09 21:44:15", updated_at: "2019-05-09 21:44:15", md5: "595f4a6c63537ee2c1c4f06bf02a787c", create_info: "{:zip_cmd=>\"zip -r0X -s 10g /sdr-transfers/sb/430/...", parts_count: 1, suffix: ".zip", status: "unreplicated", last_existence_check: nil, last_checksum_validation: nil>]
PartReplicationAuditJob(bk258yz9519, services-disk17) 1 on aws_s3_west_2: not all ZippedMoabVersion parts are replicated yet: ...
PartReplicationAuditJob(gx410cs0527, services-disk16) 1 on aws_s3_west_2: not all ZippedMoabVersion parts are replicated yet: ...

this error has come up repeatedly, but so far we haven't really looked into it, afaik (assuming that this was a result of audit running before upload had finished, and that things would naturally catch up?).

i began looking into this on friday, with @peetucket, picking up the investigation @mjgiarlo started earlier in the week as first responder. @mjgiarlo referenced #1194, which i took to mean he thought the error in this issue's title was a manifestation of the problem in #1194. reading back over the slack chat, i'm not 100% sure if that's what he was saying, or if he was just pointing out that #1194 shows "that a lack of failed jobs in resque should not reassure us" that upload was successful. regardless, i think this is a different issue: a zip part that was never replicated at all, as opposed to a partial/corrupt zip being sent to an endpoint (as was seen in #1194).

for example, when looking in preservation catalog's DB for info about sb430hm9241, i noticed:

> ZipPart.joins({ zipped_moab_version: [:preserved_object, :zip_endpoint] }).where(preserved_objects: { druid: 'sb430hm9241'}).select('zip_parts.*, zip_endpoints.endpoint_name as endpoint_name, preserved_objects.druid as druid, zipped_moab_versions.version as version').map { |zp| [zp.druid, zp.version, zp, zp.endpoint_name] }
=> [["sb430hm9241",
  1,
  #<ZipPart:0x00000000066f3b40
   id: 13496404,
   size: 177992078,
   zipped_moab_version_id: 13378841,
   created_at: Thu, 09 May 2019 21:44:15 UTC +00:00,
   updated_at: Thu, 09 May 2019 21:44:21 UTC +00:00,
   md5: "595f4a6c63537ee2c1c4f06bf02a787c",
   create_info:
    "{:zip_cmd=>\"zip -r0X -s 10g /sdr-transfers/sb/430/hm/9241/sb430hm9241.v0001.zip sb430hm9241/v0001\", :zip_version=>\"Zip 3.0 (July 5th 2008)\"}",
   parts_count: 1,
   suffix: ".zip",
   status: "ok",
   last_existence_check: nil,
   last_checksum_validation: nil>,
  "ibm_us_south"],
 ["sb430hm9241",
  1,
  #<ZipPart:0x00000000066f39b0
   id: 13496405,
   size: 177992078,
   zipped_moab_version_id: 13378840,
   created_at: Thu, 09 May 2019 21:44:15 UTC +00:00,
   updated_at: Thu, 09 May 2019 21:44:15 UTC +00:00,
   md5: "595f4a6c63537ee2c1c4f06bf02a787c",
   create_info:
    "{:zip_cmd=>\"zip -r0X -s 10g /sdr-transfers/sb/430/hm/9241/sb430hm9241.v0001.zip sb430hm9241/v0001\", :zip_version=>\"Zip 3.0 (July 5th 2008)\"}",
   parts_count: 1,
   suffix: ".zip",
   status: "unreplicated",
   last_existence_check: nil,
   last_checksum_validation: nil>,
  "aws_s3_west_2"]]

note above that the listed size is the same for both zip parts, that the parts are both pretty small, and that there's one listed for each endpoint, as expected. but... the AWS one is listed as unreplicated. i couldn't easily find credentials and info on friday for querying AWS in production, so i'd be interested to pair w/ @julianmorley on that. my next instinct would be to see what we actually have up in amazon (a sketch of that sort of spot check is below).

queries for the other druids listed at the top of the issue returned results of a similar character to the details above.
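for reference, the kind of spot check i have in mind would look something like this (a minimal sketch assuming the aws-sdk-s3 gem and that we can dig up the production credentials; the bucket name and region below are placeholders, not pulled from any real config):

require 'aws-sdk-s3'

# placeholder bucket/region, not the real production values
s3 = Aws::S3::Client.new(region: 'us-west-2')
key = 'sb/430/hm/9241/sb430hm9241.v0001.zip'
begin
  resp = s3.head_object(bucket: 'example-sdr-aws-archive-bucket', key: key)
  # compare what the endpoint reports against the catalog's ZipPart row
  puts "found #{key}: #{resp.content_length} bytes, md5 metadata #{resp.metadata['checksum_md5']}"
rescue Aws::S3::Errors::NotFound
  puts "#{key} is not on the endpoint"
end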

some other notes:

  • per @mjgiarlo, audit alerts for this issue have "occurred ~1.5M times in 11mos." (and "83 times in the past 11 days").
  • all honeybadger alerts were automatically resolved without comment by the deployments done for dependency update mondays, because HB auto-resolves all outstanding alerts for an app when a deployment is made. is this behavior tweakable? do others find this default sensible? i've never really understood the assumption that a deployment will fix all outstanding issues, but 🤷‍♀
  • this is definitely worrisome, and i'd advocate resourcing investigation sooner rather than later (both for impacted druids, and for figuring out the root cause of the failure to properly create and upload the zips). but this is at least not a terribly widespread problem. e.g.:
> ZipPart.unreplicated.count * 1.0 / ZipPart.count
=> 0.0006588031424072938
# so about 0.066% of zip parts are seen by the catalog as unreplicated.  would be interesting to get this by size or by number of partially replicated druids (sketched after these notes), though even this crude measure is reassuring
  • more worrisome, we have unreplicated zips spanning the entirety of the time we've been shipping zips to the cloud. e.g.:
> ZipPart.unreplicated.minimum(:created_at)
=> 2018-08-03 06:26:33 UTC
> ZipPart.unreplicated.maximum(:created_at)
=> 2019-08-15 08:14:13 UTC
# ran both queries on friday
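for completeness, the follow-up measurements mentioned in the notes above might look something like this in the rails console (a sketch i haven't run, using the same models/associations as the queries earlier):

# share of unreplicated zip parts by total byte size, rather than by row count
ZipPart.unreplicated.sum(:size) * 1.0 / ZipPart.sum(:size)

# number of distinct druids with at least one unreplicated zip part
ZipPart.unreplicated
       .joins(zipped_moab_version: :preserved_object)
       .distinct
       .count('preserved_objects.druid')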

TL;DR:

  • my current hunch is that this is a distinct issue from #1194 ("ZMV parts_count doesn't match zip_parts row count for fr360kt0172")
  • this seems to be an even more common problem, and we should probably resource investigation of it so that we can have more confidence in our cloud archived zip copies.
    • but crude measurement indicates this is a problem with < 1% of archive copies, at least by zip part count.
@julianmorley (Member)

The 'good' news is that sb430hm9241 is on AWS:

[dlss-jmorley:/data/sdr/scripts ]$ ./list_druid_keys.sh sb430hm9241
sb/430/hm/9241/sb430hm9241.v0001.zip 

And the size of that zip on AWS matches what PresCat thinks it should be:

[dlss-jmorley:/data/sdr/scripts ]$ cat /data/sdr/archives/md/sul-sdr-aws-us-west-2-archive/sb/430/hm/9241/sb430hm9241.v0001.zip
{
    "AcceptRanges": "bytes", 
    "ContentType": "", 
    "LastModified": "Thu, 09 May 2019 21:44:17 GMT", 
    "ContentLength": 177992078, 
    "ETag": "\"450ef0d4dd7a68ae23e6876ed7bfc9c9-34\"", 
    "StorageClass": "DEEP_ARCHIVE", 
    "Metadata": {
        "size": "177992078", 
        "parts_count": "1", 
        "zip_version": "Zip 3.0 (July 5th 2008)", 
        "checksum_md5": "595f4a6c63537ee2c1c4f06bf02a787c", 
        "zip_cmd": "zip -r0X -s 10g /sdr-transfers/sb/430/hm/9241/sb430hm9241.v0001.zip sb430hm9241/v0001"
    }
}

So it's there, and has been there since May. Looks like maybe the recorder didn't correctly record success?

@jmartin-sul (Member Author) commented Aug 22, 2019

hmm... so maybe there's some status updating to be done by audit, to make objects go into the ok state if they exist in the cloud as expected? that might require some storytime-ish eng design, because i think we want to do the sort of size check (at least) that you did manually here (as opposed to just seeing it out there at all, and updating to ok).
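as a rough sketch of what i mean (made-up method and helper names, not actual pres cat code; assumes an aws-sdk-s3 client and the right bucket are already wired up):

# rough sketch only: given a ZipPart the catalog thinks is unreplicated, check the
# cloud copy and only flip the status to ok if (at least) the recorded size matches.
def remediate_unreplicated_part(zip_part, s3_client, bucket)
  key = zip_part.s3_key # assumes a helper that builds keys like 'sb/430/hm/9241/sb430hm9241.v0001.zip'
  resp = s3_client.head_object(bucket: bucket, key: key)
  zip_part.update(status: 'ok') if resp.content_length == zip_part.size
rescue Aws::S3::Errors::NotFound
  nil # really missing from the endpoint; leave it unreplicated
end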

@julianmorley (Member)

Yup, it definitely looks like there's an audit opportunity there. Is it there? Does the size reported by S3 match what's in the database? Does the parts_count value match what's in the database? Are there actually that many parts on S3? etc.
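A rough sketch of those checks, with made-up names (not actual PresCat code; s3_key is assumed to be a helper that builds the keys shown above, and the prefix derivation is a guess at the key layout):

# Rough sketch: compare the catalog's expectations against what S3 reports.
def parts_consistent?(zipped_moab_version, s3_client, bucket)
  parts = zipped_moab_version.zip_parts
  expected = parts.first.parts_count

  # parts_count recorded in the S3 object metadata (see the head output above)
  meta_count = s3_client.head_object(bucket: bucket, key: parts.first.s3_key)
                        .metadata['parts_count'].to_i

  # number of part keys actually present under this version's prefix
  # (keys look like ...v0001.zip, ...v0001.z01, ...)
  prefix = parts.first.s3_key.sub(/\.z[^.]*\z/, '')
  on_s3 = s3_client.list_objects_v2(bucket: bucket, prefix: prefix).contents.size

  [parts.count, meta_count, on_s3].all? { |n| n == expected }
end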

@ndushay (Contributor) commented Jan 28, 2020

See also https://app.honeybadger.io/projects/54415/faults/52770216 for rm853tx9183

@aaron-collier removed their assignment Apr 16, 2020
@jmartin-sul (Member Author) commented May 27, 2020

the first thing to do is to see what we have noted in pres cat's database, and what we have archived in the cloud at the moment (and thus, whether this is still an issue). here are some instructions to get started when looking at replication problems: https://github.com/sul-dlss/preservation_catalog/wiki/Investigating-a-druid-with-replication-errors

if the moabs are fully replicated to all cloud endpoints, we may just have some database cleanup to do. if everything is replicated and the database looks good, we should be able to close this without further action.

druids to check, from the description (a quick console check is sketched after this list):

  • sb430hm9241
  • bk258yz9519
  • gx410cs0527
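a quick console check along the lines of the query in the issue description might look like this (same associations as above; just prints a count per endpoint/status for each druid):

%w[sb430hm9241 bk258yz9519 gx410cs0527].each do |druid|
  ZipPart.joins(zipped_moab_version: [:preserved_object, :zip_endpoint])
         .where(preserved_objects: { druid: druid })
         .group('zip_endpoints.endpoint_name', :status)
         .count
         .each { |(endpoint, status), n| puts "#{druid} #{endpoint} #{status}: #{n}" }
end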

@jmartin-sul changed the title from "not all ZippedMoabVersion parts are replicated yet" to "sb430hm9241, bk258yz9519, gx410cs0527: not all ZippedMoabVersion parts are replicated yet" May 27, 2020
@jmartin-sul added the "catalog to archive" and "replication_failure" labels May 27, 2020
@aaron-collier self-assigned this Jun 2, 2020
@aaron-collier (Contributor)

Verified the parts were replicated and match the size in the database; updated the db status accordingly.
