ZMV parts_count doesn't match zip_parts row count for fr360kt0172 #1194

Closed · julianmorley opened this issue Aug 7, 2019 · 13 comments

Labels: bug · replication (replication related questions or issues) · replication_failure (failure to replicate specific object(s), whether due to cloud provider hiccup or bug in our code)

@julianmorley (Member) commented Aug 7, 2019:

https://app.honeybadger.io/projects/54415/faults/39235659

PartReplicationAuditJob(fr360kt0172, services-disk16) 1 on aws_s3_west_2: ZippedMoabVersion stated parts count (1) doesn't match actual number of zip parts rows (2)

EDIT: The bug here is that PresCat apparently started to create the zip file for version 1 of this druid but failed, leaving a zip file that was only ~800MB instead of the expected ~1.6GB. That didn't stop subsequent processes from uploading the incomplete version zip to AWS and recording it as 'done'.

@julianmorley added the bug and replication labels on Aug 7, 2019
@julianmorley (Member Author) commented:

I think we need to find the extraneous zip_parts row and delete it to clear this error (and verify that there should, in fact, be only 1 .zip file for this druid-version).
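
One way to narrow that down from a Rails console. This is a rough sketch reusing the models queried later in this thread; the loop and the output formatting are mine, not existing audit code:

# For each ZMV for this druid on the AWS west endpoint (zip_endpoint_id 1),
# compare the parts_count stated on its ZipPart rows to the number of rows present.
ZippedMoabVersion.by_druid('fr360kt0172').where(zip_endpoint_id: 1).each do |zmv|
  parts = ZipPart.where(zipped_moab_version_id: zmv.id)
  puts "ZMV #{zmv.id} (v#{zmv.version}): #{parts.count} part row(s), stated parts_count #{parts.map(&:parts_count).uniq.inspect}"
end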

@julianmorley (Member Author) commented:

4 versions, each is < 10G, so we'd expect 1 zip part per version.

[dlss-jmorley:/data/sdr/scripts ]$ du -csh /services-disk16/sdr2objects/fr/360/kt/0172/fr360kt0172/v000*
1.6G	/services-disk16/sdr2objects/fr/360/kt/0172/fr360kt0172/v0001
72K	/services-disk16/sdr2objects/fr/360/kt/0172/fr360kt0172/v0002
62M	/services-disk16/sdr2objects/fr/360/kt/0172/fr360kt0172/v0003
100K	/services-disk16/sdr2objects/fr/360/kt/0172/fr360kt0172/v0004
1.7G	total
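
The "< 10G" threshold comes from the -s 10g flag on the zip command recorded in create_info (shown further down): zip splits the archive into ~10GB segments, so the expected number of parts is just the zip size divided by the segment size, rounded up. A back-of-the-envelope sketch of that arithmetic (segment size assumed to be a round 10 GB; zip's exact k/m/g multiplier doesn't matter at these sizes):

# With -0 (store, no compression) the zip is roughly the size of the files,
# and -s 10g splits it into ~10GB segments, so anything under 10GB => 1 part.
SEGMENT_SIZE = 10 * 1024**3

def expected_parts(zip_size_bytes)
  (zip_size_bytes.to_f / SEGMENT_SIZE).ceil
end

expected_parts(1_693_787_987) # => 1 (the v0001 zip size seen later in this thread)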

There are 4 files for this druid on AWS west 2, one per version, as expected.

[dlss-jmorley:/data/sdr/scripts ]$ ./list_druid_keys.sh fr360kt0172
fr/360/kt/0172/fr360kt0172.v0001.zip
fr/360/kt/0172/fr360kt0172.v0002.zip
fr/360/kt/0172/fr360kt0172.v0003.zip
fr/360/kt/0172/fr360kt0172.v0004.zip

@julianmorley (Member Author) commented:

There are four ZMVs for this druid, as expected, on endpoint 1 (AWS-west):

 pry(main)> zmv = ZippedMoabVersion.by_druid("fr360kt0172").where(zip_endpoint_id: '1')
=> [#<ZippedMoabVersion:0x00000000052c3850
  id: 8554508,
  version: 4,
  complete_moab_id: 1435015,
  zip_endpoint_id: 1,
  created_at: Sat, 12 Jan 2019 01:54:17 UTC +00:00,
  updated_at: Sat, 12 Jan 2019 01:54:17 UTC +00:00>,
 #<ZippedMoabVersion:0x00000000052c36c0
  id: 8247572,
  version: 3,
  complete_moab_id: 1435015,
  zip_endpoint_id: 1,
  created_at: Tue, 23 Oct 2018 20:14:38 UTC +00:00,
  updated_at: Tue, 23 Oct 2018 20:14:38 UTC +00:00>,
 #<ZippedMoabVersion:0x00000000052c3530
  id: 8237416,
  version: 2,
  complete_moab_id: 1435015,
  zip_endpoint_id: 1,
  created_at: Mon, 22 Oct 2018 18:26:02 UTC +00:00,
  updated_at: Mon, 22 Oct 2018 18:26:02 UTC +00:00>,
 #<ZippedMoabVersion:0x00000000052c33f0
  id: 8223951,
  version: 1,
  complete_moab_id: 1435015,
  zip_endpoint_id: 1,
  created_at: Fri, 19 Oct 2018 21:55:58 UTC +00:00,
  updated_at: Fri, 19 Oct 2018 21:55:58 UTC +00:00>]

@julianmorley (Member Author) commented:

Stepping through each of those ZMVs to find the ZipPart with a problem eventually leads to this:

[7] pry(main)> ZipPart.where(zipped_moab_version_id: '8223951')
=> [#<ZipPart:0x0000000004b16c38
  id: 13353414,
  size: 1693787987,
  zipped_moab_version_id: 8223951,
  created_at: Tue, 19 Mar 2019 18:15:23 UTC +00:00,
  updated_at: Tue, 19 Mar 2019 18:15:23 UTC +00:00,
  md5: "596f353e32df0475acb9739abe9a4914",
  create_info: "{:zip_cmd=>\"zip -r0X -s 10g /sdr-transfers/fr/360/kt/0172/fr360kt0172.v0001.zip fr360kt0172/v0001\", :zip_version=>\"Zip 3.0 (July 5th 2008)\"}",
  parts_count: 1,
  suffix: ".zip",
  status: "unreplicated",
  last_existence_check: nil,
  last_checksum_validation: nil>,
 #<ZipPart:0x0000000004b13da8
  id: 8283510,
  size: 846909867,
  zipped_moab_version_id: 8223951,
  created_at: Fri, 19 Oct 2018 21:56:16 UTC +00:00,
  updated_at: Wed, 24 Jul 2019 07:11:19 UTC +00:00,
  md5: "6ed89de292cbb13389377ade0ff5673b",
  create_info: "{:zip_cmd=>\"zip -r0X -s 10g /sdr-transfers/fr/360/kt/0172/fr360kt0172.v0001.zip fr360kt0172/v0001\", :zip_version=>\"Zip 3.0 (July 5th 2008)\"}",
  parts_count: 1,
  suffix: ".zip",
  status: "ok",
  last_existence_check: Wed, 24 Jul 2019 07:11:19 UTC +00:00,
  last_checksum_validation: Wed, 24 Jul 2019 07:11:19 UTC +00:00>]

One is "ok" and one is "unreplicated". There should only be one record, but which one is the right one? You'd think it was the one with the status of "ok", but check out the file size. We know from looking at the filesystem that version 1 of this druid should be around 1.6GB, not 840MB. What did we upload?

[dlss-jmorley:/data/sdr/scripts ]$ cat /data/sdr/archives/md/sul-sdr-aws-us-west-2-archive/fr/360/kt/0172/fr360kt0172.v0001.zip
{
    "AcceptRanges": "bytes", 
    "ContentType": "", 
    "LastModified": "Fri, 19 Oct 2018 21:56:21 GMT", 
    "ContentLength": 846909867, 
    "ETag": "\"120ac73fff4b112b904783d04d1ca2d8-162\"", 
    "StorageClass": "DEEP_ARCHIVE", 
    "Metadata": {
        "size": "846909867", 
        "parts_count": "1", 
        "zip_version": "Zip 3.0 (July 5th 2008)", 
        "checksum_md5": "6ed89de292cbb13389377ade0ff5673b", 
        "zip_cmd": "zip -r0X -s 10g /sdr-transfers/fr/360/kt/0172/fr360kt0172.v0001.zip fr360kt0172/v0001"
    }
}

Huh. We uploaded 840MB. What does our IBM copy look like?

[dlss-jmorley:/data/sdr/scripts ]$ cat /data/sdr/archives/md/sul-sdr-ibm-us-south-1-prod/fr/360/kt/0172/fr360kt0172.v0001.zip
{
    "AcceptRanges": "bytes", 
    "ContentType": "application/zip", 
    "LastModified": "Tue, 19 Mar 2019 18:16:21 GMT", 
    "ContentLength": 1693787987, 
    "ETag": "\"db33d783e20b5d4075cfc43df783a242-324\"", 
    "StorageClass": "GLACIER", 
    "PartsCount": 324, 
    "Metadata": {
        "zip_version": "Zip 3.0 (July 5th 2008)", 
        "parts_count": "1", 
        "size": "1693787987", 
        "checksum_md5": "596f353e32df0475acb9739abe9a4914", 
        "zip_cmd": "zip -r0X -s 10g /sdr-transfers/fr/360/kt/0172/fr360kt0172.v0001.zip fr360kt0172/v0001"
    }
}

IBM is correct. So the fix here is to delete the object from AWS, delete the version 1 ZMV for AWS, and re-process.

But the bug here is how the heck did Zipmaker manage to create a partial zip and think it was done? And then upload it? Did the zip binary get killed partway through zip creation, and then plexer and delivery just ran with it?
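
For reference, the same object metadata shown above can also be pulled live from the bucket with aws-sdk-s3. A minimal sketch, assuming the gem is available and using a placeholder bucket name rather than the real archive bucket:

require 'aws-sdk-s3'

# head_object returns the stored size, ETag, storage class, and the user
# metadata (checksum_md5, parts_count, zip_cmd, ...) attached at upload time.
s3 = Aws::S3::Client.new(region: 'us-west-2')
resp = s3.head_object(
  bucket: 'example-archive-bucket', # placeholder, not the real bucket name
  key: 'fr/360/kt/0172/fr360kt0172.v0001.zip'
)
resp.content_length # compare against ZipPart#size
resp.metadata       # includes checksum_md5 to compare against ZipPart#md5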

@julianmorley (Member Author) commented:

Get the ZMV we need to kill; we already know the ID from our investigations above.

[6] pry(main)> zmv = ZippedMoabVersion.find_by(id: '8223951')
=> #<ZippedMoabVersion:0x00000000076ea008
 id: 8223951,
 version: 1,
 complete_moab_id: 1435015,
 zip_endpoint_id: 1,
 created_at: Fri, 19 Oct 2018 21:55:58 UTC +00:00,
 updated_at: Fri, 19 Oct 2018 21:55:58 UTC +00:00>

Kill it with fire.

[7] pry(main)> zmv.destroy
=> #<ZippedMoabVersion:0x00000000076ea008
 id: 8223951,
 version: 1,
 complete_moab_id: 1435015,
 zip_endpoint_id: 1,
 created_at: Fri, 19 Oct 2018 21:55:58 UTC +00:00,
 updated_at: Fri, 19 Oct 2018 21:55:58 UTC +00:00>

Now re-process the Druid to re-create missing ZMVs.

[8] pry(main)> CompleteMoab.by_druid('fr360kt0172').first.create_zipped_moab_versions!
=> [#<ZippedMoabVersion:0x0000000007aac538
  id: 13994168,
  version: 1,
  complete_moab_id: 1435015,
  zip_endpoint_id: 1,
  created_at: Wed, 07 Aug 2019 19:05:36 UTC +00:00,
  updated_at: Wed, 07 Aug 2019 19:05:36 UTC +00:00>]

And check our work. The new ZMV is there:

[10] pry(main)> zmv = ZippedMoabVersion.by_druid("fr360kt0172").where(zip_endpoint_id: '1')
=> [#<ZippedMoabVersion:0x0000000007b3ce58
  id: 13994168,
  version: 1,
  complete_moab_id: 1435015,
  zip_endpoint_id: 1,
  created_at: Wed, 07 Aug 2019 19:05:36 UTC +00:00,
  updated_at: Wed, 07 Aug 2019 19:05:36 UTC +00:00>,
  ...]
[11] pry(main)> ZipmakerJob.perform_later('fr360kt0172', 1)
Enqueued ZipmakerJob (Job ID: 3c0e4e4a-83e7-43de-98ec-6e4011d0567b) to Resque(zipmaker) with arguments: "fr360kt0172", 1
=> #<ZipmakerJob:0x00000000078180b0 @arguments=["fr360kt0172", 1], @executions=0, @job_id="3c0e4e4a-83e7-43de-98ec-6e4011d0567b", @priority=nil, @queue_name="zipmaker">

Verify that zipmaker made a new zip of the correct size and that the upload completed, then delete the old (incorrect) metadata so it'll get refreshed.

[dlss-jmorley:/data/sdr/scripts ]$ ls -l /sdr-transfers/fr/360/kt/0172*
total 1660612
-rw-rw-r-- 1 sdr sdr 1693787987 Aug  7 12:14 fr360kt0172.v0001.zip
-rw-rw-r-- 1 sdr sdr         32 Aug  7 12:14 fr360kt0172.v0001.zip.md5

[dlss-jmorley:/data/sdr/scripts ]$ ./list_druid_keys.sh fr360kt0172
fr/360/kt/0172/fr360kt0172.v0001.zip
fr/360/kt/0172/fr360kt0172.v0002.zip
fr/360/kt/0172/fr360kt0172.v0003.zip
fr/360/kt/0172/fr360kt0172.v0004.zip

[dlss-jmorley:/data/sdr/scripts ]$ rm /data/sdr/archives/md/sul-sdr-aws-us-west-2-archive/fr/360/kt/0172/fr360kt0172.v0001.zip

And we're done! I hope.

@mjgiarlo (Member) commented:

Heads-up ☝️, upcoming first-responders @peetucket @jermnelson

@jmartin-sul (Member) commented:

> Did the zip binary get killed partway through zip creation, and then plexer and delivery just ran with it?

that seems plausible. or zip hiccuped, produced a bad archive, and still gave a good exit code. presumably, if you run zip enough times, you'll get some bad ones.

two thoughts for possible checks:

  • zip size should be around the size on disk, and maybe should always be greater? we're running zip w/ no compression, iirc, to help make the zips more reproducible, so it's files + metadata, which has to be larger than the files alone, right? (sketched at the end of this comment)
  • maybe too slow to do all the time, but what about running zip verification randomly, some percentage of the time?

i believe both those changes would live in zipmaker. though i could also see audits that look at the size on disk and compare it to the reported zip size in the cloud.
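
a rough sketch of the size check from the first bullet above (not actual zipmaker code; the paths are this ticket's example, the directory walk stands in for whatever helper zipmaker would really use, and it ignores multi-part zips for simplicity):

# with -0 (store, no compression) the zip should never be smaller than the
# files it contains, so a zip smaller than the moab version dir is a red flag.
zip_path     = '/sdr-transfers/fr/360/kt/0172/fr360kt0172.v0001.zip'
version_path = '/services-disk16/sdr2objects/fr/360/kt/0172/fr360kt0172/v0001'

bytes_on_disk = Dir.glob(File.join(version_path, '**', '*'))
                   .select { |f| File.file?(f) }
                   .sum { |f| File.size(f) }

zip_bytes = File.size(zip_path)
raise "suspiciously small zip: #{zip_bytes} < #{bytes_on_disk} bytes on disk" if zip_bytes < bytes_on_disk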

@peetucket (Member) commented:

Diagram of the process for any new developers joining this ticket: https://github.com/sul-dlss/preservation_catalog/blob/master/app/jobs/README.md

@jmartin-sul (Member) commented:

looked into the HB alerts @mjgiarlo mentioned, pairing w/ @peetucket on friday. in the course of that, i started to suspect that those alerts indicate a different issue from what's described in this ticket. filed #1197 for that, including the details i turned up working with peter on friday.

@mjgiarlo (Member) commented:

@jmartin-sul @julianmorley I'm not quite sure what to do about this one, both in terms of reproducing it and in terms of remediating this going forward. Are there any tendrils of the busted object lying around?

AFAICT, if the zip command failed partway through, an alert should have been raised around here:

raise "zipmaker failure #{combined}" unless status.success?

Do we want to add a check below that line that validates each zip file (e.g., using unzip -t)? That will make the jobs take longer and is but a shot in the dark.
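
Something in this direction, maybe (a sketch of the kind of check meant here, not a patch against the real zipmaker code; a multi-part zip would need its segments recombined first, which this ignores):

require 'open3'

# After the existing status.success? check: unzip -t reads every entry and
# verifies its CRC without extracting, so a truncated archive fails here.
zip_path = '/sdr-transfers/fr/360/kt/0172/fr360kt0172.v0001.zip' # example path from this ticket
combined, status = Open3.capture2e('unzip', '-t', zip_path)
raise "zip integrity check failed for #{zip_path}: #{combined}" unless status.success?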

@jmartin-sul (Member) commented:

i filed #1302 as an immediately actionable sanity check that should allow us to catch incorrectly created zip files for moab versions.

i'm leaving this ticket open for now, since there may be further investigation to do, and since there's another possible check described in #1194 (comment) which might be worth implementing (random verification of generated zip files).

if @julianmorley thinks this can be closed (e.g. because the problem occurs rarely and because #1302 might catch what occurrences we do see), i'd be fine with closing it.

@jmartin-sul (Member) commented May 21, 2020:

i just put this in the ready column on the zenhub board for the Q2 2020 maintenance WC.

first thing to do would be to confirm that there is still a problem with the state of what's replicated: check whether all expected ZippedMoabVersion and ZipPart rows are present and replicated/ok according to the database, then run the cloud audit code to confirm that the parts are still out there as expected.
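
a quick sketch of the database half of that check, reusing the console queries from earlier in this thread (the summary loop is mine; everything should come back with status "ok"):

# list every ZMV for the druid across all endpoints and the status of its ZipPart rows
ZippedMoabVersion.by_druid('fr360kt0172').order(:zip_endpoint_id, :version).each do |zmv|
  parts = ZipPart.where(zipped_moab_version_id: zmv.id)
  puts "endpoint #{zmv.zip_endpoint_id} v#{zmv.version}: #{parts.count} part(s), statuses #{parts.map(&:status).uniq.inspect}"
end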

some helpful docs for that sort of sleuthing:

@jmartin-sul added the replication_failure label on May 22, 2020
@aaron-collier self-assigned this on May 22, 2020
@aaron-collier (Contributor) commented:

Verified that everything shows as replicated in the database and that the files exist on the cloud endpoints. Closing.
