ZMV parts_count doesn't match zip_parts row count for fr360kt0172 #1194

Closed · julianmorley opened this issue Aug 7, 2019 · 13 comments

Labels: bug · replication (replication related questions or issues) · replication_failure (failure to replicate specific object(s), whether due to cloud provider hiccup or bug in our code)

@julianmorley (Member) commented Aug 7, 2019:

https://app.honeybadger.io/projects/54415/faults/39235659

PartReplicationAuditJob(fr360kt0172, services-disk16) 1 on aws_s3_west_2: ZippedMoabVersion stated parts count (1) doesn't match actual number of zip parts rows (2)

EDIT: The bug here is that PresCat apparently started to create the zip file for version 1 of this druid but failed, leaving a zip file that was only ~800MB instead of the expected ~1.6GB. That didn't stop subsequent processes from uploading the incomplete version zip to AWS and recording it as 'done'.

@julianmorley added the bug and replication labels on Aug 7, 2019
@julianmorley (Member Author) commented:

I think we need to find the extraneous zip_parts row and delete it to clear this error (and verify that there should, in fact, be only 1 .zip file for this druid-version).
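
One way to narrow that down from a Rails console. This is a rough sketch reusing the models queried later in this thread; the loop and the output formatting are mine, not existing audit code:

# For each ZMV for this druid on the AWS west endpoint (zip_endpoint_id 1),
# compare the parts_count stated on its ZipPart rows to the number of rows present.
ZippedMoabVersion.by_druid('fr360kt0172').where(zip_endpoint_id: 1).each do |zmv|
  parts = ZipPart.where(zipped_moab_version_id: zmv.id)
  puts "ZMV #{zmv.id} (v#{zmv.version}): #{parts.count} part row(s), stated parts_count #{parts.map(&:parts_count).uniq.inspect}"
end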

@julianmorley (Member Author) commented:

4 versions, each is < 10G, so we'd expect 1 zip part per version.

[dlss-jmorley:/data/sdr/scripts ]$ du -csh /services-disk16/sdr2objects/fr/360/kt/0172/fr360kt0172/v000*
1.6G	/services-disk16/sdr2objects/fr/360/kt/0172/fr360kt0172/v0001
72K	/services-disk16/sdr2objects/fr/360/kt/0172/fr360kt0172/v0002
62M	/services-disk16/sdr2objects/fr/360/kt/0172/fr360kt0172/v0003
100K	/services-disk16/sdr2objects/fr/360/kt/0172/fr360kt0172/v0004
1.7G	total
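
The "< 10G" threshold comes from the -s 10g flag on the zip command recorded in create_info (shown further down): zip splits the archive into ~10GB segments, so the expected number of parts is just the zip size divided by the segment size, rounded up. A back-of-the-envelope sketch of that arithmetic (segment size assumed to be a round 10 GB; zip's exact k/m/g multiplier doesn't matter at these sizes):

# With -0 (store, no compression) the zip is roughly the size of the files,
# and -s 10g splits it into ~10GB segments, so anything under 10GB => 1 part.
SEGMENT_SIZE = 10 * 1024**3

def expected_parts(zip_size_bytes)
  (zip_size_bytes.to_f / SEGMENT_SIZE).ceil
end

expected_parts(1_693_787_987) # => 1 (the v0001 zip size seen later in this thread)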

There are 4 files for this druid on AWS west 2, one per version, as expected.

[dlss-jmorley:/data/sdr/scripts ]$ ./list_druid_keys.sh fr360kt0172
fr/360/kt/0172/fr360kt0172.v0001.zip
fr/360/kt/0172/fr360kt0172.v0002.zip
fr/360/kt/0172/fr360kt0172.v0003.zip
fr/360/kt/0172/fr360kt0172.v0004.zip

@julianmorley (Member Author) commented:

There are four ZMVs for this druid, as expected, on endpoint 1 (AWS-west):

 pry(main)> zmv = ZippedMoabVersion.by_druid("fr360kt0172").where(zip_endpoint_id: '1')
=> [#<ZippedMoabVersion:0x00000000052c3850
  id: 8554508,
  version: 4,
  complete_moab_id: 1435015,
  zip_endpoint_id: 1,
  created_at: Sat, 12 Jan 2019 01:54:17 UTC +00:00,
  updated_at: Sat, 12 Jan 2019 01:54:17 UTC +00:00>,
 #<ZippedMoabVersion:0x00000000052c36c0
  id: 8247572,
  version: 3,
  complete_moab_id: 1435015,
  zip_endpoint_id: 1,
  created_at: Tue, 23 Oct 2018 20:14:38 UTC +00:00,
  updated_at: Tue, 23 Oct 2018 20:14:38 UTC +00:00>,
 #<ZippedMoabVersion:0x00000000052c3530
  id: 8237416,
  version: 2,
  complete_moab_id: 1435015,
  zip_endpoint_id: 1,
  created_at: Mon, 22 Oct 2018 18:26:02 UTC +00:00,
  updated_at: Mon, 22 Oct 2018 18:26:02 UTC +00:00>,
 #<ZippedMoabVersion:0x00000000052c33f0
  id: 8223951,
  version: 1,
  complete_moab_id: 1435015,
  zip_endpoint_id: 1,
  created_at: Fri, 19 Oct 2018 21:55:58 UTC +00:00,
  updated_at: Fri, 19 Oct 2018 21:55:58 UTC +00:00>]

@julianmorley (Member Author) commented:

Stepping through each of those ZMVs to find the ZipPart with a problem eventually leads to this:

[7] pry(main)> ZipPart.where(zipped_moab_version_id: '8223951')
=> [#<ZipPart:0x0000000004b16c38
  id: 13353414,
  size: 1693787987,
  zipped_moab_version_id: 8223951,
  created_at: Tue, 19 Mar 2019 18:15:23 UTC +00:00,
  updated_at: Tue, 19 Mar 2019 18:15:23 UTC +00:00,
  md5: "596f353e32df0475acb9739abe9a4914",
  create_info: "{:zip_cmd=>\"zip -r0X -s 10g /sdr-transfers/fr/360/kt/0172/fr360kt0172.v0001.zip fr360kt0172/v0001\", :zip_version=>\"Zip 3.0 (July 5th 2008)\"}",
  parts_count: 1,
  suffix: ".zip",
  status: "unreplicated",
  last_existence_check: nil,
  last_checksum_validation: nil>,
 #<ZipPart:0x0000000004b13da8
  id: 8283510,
  size: 846909867,
  zipped_moab_version_id: 8223951,
  created_at: Fri, 19 Oct 2018 21:56:16 UTC +00:00,
  updated_at: Wed, 24 Jul 2019 07:11:19 UTC +00:00,
  md5: "6ed89de292cbb13389377ade0ff5673b",
  create_info: "{:zip_cmd=>\"zip -r0X -s 10g /sdr-transfers/fr/360/kt/0172/fr360kt0172.v0001.zip fr360kt0172/v0001\", :zip_version=>\"Zip 3.0 (July 5th 2008)\"}",
  parts_count: 1,
  suffix: ".zip",
  status: "ok",
  last_existence_check: Wed, 24 Jul 2019 07:11:19 UTC +00:00,
  last_checksum_validation: Wed, 24 Jul 2019 07:11:19 UTC +00:00>]

One is "ok" and one is "unreplicated". There should only be one record, but which one is the right one? You'd think it was the one with the status of "ok", but check out the file size. We know from looking at the filesystem that version 1 of this druid should be around 1.6GB, not 840MB. What did we upload?

[dlss-jmorley:/data/sdr/scripts ]$ cat /data/sdr/archives/md/sul-sdr-aws-us-west-2-archive/fr/360/kt/0172/fr360kt0172.v0001.zip
{
    "AcceptRanges": "bytes", 
    "ContentType": "", 
    "LastModified": "Fri, 19 Oct 2018 21:56:21 GMT", 
    "ContentLength": 846909867, 
    "ETag": "\"120ac73fff4b112b904783d04d1ca2d8-162\"", 
    "StorageClass": "DEEP_ARCHIVE", 
    "Metadata": {
        "size": "846909867", 
        "parts_count": "1", 
        "zip_version": "Zip 3.0 (July 5th 2008)", 
        "checksum_md5": "6ed89de292cbb13389377ade0ff5673b", 
        "zip_cmd": "zip -r0X -s 10g /sdr-transfers/fr/360/kt/0172/fr360kt0172.v0001.zip fr360kt0172/v0001"
    }
}

Huh. We uploaded 840MB. What does our IBM copy look like?

[dlss-jmorley:/data/sdr/scripts ]$ cat /data/sdr/archives/md/sul-sdr-ibm-us-south-1-prod/fr/360/kt/0172/fr360kt0172.v0001.zip
{
    "AcceptRanges": "bytes", 
    "ContentType": "application/zip", 
    "LastModified": "Tue, 19 Mar 2019 18:16:21 GMT", 
    "ContentLength": 1693787987, 
    "ETag": "\"db33d783e20b5d4075cfc43df783a242-324\"", 
    "StorageClass": "GLACIER", 
    "PartsCount": 324, 
    "Metadata": {
        "zip_version": "Zip 3.0 (July 5th 2008)", 
        "parts_count": "1", 
        "size": "1693787987", 
        "checksum_md5": "596f353e32df0475acb9739abe9a4914", 
        "zip_cmd": "zip -r0X -s 10g /sdr-transfers/fr/360/kt/0172/fr360kt0172.v0001.zip fr360kt0172/v0001"
    }
}

IBM is correct. So the fix here is to delete the object from AWS, delete the version 1 ZMV for AWS, and re-process.

But the bug here is how the heck did Zipmaker manage to create a partial zip and think it was done? And then upload it? Did the zip binary get killed partway through zip creation, and then plexer and delivery just ran with it?
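
For reference, the same object metadata shown above can also be pulled live from the bucket with aws-sdk-s3. A minimal sketch, assuming the gem is available and using a placeholder bucket name rather than the real archive bucket:

require 'aws-sdk-s3'

# head_object returns the stored size, ETag, storage class, and the user
# metadata (checksum_md5, parts_count, zip_cmd, ...) attached at upload time.
s3 = Aws::S3::Client.new(region: 'us-west-2')
resp = s3.head_object(
  bucket: 'example-archive-bucket', # placeholder, not the real bucket name
  key: 'fr/360/kt/0172/fr360kt0172.v0001.zip'
)
resp.content_length # compare against ZipPart#size
resp.metadata       # includes checksum_md5 to compare against ZipPart#md5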

@julianmorley (Member Author) commented:

Get the ZMV we need to kill; we already know the ID from our investigations above.

[6] pry(main)> zmv = ZippedMoabVersion.find_by(id: '8223951')
=> #<ZippedMoabVersion:0x00000000076ea008
 id: 8223951,
 version: 1,
 complete_moab_id: 1435015,
 zip_endpoint_id: 1,
 created_at: Fri, 19 Oct 2018 21:55:58 UTC +00:00,
 updated_at: Fri, 19 Oct 2018 21:55:58 UTC +00:00>

Kill it with fire.

[7] pry(main)> zmv.destroy
=> #<ZippedMoabVersion:0x00000000076ea008
 id: 8223951,
 version: 1,
 complete_moab_id: 1435015,
 zip_endpoint_id: 1,
 created_at: Fri, 19 Oct 2018 21:55:58 UTC +00:00,
 updated_at: Fri, 19 Oct 2018 21:55:58 UTC +00:00>

Now re-process the Druid to re-create missing ZMVs.

[8] pry(main)> CompleteMoab.by_druid('fr360kt0172').first.create_zipped_moab_versions!
=> [#<ZippedMoabVersion:0x0000000007aac538
  id: 13994168,
  version: 1,
  complete_moab_id: 1435015,
  zip_endpoint_id: 1,
  created_at: Wed, 07 Aug 2019 19:05:36 UTC +00:00,
  updated_at: Wed, 07 Aug 2019 19:05:36 UTC +00:00>]

And check our work. The new ZMV is there:

[10] pry(main)> zmv = ZippedMoabVersion.by_druid("fr360kt0172").where(zip_endpoint_id: '1')
=> [#<ZippedMoabVersion:0x0000000007b3ce58
  id: 13994168,
  version: 1,
  complete_moab_id: 1435015,
  zip_endpoint_id: 1,
  created_at: Wed, 07 Aug 2019 19:05:36 UTC +00:00,
  updated_at: Wed, 07 Aug 2019 19:05:36 UTC +00:00>,
  ...]
[11] pry(main)> ZipmakerJob.perform_later('fr360kt0172', 1)
Enqueued ZipmakerJob (Job ID: 3c0e4e4a-83e7-43de-98ec-6e4011d0567b) to Resque(zipmaker) with arguments: "fr360kt0172", 1
=> #<ZipmakerJob:0x00000000078180b0 @arguments=["fr360kt0172", 1], @executions=0, @job_id="3c0e4e4a-83e7-43de-98ec-6e4011d0567b", @priority=nil, @queue_name="zipmaker">

Verify that zipmaker made a new zip of the correct size and that the upload completed, then delete the old (incorrect) metadata so it'll get refreshed.

[dlss-jmorley:/data/sdr/scripts ]$ ls -l /sdr-transfers/fr/360/kt/0172*
total 1660612
-rw-rw-r-- 1 sdr sdr 1693787987 Aug  7 12:14 fr360kt0172.v0001.zip
-rw-rw-r-- 1 sdr sdr         32 Aug  7 12:14 fr360kt0172.v0001.zip.md5

[dlss-jmorley:/data/sdr/scripts ]$ ./list_druid_keys.sh fr360kt0172
fr/360/kt/0172/fr360kt0172.v0001.zip
fr/360/kt/0172/fr360kt0172.v0002.zip
fr/360/kt/0172/fr360kt0172.v0003.zip
fr/360/kt/0172/fr360kt0172.v0004.zip

[dlss-jmorley:/data/sdr/scripts ]$ rm /data/sdr/archives/md/sul-sdr-aws-us-west-2-archive/fr/360/kt/0172/fr360kt0172.v0001.zip

And we're done! I hope.

@mjgiarlo (Member) commented:

Heads-up ☝️, upcoming first-responders @peetucket @jermnelson

@jmartin-sul (Member) commented:

> Did the zip binary get killed partway through zip creation, and then plexer and delivery just ran with it?

that seems plausible. or zip hiccuped, produced a bad archive, and still gave a good exit code. presumably, if you run zip enough times, you'll get some bad ones.

two thoughts for possible checks:

  • zip size should be around the size on disk, and maybe should always be greater? we're running zip w/ no compression, iirc, to help make the zips more reproducible, so it's files + metadata, which has to be larger than the files alone, right? (sketched at the end of this comment)
  • maybe too slow to do all the time, but what about running zip verification randomly, some percentage of the time?

i believe both those changes would live in zipmaker. though i could also see audits that look at the size on disk and compare it to the reported zip size in the cloud.
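
a rough sketch of the size check from the first bullet above (not actual zipmaker code; the paths are this ticket's example, the directory walk stands in for whatever helper zipmaker would really use, and it ignores multi-part zips for simplicity):

# with -0 (store, no compression) the zip should never be smaller than the
# files it contains, so a zip smaller than the moab version dir is a red flag.
zip_path     = '/sdr-transfers/fr/360/kt/0172/fr360kt0172.v0001.zip'
version_path = '/services-disk16/sdr2objects/fr/360/kt/0172/fr360kt0172/v0001'

bytes_on_disk = Dir.glob(File.join(version_path, '**', '*'))
                   .select { |f| File.file?(f) }
                   .sum { |f| File.size(f) }

zip_bytes = File.size(zip_path)
raise "suspiciously small zip: #{zip_bytes} < #{bytes_on_disk} bytes on disk" if zip_bytes < bytes_on_disk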

@peetucket (Member) commented:

Diagram of the process for any new developers joining this ticket: https://github.com/sul-dlss/preservation_catalog/blob/master/app/jobs/README.md

@jmartin-sul (Member) commented:

looked into the HB alerts @mjgiarlo mentioned, pairing w/ @peetucket on friday. in the course of that, i started to suspect that those alerts indicate a different issue from what's described in this ticket. filed #1197 for that, including the details i turned up working with peter on friday.

@mjgiarlo (Member) commented:

@jmartin-sul @julianmorley I'm not quite sure what to do about this one, both in terms of reproducing it and in terms of remediating this going forward. Are there any tendrils of the busted object lying around?

AFAICT, if the zip command failed partway through, an alert should have been raised around here:

raise "zipmaker failure #{combined}" unless status.success?

Do we want to add a check below that line that validates each zip file (e.g., using unzip -t)? That will make the jobs take longer and is but a shot in the dark.
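
Something in this direction, maybe (a sketch of the kind of check meant here, not a patch against the real zipmaker code; a multi-part zip would need its segments recombined first, which this ignores):

require 'open3'

# After the existing status.success? check: unzip -t reads every entry and
# verifies its CRC without extracting, so a truncated archive fails here.
zip_path = '/sdr-transfers/fr/360/kt/0172/fr360kt0172.v0001.zip' # example path from this ticket
combined, status = Open3.capture2e('unzip', '-t', zip_path)
raise "zip integrity check failed for #{zip_path}: #{combined}" unless status.success?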

@jmartin-sul (Member) commented:

i filed #1302 as an immediately actionable sanity check that should allow us to catch incorrectly created zip files for moab versions.

i'm leaving this ticket open for now, since there may be further investigation to do, and since there's another possible check described in #1194 (comment) which might be worth implementing (random verification of generated zip files).

if @julianmorley thinks this can be closed (e.g. because the problem occurs rarely and because #1302 might catch what occurrences we do see), i'd be fine with closing it.

@jmartin-sul (Member) commented May 21, 2020:

i just put this in the ready column on the zenhub board for the Q2 2020 maintenance WC.

first thing to do would be to confirm that there is still a problem with the state of what's replicated: check whether all expected ZippedMoabVersion and ZipPart rows are present and replicated/ok according to the database, then run the cloud audit code to confirm that the parts are still out there as expected.
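
a quick sketch of the database half of that check, reusing the console queries from earlier in this thread (the summary loop is mine; everything should come back with status "ok"):

# list every ZMV for the druid across all endpoints and the status of its ZipPart rows
ZippedMoabVersion.by_druid('fr360kt0172').order(:zip_endpoint_id, :version).each do |zmv|
  parts = ZipPart.where(zipped_moab_version_id: zmv.id)
  puts "endpoint #{zmv.zip_endpoint_id} v#{zmv.version}: #{parts.count} part(s), statuses #{parts.map(&:status).uniq.inspect}"
end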

some helpful docs for that sort of sleuthing:

@jmartin-sul added the replication_failure label on May 22, 2020
@aaron-collier self-assigned this on May 22, 2020
@aaron-collier (Contributor) commented:

Verified that everything shows as replicated in the database and that the files exist on the cloud endpoints. Closing.
