sb430hm9241, bk258yz9519, gx410cs0527: not all ZippedMoabVersion parts are replicated yet #1197

Closed
jmartin-sul opened this issue Aug 19, 2019 · 6 comments
Labels: replication_failure (failure to replicate specific object(s), whether due to cloud provider hiccup or bug in our code)

@jmartin-sul (Member)

e.g.

PartReplicationAuditJob(sb430hm9241, services-disk18) 1 on aws_s3_west_2: not all ZippedMoabVersion parts are replicated yet: [#<ZipPart id: 13496405, size: 177992078, zipped_moab_version_id: 13378840, created_at: "2019-05-09 21:44:15", updated_at: "2019-05-09 21:44:15", md5: "595f4a6c63537ee2c1c4f06bf02a787c", create_info: "{:zip_cmd=>\"zip -r0X -s 10g /sdr-transfers/sb/430/...", parts_count: 1, suffix: ".zip", status: "unreplicated", last_existence_check: nil, last_checksum_validation: nil>]
PartReplicationAuditJob(bk258yz9519, services-disk17) 1 on aws_s3_west_2: not all ZippedMoabVersion parts are replicated yet: ...
PartReplicationAuditJob(gx410cs0527, services-disk16) 1 on aws_s3_west_2: not all ZippedMoabVersion parts are replicated yet: ...

this error has come up repeatedly, but so far we haven't really looked into it, afaik (assuming that this was a result of audit running before upload had finished, and that things would naturally catch up?).

i began looking into this on friday, with @peetucket, picking up the investigation @mjgiarlo started earlier in the week as first responder. @mjgiarlo referenced #1194, which i took to mean he thought the error in this issue's title was a manifestation of the problem in #1194. reading back over the slack chat, i'm not 100% sure if that's what he was saying, or if he was just pointing out that #1194 shows "that a lack of failed jobs in resque should not reassure us" that upload was successful. regardless, i think this is a different issue: a zip part that was never replicated at all, as opposed to a partial/corrupt zip being sent to an endpoint (as was seen in #1194).

for example, when looking in preservation catalog's DB for info about sb430hm9241, i noticed:

> ZipPart.joins({ zipped_moab_version: [:preserved_object, :zip_endpoint] }).where(preserved_objects: { druid: 'sb430hm9241'}).select('zip_parts.*, zip_endpoints.endpoint_name as endpoint_name, preserved_objects.druid as druid, zipped_moab_versions.version as version').map { |zp| [zp.druid, zp.version, zp, zp.endpoint_name] }
=> [["sb430hm9241",
  1,
  #<ZipPart:0x00000000066f3b40
   id: 13496404,
   size: 177992078,
   zipped_moab_version_id: 13378841,
   created_at: Thu, 09 May 2019 21:44:15 UTC +00:00,
   updated_at: Thu, 09 May 2019 21:44:21 UTC +00:00,
   md5: "595f4a6c63537ee2c1c4f06bf02a787c",
   create_info:
    "{:zip_cmd=>\"zip -r0X -s 10g /sdr-transfers/sb/430/hm/9241/sb430hm9241.v0001.zip sb430hm9241/v0001\", :zip_version=>\"Zip 3.0 (July 5th 2008)\"}",
   parts_count: 1,
   suffix: ".zip",
   status: "ok",
   last_existence_check: nil,
   last_checksum_validation: nil>,
  "ibm_us_south"],
 ["sb430hm9241",
  1,
  #<ZipPart:0x00000000066f39b0
   id: 13496405,
   size: 177992078,
   zipped_moab_version_id: 13378840,
   created_at: Thu, 09 May 2019 21:44:15 UTC +00:00,
   updated_at: Thu, 09 May 2019 21:44:15 UTC +00:00,
   md5: "595f4a6c63537ee2c1c4f06bf02a787c",
   create_info:
    "{:zip_cmd=>\"zip -r0X -s 10g /sdr-transfers/sb/430/hm/9241/sb430hm9241.v0001.zip sb430hm9241/v0001\", :zip_version=>\"Zip 3.0 (July 5th 2008)\"}",
   parts_count: 1,
   suffix: ".zip",
   status: "unreplicated",
   last_existence_check: nil,
   last_checksum_validation: nil>,
  "aws_s3_west_2"]]

note above that the listed size is the same for both zip parts, that the parts are both pretty small, and that there's one listed for each endpoint, as expected. but... the AWS one is listed as unreplicated. i couldn't easily find credentials and info on friday for querying AWS in production, so i'd be interested to pair w/ @julianmorley on that. my next instinct would be to see what we actually have up in amazon (a sketch of that sort of spot check is below).

queries for the other druids listed at the top of the issue returned results of a similar character to the details above.
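for reference, the kind of spot check i have in mind would look something like this (a minimal sketch assuming the aws-sdk-s3 gem and that we can dig up the production credentials; the bucket name and region below are placeholders, not pulled from any real config):

require 'aws-sdk-s3'

# placeholder bucket/region, not the real production values
s3 = Aws::S3::Client.new(region: 'us-west-2')
key = 'sb/430/hm/9241/sb430hm9241.v0001.zip'
begin
  resp = s3.head_object(bucket: 'example-sdr-aws-archive-bucket', key: key)
  # compare what the endpoint reports against the catalog's ZipPart row
  puts "found #{key}: #{resp.content_length} bytes, md5 metadata #{resp.metadata['checksum_md5']}"
rescue Aws::S3::Errors::NotFound
  puts "#{key} is not on the endpoint"
end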

some other notes:

  • per @mjgiarlo, audit alerts for this issue have "occurred ~1.5M times in 11mos." (and "83 times in the past 11 days").
  • all honeybadger alerts were automatically resolved without comment by the deployments done for dependency update mondays, because HB auto-resolves all outstanding alerts for an app when a deployment is made. is this behavior tweakable? do others find this default sensible? i've never really understood the assumption that a deployment will fix all outstanding issues, but 🤷‍♀
  • this is definitely worrisome, and i'd advocate resourcing investigation sooner rather than later (both for impacted druids, and for figuring out the root cause of the failure to properly create and upload the zips). but this is at least not a terribly widespread problem. e.g.:
> ZipPart.unreplicated.count * 1.0 / ZipPart.count
=> 0.0006588031424072938
# so about 0.066% of zip parts are seen by the catalog as unreplicated.  would be interesting to get this by size or by number of partially replicated druids (sketched after these notes), though even this crude measure is reassuring
  • more worrisome, we have unreplicated zips spanning the entirety of the time we've been shipping zips to the cloud. e.g.:
> ZipPart.unreplicated.minimum(:created_at)
=> 2018-08-03 06:26:33 UTC
> ZipPart.unreplicated.maximum(:created_at)
=> 2019-08-15 08:14:13 UTC
# ran both queries on friday
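for completeness, the follow-up measurements mentioned in the notes above might look something like this in the rails console (a sketch i haven't run, using the same models/associations as the queries earlier):

# share of unreplicated zip parts by total byte size, rather than by row count
ZipPart.unreplicated.sum(:size) * 1.0 / ZipPart.sum(:size)

# number of distinct druids with at least one unreplicated zip part
ZipPart.unreplicated
       .joins(zipped_moab_version: :preserved_object)
       .distinct
       .count('preserved_objects.druid')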

TL;DR:

  • my current hunch is that this is a distinct issue from #1194 ("ZMV parts_count doesn't match zip_parts row count for fr360kt0172")
  • this seems to be an even more common problem, and we should probably resource investigation of it so that we can have more confidence in our cloud archived zip copies.
    • but crude measurement indicates this is a problem with < 1% of archive copies, at least by zip part count.
@julianmorley (Member)

The 'good' news is that sb430hm9241 is on AWS:

[dlss-jmorley:/data/sdr/scripts ]$ ./list_druid_keys.sh sb430hm9241
sb/430/hm/9241/sb430hm9241.v0001.zip 

And the size of that zip on AWS matches what PresCat thinks it should be:

[dlss-jmorley:/data/sdr/scripts ]$ cat /data/sdr/archives/md/sul-sdr-aws-us-west-2-archive/sb/430/hm/9241/sb430hm9241.v0001.zip
{
    "AcceptRanges": "bytes", 
    "ContentType": "", 
    "LastModified": "Thu, 09 May 2019 21:44:17 GMT", 
    "ContentLength": 177992078, 
    "ETag": "\"450ef0d4dd7a68ae23e6876ed7bfc9c9-34\"", 
    "StorageClass": "DEEP_ARCHIVE", 
    "Metadata": {
        "size": "177992078", 
        "parts_count": "1", 
        "zip_version": "Zip 3.0 (July 5th 2008)", 
        "checksum_md5": "595f4a6c63537ee2c1c4f06bf02a787c", 
        "zip_cmd": "zip -r0X -s 10g /sdr-transfers/sb/430/hm/9241/sb430hm9241.v0001.zip sb430hm9241/v0001"
    }
}

So it's there, and has been there since May. Looks like maybe the recorder didn't correctly record success?

@jmartin-sul (Member Author) commented Aug 22, 2019

hmm... so maybe there's some status updating to be done by audit, to make objects go into the ok state if they exist in the cloud as expected? that might require some storytime-ish eng design, because i think we want to do the sort of size check (at least) that you did manually here (as opposed to just seeing it out there at all, and updating to ok).
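as a rough sketch of what i mean (made-up method and helper names, not actual pres cat code; assumes an aws-sdk-s3 client and the right bucket are already wired up):

# rough sketch only: given a ZipPart the catalog thinks is unreplicated, check the
# cloud copy and only flip the status to ok if (at least) the recorded size matches.
def remediate_unreplicated_part(zip_part, s3_client, bucket)
  key = zip_part.s3_key # assumes a helper that builds keys like 'sb/430/hm/9241/sb430hm9241.v0001.zip'
  resp = s3_client.head_object(bucket: bucket, key: key)
  zip_part.update(status: 'ok') if resp.content_length == zip_part.size
rescue Aws::S3::Errors::NotFound
  nil # really missing from the endpoint; leave it unreplicated
end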

@julianmorley (Member)

Yup, it definitely looks like there's an audit opportunity there. Is it there? Does the size reported by S3 match what's in the database? Does the parts_count value match what's in the database? Are there actually that many parts on S3? etc.
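A rough sketch of those checks, with made-up names (not actual PresCat code; s3_key is assumed to be a helper that builds the keys shown above, and the prefix derivation is a guess at the key layout):

# Rough sketch: compare the catalog's expectations against what S3 reports.
def parts_consistent?(zipped_moab_version, s3_client, bucket)
  parts = zipped_moab_version.zip_parts
  expected = parts.first.parts_count

  # parts_count recorded in the S3 object metadata (see the head output above)
  meta_count = s3_client.head_object(bucket: bucket, key: parts.first.s3_key)
                        .metadata['parts_count'].to_i

  # number of part keys actually present under this version's prefix
  # (keys look like ...v0001.zip, ...v0001.z01, ...)
  prefix = parts.first.s3_key.sub(/\.z[^.]*\z/, '')
  on_s3 = s3_client.list_objects_v2(bucket: bucket, prefix: prefix).contents.size

  [parts.count, meta_count, on_s3].all? { |n| n == expected }
end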

@ndushay (Contributor) commented Jan 28, 2020

See also https://app.honeybadger.io/projects/54415/faults/52770216 for rm853tx9183

@aaron-collier removed their assignment Apr 16, 2020
@jmartin-sul (Member Author) commented May 27, 2020

the first thing to do is to see what we have noted in pres cat's database, and what we have archived in the cloud at the moment (and thus, whether this is still an issue). here are some instructions to get started when looking at replication problems: https://github.com/sul-dlss/preservation_catalog/wiki/Investigating-a-druid-with-replication-errors

if the moabs are fully replicated to all cloud endpoints, we may just have some database cleanup to do. if everything is replicated and the database looks good, we should be able to close this without further action.

druids to check, from the description (a quick console check is sketched after this list):

  • sb430hm9241
  • bk258yz9519
  • gx410cs0527
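a quick console check along the lines of the query in the issue description might look like this (same associations as above; just prints a count per endpoint/status for each druid):

%w[sb430hm9241 bk258yz9519 gx410cs0527].each do |druid|
  ZipPart.joins(zipped_moab_version: [:preserved_object, :zip_endpoint])
         .where(preserved_objects: { druid: druid })
         .group('zip_endpoints.endpoint_name', :status)
         .count
         .each { |(endpoint, status), n| puts "#{druid} #{endpoint} #{status}: #{n}" }
end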

@jmartin-sul changed the title from "not all ZippedMoabVersion parts are replicated yet" to "sb430hm9241, bk258yz9519, gx410cs0527: not all ZippedMoabVersion parts are replicated yet" May 27, 2020
@jmartin-sul added the "catalog to archive" and "replication_failure" labels May 27, 2020
@aaron-collier self-assigned this Jun 2, 2020
@aaron-collier (Contributor)

Verified the parts were replicated and match the size in the database; updated the db status accordingly.
