Zip Creation
Our sole use case is atypical: uncompressed, segmented zipfiles. Most users want compression; we don't. Given how critical long-term archival preservation is to this app, we also care more than most users about the stability of how the zipfiles are generated. Ideally, the same binary output (and therefore the same checksums) can be reproduced by this app for decades.

Product ownership required large objects to be segmented into 10GB "zip parts", essentially to compartmentalize any momentary failure, e.g. to prevent uploading (or checksumming) 349 of 350 GB before failing, achieving nothing at great cost. The `aws-sdk-s3` library already supports multi-part, multi-threaded uploads/downloads, so this concern was more with logic and performance on our side of the application. That parallelism is still used, but now with another level of parallelism across segments/parts.
Using zip parts introduces logical complexity that did not exist previously. In addition to simply requiring more commands (more connections) to complete a single upload, it complicates subsequent questions of status, completeness and retrieval. To enable a multi-part object to be reconstructed successfully, we must store additional metadata about parts.
If the way we build zips (parts) changes on a technical level over time, it is plausible that:
- we get different files (different checksums) for the same input
- we get a different number of parts
We address #1 by modeling the `ZippedMoabVersion` per endpoint.

We address #2 by including additional metadata, including `parts_count`. The `s3_key`s for a complete set of parts can be generated from any one object using this count. See `DruidVersionZip#expected_part_keys`.
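As an illustrative sketch (not the actual implementation; the key format, part ordering, and naming convention here are assumptions), the full set of keys can be derived from a base `.zip` key plus `parts_count`, following Info-ZIP's split-archive naming (`.z01`, `.z02`, …):

```ruby
# Hypothetical sketch: derive every expected S3 key for a segmented zip
# from its base key and the recorded parts_count. Info-ZIP split archives
# are named foo.z01, foo.z02, ..., with the final segment named foo.zip.
def expected_part_keys(base_key, parts_count)
  raise ArgumentError, 'parts_count must be >= 1' unless parts_count >= 1

  segments = (1...parts_count).map { |n| base_key.sub(/\.zip\z/, format('.z%02d', n)) }
  segments + [base_key]
end

expected_part_keys('bc123df4567/bc123df4567.v0001.zip', 3)
# => ["bc123df4567/bc123df4567.v0001.z01",
#     "bc123df4567/bc123df4567.v0001.z02",
#     "bc123df4567/bc123df4567.v0001.zip"]
```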
The team considered several different implementation options, including:
- `rubyzip` ruby library
- other ruby compression libraries
- `zipruby` ruby bindings for C libzip
- `zip` shell executable on the system
We opted for the last of these. We recognize that shelling out to the system is not normally the preferred pattern, since it introduces performance cost and additional logical complexity, so this document explains our motivations.
- `rubyzip` is "official" and maintained, but volatile. Using it would require us to include substantially more application logic around the construction of zip parts. Given its volatility, future versions of `rubyzip` would likely break the core purpose of the app. Indeed, breaking changes were released even during the period of our active development. That would necessitate pinning `rubyzip` to a known version, introducing an anchor that would eventually prevent other upgrades to the system, including security fixes.
- `zipruby` is unmaintained and still pre-1.0.
- Other ruby libraries basically suffer one or both of the same liabilities.
- System `zip` is unchanged since 2008.
- When creating sizable zipfiles, the additional "cost" of a shell process is negligible.
Basically, the stability, determinism, and reproducibility of the common `zip` executable won out.
Zip files to be preserved are created by `ZipmakerJob` calling `DruidVersionZip#create_zip!`.
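As a minimal sketch of what that shell-out might look like (the exact flags, paths, and error handling shown are assumptions, not the real `DruidVersionZip` code), the method builds an uncompressed, split archive with the system `zip`:

```ruby
require 'open3'

# Hypothetical sketch of shelling out to the system zip executable.
# -r      recurse into the version directory
# -0      store only (no compression)
# -X      omit extra file attributes, for more reproducible output
# -s 10g  split the archive into 10GB parts (.z01, .z02, ..., .zip)
def create_zip!(zip_path, source_dir)
  command = "zip -r0X -s 10g #{zip_path} #{source_dir}"
  _stdout, stderr, status = Open3.capture3(command)
  raise "zip failed (#{status.exitstatus}): #{stderr}" unless status.success?
end
```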
We attacked performance issues by parallelizing zip creation via jobs/workers. Additionally, `ZipmakerJob`'s responsibilities are designed to be completely severable from the rest of the system, i.e., the task itself does not require database access, and the worker could be replaced by an optimized implementation in any language that can talk to Resque/Redis.
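To illustrate that severability (a hedged sketch under assumed constructor, accessor, and queue names, not the app's actual job), everything the job needs arrives as arguments, so it can run without ActiveRecord:

```ruby
# Hypothetical Resque-style job: it only needs the druid/version identifiers,
# the filesystem, and the zip executable, so it could be reimplemented in any
# language that can pop jobs from Redis.
class ZipmakerJob
  @queue = :zipmaker

  def self.perform(druid, version)
    dvz = DruidVersionZip.new(druid, version)           # assumed constructor
    dvz.create_zip! unless File.exist?(dvz.file_path)   # assumed accessor
    # hand off to per-endpoint delivery/upload jobs from here
  end
end
```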