A locally runnable submission package validator with human readable and useful messages.
- ruby 2.4.x or later
- bundler (
gem install bundler
once Ruby is installed) - git (
apt-get install git
(Debian/Ubuntu) oryum install git
(Fedora/RedHat/CentOS)) - zlib (
apt-get install zlib1g-dev
(Debian/Ubuntu) oryum install libzlib-devel
(Fedora/RedHat/CentOS)) - Mac OS X will likely require XCode command-line tools to be installed
We recommend installing ruby via rbenv with ruby-build or RVM
There is a stand-alone executable as well as an installer available for Windows. Both are available from the releases page. No other pre-requisites are required. These releases have only been tested on 64-bit Windows 10, but are likely to work on earlier versions of Windows as well.
For Windows, download a release; there is an installer as well as a stand-alone executable that doesn't require installation.
For Linux and Mac OS X, download and extract a
release, or git clone https://github.com/hathitrust/ht_sip_validator
for the latest
development version. Then:
cd ht_sip_validator
bundle install
Run the validator by providing a list of SIPs to validate.
Windows (installed version):
C:\Program Files (x86)\HathiTrust SIP Validator\validate_sip C:\path\to\sip.zip C:\path\to\another_sip.zip
Because of limitations in Windows, you currently must provide the full path to the SIPs to validate (as well the configuration file, if one is provided; see below)
Windows (standalone exe)
validate_sip sip.zip sip2.zip
The standalone executable is slower to start up, but does not require installation and does not require specifying the complete path to each SIP.
Mac OS X and Linux:
bundle exec ruby bin/validate_sip /path/to/sip.zip /path/to/another_sip.zip
By default, validate_sip
will list all errors and warnings in the given SIP and output a summary with the number of warnings or errors. Errors will cause a SIP to fail ingest into HathiTrust and must be fixed before submission. Warnings point out things that could affect the display of material in HathiTrust, but will not prevent a SIP from being ingested.
Example output:
bundle exec ruby bin/validate_sip spec/fixtures/sips/bad_ocr.zip
bad_ocr.zip - WARN: MetaYml::PageOrder - Neither scanning_order or reading_order provided; they will default to left-to-right
bad_ocr.zip - WARN: MetaYml::PageData::Presence - 'pagedata' is not present in meta.yml; users will not have page tags or page numbers to navigate through this book.
bad_ocr.zip - WARN: OCR::CoordinatePresence - plain-text OCR file 00000001.txt has no corresponding coordinate OCR 00000001.{xml,html}
bad_ocr.zip - WARN: OCR::CoordinatePresence - plain-text OCR file 00000002.txt has no corresponding coordinate OCR 00000002.{xml,html}
bad_ocr.zip - ERROR: OCR::ControlChars - File 00000001.txt contains disallowed control characters
bad_ocr.zip - ERROR: OCR::UTF8 - File 00000002.txt is not valid UTF-8: invalid byte "\xC9" found.
bad_ocr.zip - Failure: 2 error(s), 4 warning(s)
Usage: validate_sip [options] sip1 sip2 ...
-c, --config=CONFIGPATH Path to the configuration.
-v, --verbose Show verbose output; overrides --quiet
-q, --quiet Show errors only (no warnings)
-h, --help Show this message
All checks are enabled by default. You might want to turn off some validators
that only produce warnings. You can do that for individual validators by
removing them from the configuration, or turn off warnings globally by using
the -q
option.
Particular validators you might want to disable:
-
MetaYml::PageData::Presence
- if you are not producing page tag / page number data -
OCR::CoordinatePresence
- if you are not producing coordinate OCR (e.g. ALTO, hOCR, etc in a .html or .xml file) -
MetaYml::PageOrder
- if all your material is scanned left-to-right and reads left-to-right
You can either edit config/default.yml
directly or make a copy for your
particular use case or set of content. For the installed windows version, you
can find the config at C:\Program Files (x86)\HathiTrust SIP Validator\src\config\default.yml
. For the standalone Windows executable,
download the default
config
and change it as you see fit.
- Does not yet validate any image technical characteristics
- Output is not very configurable
meta.yml generator by Ruth Tillman at the University of Notre Dame assists in generating SIPs which can then be validated with this tool.
We welcome pull requests as well as feedback to ingest@hathitrust.org
.