Java command line tool to create PDF files from image derivates with configurable scales and qualities.
Optionally appends an image footer, uses structured OCR formats (ALTO, PAGE) to produce text layers, and uses metadata (METS/MODS) to reflect the logical order.
Uses mets-model for METS/MODS-handling, iText-java to create PDF, Apache log4j2 for logging and a workflow inspired by OCR-D/Core Workflows.
Creates PDF files from scaled image data (optionally with a footer), subject to constraints on compression rate and maximal size.
For details see configuration section.
If metadata (METS/MODS) is available, the following will be taken into account:
- Value mods:recordIdentifier[@source] to name the PDF artefact
- Value mods:titleInfo/mods:title for internal naming
- Attribute mets:div[@ORDER] for file containers as defined in the METS physical structMap to create a PDF outline
- Attribute mets:div[@CONTENTIDS] (granular URN) will be rendered for each page if a footer shall be appended to each page image
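To illustrate how such values can be addressed, the following is a minimal sketch that evaluates the two MODS expressions from the list above with the XPath API shipped with the JDK. It is not how Derivans reads metadata internally (the project uses mets-model for that); the class name `ModsLookupSketch`, the helper `evaluate` and the sample XML are assumptions made up for this example.

```java
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Derivans.
class ModsLookupSketch {

    static final String MODS_NS = "http://www.loc.gov/mods/v3";

    /** Evaluates an XPath using the "mods" prefix; returns "" if nothing matches. */
    static String evaluate(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for prefixed XPath steps
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Map only the "mods" prefix; everything else stays unbound.
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "mods".equals(prefix) ? MODS_NS : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) { return null; }
                public Iterator<String> getPrefixes(String uri) { return null; }
            });
            return xpath.evaluate(expression, doc).trim();
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample record, only for demonstration.
        String xml = "<mods:mods xmlns:mods=\"" + MODS_NS + "\">"
                + "<mods:recordInfo><mods:recordIdentifier source=\"example\">12345"
                + "</mods:recordIdentifier></mods:recordInfo>"
                + "<mods:titleInfo><mods:title>Sample title</mods:title></mods:titleInfo>"
                + "</mods:mods>";
        System.out.println(evaluate(xml, "//mods:recordIdentifier[@source]")); // 12345
        System.out.println(evaluate(xml, "//mods:titleInfo/mods:title"));      // Sample title
    }
}
```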
Pull the Docker image:
docker pull ghcr.io/ulb-sachsen-anhalt/digital-derivans:latest
or build it yourself locally:
./scripts/build_docker_image.sh
The Docker image is used as described in the Usage section, but all required directories and files need to be passed as mapped volumes.
For example:
docker run \
--mount type=bind,source=<host-work-dir>,target=/data-print \
--mount type=bind,source=<host-config-dir>,target=/data-config \
--mount type=bind,source=<host-log-dir>,target=/data-log \
ghcr.io/ulb-sachsen-anhalt/digital-derivans \
<print-dir|mets-file> -c /data-config/derivans.ini
Digital Derivans is a Java 11+ project built with Apache Maven.
- OpenJDK 11+
- Maven 3.6+
- git 2.12+
Clone the repository and call Maven to trigger the build process. Be aware that a recent OpenJDK is required.
git clone git@github.com:ulb-sachsen-anhalt/digital-derivans.git
cd digital-derivans
mvn clean package
This will finally create a shaded JAR ("FAT-JAR") inside the build directory (./target/digital-derivans-<version>.jar).
In local mode, a recent OpenJRE is required.
The tool expects a project folder containing an image directory (default: MAX) and an optional OCR-data directory (default: FULLTEXT).
The default name of the generated PDF is derived from the object's folder name or can be set with the -n argument.
A sample folder structure:
my_print/
├── FULLTEXT
│ ├── 0002.xml
│ ├── 0021.xml
│ ├── 0332.xml
├── MAX
│ ├── 0002.tif
│ ├── 0021.tif
│ ├── 0332.tif
Running
java -jar <PATH>./target/digital-derivans-<version>.jar <path-to-my_print>
will produce a file named my_print.pdf in the my_print directory from above with the specified layout.
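The naming rule can be sketched in a few lines of Java. This is only an illustration of deriving `<folder name>.pdf` from an input path, not the actual Derivans code; the class `PdfNameSketch` and the method `defaultPdfName` are hypothetical. It also shows why a relative path such as `.` needs to be resolved before the folder name is read, which may be related to the `..pdf` result listed among the limitations.

```java
import java.nio.file.Path;

// Hypothetical helper, not part of Derivans.
class PdfNameSketch {

    /** Derives "<folder name>.pdf" from the input directory, resolving "." first. */
    static String defaultPdfName(Path inputDir) {
        // The file name of the path "." is just ".", so make it absolute and normalized first.
        Path resolved = inputDir.toAbsolutePath().normalize();
        Path folder = resolved.getFileName();
        if (folder == null) { // a file system root has no name element
            throw new IllegalArgumentException("Cannot derive a name from " + inputDir);
        }
        return folder + ".pdf";
    }

    public static void main(String[] args) {
        System.out.println(defaultPdfName(Path.of("my_print")));  // my_print.pdf
        // Without resolving, the current directory yields a bare dot as name:
        System.out.println(Path.of(".").getFileName() + ".pdf");  // ..pdf
    }
}
```

Passing an explicit name with -n avoids relying on the folder name altogether.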
For more information concerning CLI-Usage, please consult CLI docs.
Although Derivans can be run without configuration, providing one is strongly recommended. Many flags, especially if metadata must be taken into account, use defaults tied to the digitization workflows of ULB Sachsen-Anhalt that might not fit your custom requirements.
Configuration options can be bundled into sections and customized with an INI-file.
Some params can be set on global level, like quality and poolsize.
Each section in a *.ini file matching [derivate_<n>] represents a single derivate section for intermediate or final derivates.
Order of execution is determined by pairs of input-output paths, whereas numbering of derivate sections determines order at parse-time.
At the top of the INI-file, configuration values are listed which will be used as defaults for the actual steps, if they can be applied.
- default_quality: image data compression rate (can be specified with quality for image derivate sections)
- default_poolsize: poolsize of worker threads for parallel processing (can be specified with poolsize for image derivate sections)
Some option values must be set individually for each step:
- input_dir: path to directory with section input files
- output_dir: path to directory for section output
Additional options can be set, according to the actual type to derive:
Images:
- quality: compression rate
- poolsize: parallel workers
- maximal: maximal dimension (affects both width and height)
- footer_template: footer template path
- footer_label_copyright: additional (static) label for footer
PDF:
- metadata_creator: enrich creator tag
- metadata_keywords: enrich keywords
- enrich_pdf_metadata: if PDF shall be enriched into METS/MODS (default: True)
- mods_identifier_xpath: if not set, use mods:recordIdentifier from primary MODS
- mets_filegroup_fulltext: METS-filegroup for OCR-Data (default: FULLTEXT)
- mets_filegroup_images: METS-filegroup for image data (default: MAX)
The following example configuration contains global settings and subsequent generation steps.
(Example directory and file layout like from Usage section assumed.)
On global level, it sets the default JPEG-quality to 75, the number of parallel executors to 4 (recommended if at least 4 CPUs are available) and determines the file for the logging-configuration.
- Create JPEG images from images in sub directory MAX with compression rate 75, scale to maximal dimension 1000px and store in sub dir IMAGE_75.
- Create PDF with images from IMAGE_75, add some PDF metadata and store file as my_print.pdf in current dir.
default_quality = 75
default_poolsize = 4
logger_configuration_file = derivans_logging.xml
[derivate_01]
input_dir = MAX
output_dir = IMAGE_75
maximal = 1000
[derivate_02]
type = pdf
input_dir = IMAGE_75
output_dir = .
output_type = pdf
metadata_creator = "<your organization label>"
metadata_license = "Public Domain Mark 1.0"
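How the section numbering plays out can be sketched with a small, self-contained parser. This is not the parser Derivans uses; the class `DerivateSectionsSketch` and its methods are hypothetical, and the sketch assumes that numbering starts at 1 and that only `[derivate_<n>]` headers matter. It separates the global defaults from the numbered sections and stops at the first numbering gap, in line with the limitation that sections after a gap are ignored.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Derivans.
class DerivateSectionsSketch {

    private static final Pattern SECTION = Pattern.compile("\\[derivate_(\\d+)\\]");

    /** Parses INI text; key 0 holds global options, key n the options of [derivate_<n>]. */
    static Map<Integer, Map<String, String>> parse(String ini) {
        Map<Integer, Map<String, String>> sections = new LinkedHashMap<>();
        int current = 0; // 0 = global level, before any section header
        sections.put(current, new LinkedHashMap<>());
        for (String raw : ini.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#") || line.startsWith(";")) {
                continue; // skip blank lines and comments
            }
            Matcher m = SECTION.matcher(line);
            if (m.matches()) {
                current = Integer.parseInt(m.group(1)); // "01" becomes 1
                sections.put(current, new LinkedHashMap<>());
            } else if (line.contains("=")) {
                String[] kv = line.split("=", 2);
                sections.get(current).put(kv[0].trim(), kv[1].trim());
            }
        }
        return sections;
    }

    /** Returns the step numbers 1, 2, ... found before the first numbering gap (assumed start: 1). */
    static List<Integer> reachableSteps(Map<Integer, Map<String, String>> sections) {
        List<Integer> steps = new ArrayList<>();
        for (int n = 1; sections.containsKey(n); n++) {
            steps.add(n);
        }
        return steps;
    }

    public static void main(String[] args) {
        String ini = "default_quality = 75\n[derivate_01]\ninput_dir = MAX\n"
                + "[derivate_03]\ntype = pdf\n"; // derivate_02 is missing on purpose
        Map<Integer, Map<String, String>> sections = parse(ini);
        System.out.println(sections.get(0));          // {default_quality=75}
        System.out.println(reachableSteps(sections)); // [1]
    }
}
```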
The main parameter for Derivans is the input path, which may be a local directory in local mode or the path to a local METS/MODS-file with sub directories for images and OCR-data, if using metadata.
Additionally, one can also provide via CLI:
- -c path to custom configuration INI-file
- -d flag to turn on rendering of boxes and text if using OCR input
- -n set custom name for resulting PDF
- set labels for OCR and input-image (will overwrite configuration)
If metadata is present, both will be used as filegroup names; for images they will also be used as input directory for initial image processing.
Derivans depends on standard JDK11-components and external components for image processing and PDF generation.
- Subsequent derivate steps must not have order gaps, since the parsing is done step by step. Otherwise, any derivate section after the first gap will be ignored, which may lead to unexpected results.
Please note:
To overcome javax.imageio errors, it's recommended to fix the affected images using an external image processing application.
- Images with more than 8bit channel depth can't be processed: javax.imageio.IIOException: Illegal band size
- Uncommon image metadata can't be processed: javax.imageio.IIOException: Unsupported marker
- Integral dimension values are required for proper scaling: javax.imageio.metadata.IIOInvalidTreeException: Xdensity attribute out of range
- If Derivans is called from within the project folder, the resulting PDF will be called ..pdf.
- The iText PDF-Library limits the maximal page dimension to 14400 px (width/height; a configured max dimension fails for very large images). This may cause trouble if one needs to generate PDF for very large prints like maps, deeds or scrolls.
- Derivans does not accept METS with current OCR-D-style nor any other METS which contains extended XML-features like inline namespace declarations.
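For very large prints, one possible workaround is to check image dimensions before PDF generation and choose a `maximal` value that keeps both sides within the limit. The following sketch shows the arithmetic only; it is not taken from Derivans, the class `PageDimensionSketch` and the method `fit` are hypothetical, and the limit of 14400 is the value stated above.

```java
// Hypothetical helper, not part of Derivans.
class PageDimensionSketch {

    static final int MAX_PAGE_DIMENSION = 14400; // limit as stated in this document

    /** Scales width/height proportionally so the longer side is at most maxDimension; never upscales. */
    static int[] fit(int width, int height, int maxDimension) {
        int longer = Math.max(width, height);
        if (longer <= maxDimension) {
            return new int[] {width, height}; // already within the limit
        }
        // Integer arithmetic in long avoids both overflow and floating point rounding.
        int w = (int) Math.max(1L, (long) width * maxDimension / longer);
        int h = (int) Math.max(1L, (long) height * maxDimension / longer);
        return new int[] {w, h};
    }

    public static void main(String[] args) {
        int[] scaled = fit(20000, 10000, MAX_PAGE_DIMENSION);
        System.out.println(scaled[0] + " x " + scaled[1]); // 14400 x 7200
    }
}
```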
This project's source code is licensed under the terms of the MIT license.
NOTE: This project depends on components that may use different license terms.