Bulk image extraction/conversion #20

simongray · 2021-09-10T07:23:54Z

Extract images from PDFs
~~Convert TIFFs to a better format~~
- Probably imagemagick

simongray · 2021-10-05T12:29:30Z

TIFF conversion

When it comes to conversion from TIFF, ImageMagick is definitely the way to go.

The TIFF scans are currently 25-30 MB in size. They can easily be converted into a different format using mogrify, which is reasonably fast, at least for JPEG:

mogrify -format jpg *.tif

The JPEG output will by default be around 1/10 the size of the TIFF files, e.g. 2.5 MB.
If PNG is used as the output format, the size is around half, e.g. 13 MB.

Quality

Lowering the JPEG quality setting reduces the output size significantly compared to the default, while losing very little visual clarity.

A quality of 85 produces images that are about half the size of the default output (e.g. 1.3 MB), while 80 produces even smaller files of around 0.9 to 1.1 MB, and the artefacts are really hard to spot. A quality of 70 produces files of around 800-900 KB, still without really visible JPEG artefacts.

With quality all the way down at 50, the artefacts do start to become noticeable in certain places when zooming in. At this level, the output file size is around 500-650 KB.

I will need to sample some more output files, but for now I'm going with 70 as the preferred output quality:

mogrify -format jpg -quality 70 *.tif
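To sample more files at the quality levels discussed above, something like the following could be used. It is only a sketch: quality_samples and DRY_RUN are names made up for this example, and it assumes ImageMagick's convert is on the PATH.

```shell
#!/bin/sh
# Sketch: write one TIFF out at several JPEG quality levels and print the
# resulting file sizes, so the levels can be compared side by side.
# quality_samples and DRY_RUN are hypothetical names, not part of ImageMagick.

quality_samples() {
    input=$1
    shift
    for q in "$@"; do
        # scan.tif at quality 70 becomes scan-q70.jpg
        output="${input%.tif}-q${q}.jpg"
        if [ "${DRY_RUN:-0}" = 1 ]; then
            # Only show what would be run.
            echo "convert $input -quality $q $output"
        else
            convert "$input" -quality "$q" "$output"
            # wc -c reports the size in bytes.
            echo "quality $q: $(wc -c < "$output") bytes"
        fi
    done
}

# Example: quality_samples scan.tif 85 80 70 50
```

Unlike mogrify, this leaves one output file per quality level next to the input, so the samples can be opened and compared directly.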

Dimensions

The pixel dimensions of the images are not being adjusted, but I am also not sure whether the dimensions of the input files will always be uniform. It doesn't matter much, since modern browsers scale images fine.
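Whether the inputs are uniform could be checked with a quick tally of the pixel dimensions. A minimal sketch, assuming ImageMagick's identify is installed; list_dimensions and summarise_dimensions are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: count how many TIFFs share each pixel size.
# list_dimensions and summarise_dimensions are hypothetical helper names.

# Read one "WIDTHxHEIGHT" entry per line on stdin and print each distinct
# value with its count, most common first.
summarise_dimensions() {
    sort | uniq -c | sort -rn
}

# Print "WIDTHxHEIGHT" for every *.tif below the given directory.
# [0] selects the first image in case a TIFF holds several pages.
list_dimensions() {
    find "$1" -type f -name '*.tif' -exec sh -c '
        for f in "$@"; do
            identify -format "%wx%h\n" "$f[0]"
        done' sh {} +
}

# Example: list_dimensions . | summarise_dimensions
```

A single line of output would mean that all scans have the same dimensions.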

simongray · 2022-01-21T09:54:36Z

The command can be performed recursively into subdirectories by running

find . -name '*.tif' -exec mogrify -format jpg -quality 70 {} +

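This works fine, but mogrify doesn't allow us to set the output filename (we want to append .jpg, not replace .tif with .jpg), so we also need to run

find . -name '*.jpg' -exec rename -n 's/(?<!\.tif)\.jpg$/.tif.jpg/' {} +

This renames every *.jpg file to *.tif.jpg. It is idempotent, meaning that files already ending in .tif.jpg are left alone. The dots are escaped and the pattern is anchored with $ so that only the actual file extension can match. Note that it has to run before any PDFs are converted, since it would otherwise also rename *.pdf.jpg files.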
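Where the Perl rename utility is not available, the same step can be approximated with find and mv alone. This is a sketch rather than a drop-in replacement; append_tif_suffix is a name made up for this example.

```shell
#!/bin/sh
# Sketch: turn NAME.jpg into NAME.tif.jpg without the Perl rename utility.
# append_tif_suffix is a hypothetical name chosen for this example.

append_tif_suffix() {
    # Files already ending in .tif.jpg are skipped, which keeps the step
    # idempotent. Files ending in .pdf.jpg are skipped too, so that JPEGs
    # produced from PDFs are left alone.
    find "$1" -type f -name '*.jpg' ! -name '*.tif.jpg' ! -name '*.pdf.jpg' \
        -exec sh -c '
        for f in "$@"; do
            # Strip the trailing .jpg, then add .tif.jpg.
            mv -- "$f" "${f%.jpg}.tif.jpg"
        done' sh {} +
}

# Example: append_tif_suffix .
```

Running it twice in a row leaves the files from the first run untouched.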

simongray · 2022-01-21T13:12:08Z

PDF extraction

I installed pdfimages through brew install poppler. I haven't run it recursively, just on a single file at a time. However, it seems that all of the PDF input files are made up of 3 layers, resulting in 3 extracted images per PDF. The first two are essentially background details, while the third has the text details in inverted colours.

# single PDF example, results in 3 files
pdfimages -l 1 -j -png /Users/rqf595/Desktop/Data-FINAL/Udgivelser-mangler/Faksimile\,pdf/TCLC_5_0.pdf TCLC_5_0

Now, I either need to

1. spend time figuring out how to merge the resulting files,
2. only use the third layer, inverting it using ImageMagick: convert /Users/rqf595/Desktop/Data-FINAL/TCLC_5_0-002.png -negate TCLC_5_0-002-negated.png (convert needs an output file; the name used here is just an example), or
3. give up and convert rather than extract.

For now I find it most likely that I will simply do 3.
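For reference, option 2 could look roughly like the sketch below. It assumes that the text layer is always the third extracted image, as it was for TCLC_5_0.pdf, which has not been verified for the other files. extract_text_layer, run and DRY_RUN are hypothetical names, and the -text.png output suffix is just an example.

```shell
#!/bin/sh
# Sketch of option 2: extract the images of page 1 and invert the third one.
# extract_text_layer, run and DRY_RUN are hypothetical names. Assumes
# pdfimages (poppler) and convert (ImageMagick) are installed.

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

extract_text_layer() {
    pdf=$1
    root=${pdf%.pdf}
    # -f 1 -l 1 limits extraction to page 1; -png writes the images as
    # ROOT-000.png, ROOT-001.png, ROOT-002.png, ...
    run pdfimages -f 1 -l 1 -png "$pdf" "$root"
    # Assumption: the third image (index 002) is the inverted text layer.
    run convert "$root-002.png" -negate "$root-text.png"
}

# Example: DRY_RUN=1 extract_text_layer TCLC_5_0.pdf
```

The result would still lack the background layers, so it is not equivalent to converting the whole page.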

PDF conversion

This also uses the poppler package, although with a different command (pdftoppm). Converting a single file:

pdftoppm /Users/rqf595/Desktop/Data-FINAL/Udgivelser-mangler/Faksimile\,pdf/TCLC_5_0.pdf john -jpeg -rx 300 -ry 300 -jpegopt quality=70 -f 1 -singlefile

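-rx and -ry set the horizontal and vertical DPI, while -jpegopt quality=70 sets the JPEG quality to 70. -f 1 -singlefile converts only the first page and writes it without a page number suffix.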

To do it recursively for every single PDF file, run

find . -name '*.pdf' -exec pdftoppm {} {} -jpeg -rx 300 -ry 300 -jpegopt quality=70 -f 1 -singlefile \;

simongray · 2022-01-21T15:13:36Z

Running the 3 commands below in the root directory converts every PDF file (first page only) and every TIFF file into a file with the same filename with .jpg appended.

find . -name '*.tif' -exec mogrify -format jpg -quality 70 {} +
find . -name '*.jpg' -exec rename -n 's/(?<!\.tif)\.jpg$/.tif.jpg/' {} +
find . -name '*.pdf' -exec pdftoppm {} {} -jpeg -rx 300 -ry 300 -jpegopt quality=70 -f 1 -singlefile \;

NOTE: in the rename command above the dry-run flag, -n, should be removed when doing the actual renaming.

Once the files have been converted, the originals may be deleted using

find . -name "*.tif" -type f -delete

However, doing this is probably risky for the PDFs, since some of them are supposedly whole books, so that part probably needs to be done manually to some extent.
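Both concerns can be reduced with a couple of guards. The following is a minimal sketch with made-up helper names: it deletes a TIFF only when its converted JPEG is actually present, and it lists the PDFs with more than one page so that those can be handled manually. The second part assumes pdfinfo from poppler is installed.

```shell
#!/bin/sh
# Sketch: guard the clean-up step. All function names are hypothetical.

# Delete a TIFF only if its converted counterpart (same name plus .jpg)
# exists and is not empty.
delete_converted_tiffs() {
    find "$1" -type f -name '*.tif' -exec sh -c '
        for f in "$@"; do
            if [ -s "$f.jpg" ]; then
                rm -- "$f"
            else
                echo "keeping $f (no JPEG found)" >&2
            fi
        done' sh {} +
}

# Read the output of pdfinfo on stdin and print the page count.
page_count() {
    awk '/^Pages:/ { print $2 }'
}

# Print every PDF with more than one page, i.e. those where only the
# first page has been converted.
list_multipage_pdfs() {
    find "$1" -type f -name '*.pdf' | while IFS= read -r pdf; do
        pages=$(pdfinfo "$pdf" 2>/dev/null < /dev/null | page_count)
        if [ "${pages:-0}" -gt 1 ]; then
            echo "$pages pages: $pdf"
        fi
    done
}

# Example: delete_converted_tiffs . && list_multipage_pdfs .
```

The -s test also rejects empty JPEGs, which guards against deleting an original after a failed conversion.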

simongray · 2022-01-21T15:39:06Z

Closing this for now, although the three commands above will eventually have to be made part of some kind of preprocessing step on the production server too.
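One possible shape for such a preprocessing step is sketched below. It is not what runs on the server: preprocess_images, run and DRY_RUN are hypothetical names, and it assumes ImageMagick and poppler are installed. It writes straight to the appended filename, which makes the rename step unnecessary, and it skips inputs that have already been converted so it can be re-run safely.

```shell
#!/bin/sh
# Sketch of a single, re-runnable preprocessing step.
# preprocess_images, run and DRY_RUN are hypothetical names.

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@" < /dev/null
    fi
}

preprocess_images() {
    find "$1" -type f \( -name '*.tif' -o -name '*.pdf' \) | sort |
    while IFS= read -r f; do
        # Skip inputs that already have a non-empty JPEG next to them.
        if [ -s "$f.jpg" ]; then
            continue
        fi
        case $f in
            *.tif)
                # Writes FILE.tif.jpg directly at quality 70.
                run convert "$f" -quality 70 "$f.jpg" ;;
            *.pdf)
                # First page only, at 300 DPI, written to FILE.pdf.jpg.
                run pdftoppm -jpeg -rx 300 -ry 300 -jpegopt quality=70 \
                    -f 1 -singlefile "$f" "$f" ;;
        esac
    done
}

# Example: DRY_RUN=1 preprocess_images .
```

With DRY_RUN=1 it only prints the commands, in the same spirit as the -n flag of rename.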

- bootstrap db on service creation - fetch JPG versions of both PDF and TIFF files, see #20

simongray closed this as completed Jan 21, 2022

simongray added a commit that referenced this issue Jan 21, 2022

use Asami to provide files #18

655501f

