Bulk image extraction/conversion #20

Closed
2 tasks done
simongray opened this issue Sep 10, 2021 · 5 comments

simongray commented Sep 10, 2021

simongray commented Oct 5, 2021

TIFF conversion

When it comes to conversion from TIFF, ImageMagick is definitely the way to go.

The TIFF scans are currently 25-30 MB each. They can easily be converted into a different format using mogrify, which is reasonably fast, at least for JPEG. With -format, mogrify writes new files alongside the originals rather than overwriting them:

mogrify -format jpg *.tif
  • The JPEG output will by default be around 1/10 the size of the TIFF files, e.g. 2.5 MB.
  • If PNG is used as the output format, the size is around half, e.g. 13 MB.

Quality

Lowering the quality setting of the JPEG conversion reduces the file size significantly compared with the default output, while losing very little visual clarity.

Using a quality of 85 will produce images that are about half the size of the default output (e.g. 1.3 MB), while 80 produces even smaller files at around 0.9 to 1.1 MB, and the artefacts are really hard to spot. Setting quality to 70 produces files of roughly 800-900 KB without really visible JPEG artefacts.

With quality all the way down at 50, the artefacts do start to become noticeable in certain places when zooming in. At this level, the output file size is roughly 500-650 KB.

I will need to sample some more output files, but for now I'm going with 70 as the preferred output quality:

mogrify -format jpg -quality 70 *.tif
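
To sample more output files at several quality levels, something like the following could help. It is only a sketch: the function name sample_qualities and the IM_CONVERT variable are made up for illustration, and it assumes ImageMagick's convert is on the PATH (ImageMagick 7 installs call it magick). Unlike the mogrify call above, it writes one JPEG per quality level next to the source file, so sizes and artefacts can be compared side by side.

```shell
# Hypothetical override: set to "magick" on ImageMagick 7,
# or to "echo" to print the commands instead of running them.
IM_CONVERT="${IM_CONVERT:-convert}"

# Write one JPEG per quality level next to the source TIFF,
# e.g. scan.tif -> scan.tif.q85.jpg, scan.tif.q80.jpg, ...
sample_qualities() {
  src="$1"
  for q in 85 80 70 50; do
    "$IM_CONVERT" "$src" -quality "$q" "$src.q$q.jpg"
  done
}
```

Running `sample_qualities scan.tif` and then `ls -l scan.tif.q*.jpg` lists the resulting sizes; scan.tif is a placeholder filename.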

Dimensions

The dimensions of the images are not being adjusted, and I am not sure the input images will always be uniformly sized anyway. It doesn't matter much, since modern browsers scale images fine.

simongray commented Jan 21, 2022

The command can be run recursively across subdirectories with

find . -name '*.tif' -exec mogrify -format jpg -quality 70 {} +

This works fine, but mogrify doesn't allow us to set the output filename (we want to append .jpg, not replace .tif with .jpg), so we also need to run

find . -name '*.jpg' -exec rename -n 's/(?<!\.tif)\.jpg$/.tif.jpg/' {} +

This renames every *.jpg file to *.tif.jpg. The rename is idempotent: the negative lookbehind skips files already ending in .tif.jpg. The dots are escaped and the pattern is anchored with $ so that only the file extension itself can match, not a directory name or another part of the path.
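
If the Perl-based rename tool is not available, the same idempotent renaming can be sketched in plain POSIX shell. This is an illustration under assumptions, not a drop-in replacement: the function names tif_jpg_name and rename_tif_jpgs and the DRY_RUN variable are hypothetical, and filenames containing newlines are not handled.

```shell
# Map a JPEG path to its target name: foo.jpg -> foo.tif.jpg.
# Paths already ending in .tif.jpg are returned unchanged (idempotent).
tif_jpg_name() {
  case "$1" in
    *.tif.jpg) printf '%s\n' "$1" ;;
    *.jpg)     printf '%s\n' "${1%.jpg}.tif.jpg" ;;
    *)         printf '%s\n' "$1" ;;
  esac
}

# Rename every matching file below a directory (default: current directory).
# Set DRY_RUN=1 to print the planned renames instead of performing them.
rename_tif_jpgs() {
  find "${1:-.}" -type f -name '*.jpg' ! -name '*.tif.jpg' |
  while IFS= read -r f; do
    target="$(tif_jpg_name "$f")"
    if [ "${DRY_RUN:-0}" = 1 ]; then
      printf '%s -> %s\n' "$f" "$target"
    else
      mv -- "$f" "$target"
    fi
  done
}
```

`DRY_RUN=1 rename_tif_jpgs .` plays the same role as the -n flag of rename: it only prints what would be renamed.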

simongray commented Jan 21, 2022

PDF extraction

I installed pdfimages through brew install poppler. I haven't run it recursively, only on a single file at a time. However, it seems that every PDF input file is made up of 3 different layers, resulting in 3 extracted images. The first two are essentially background details, while the third has the text details in inverted colours.

# single PDF example, results in 3 files
pdfimages -l 1 -j -png /Users/rqf595/Desktop/Data-FINAL/Udgivelser-mangler/Faksimile\,pdf/TCLC_5_0.pdf TCLC_5_0

Now, I either need to

  1. spend time figuring out how to merge the resulting files
  2. only use the third layer, inverting it in place using ImageMagick: mogrify -negate /Users/rqf595/Desktop/Data-FINAL/TCLC_5_0-002.png
  3. give up and convert rather than extract.

For now I find it most likely that I will simply do 3.

PDF conversion

This also uses the poppler package, although with a different command, pdftoppm. To convert a single file:

pdftoppm /Users/rqf595/Desktop/Data-FINAL/Udgivelser-mangler/Faksimile\,pdf/TCLC_5_0.pdf john -jpeg -rx 300 -ry 300 -jpegopt quality=70 -f 1 -singlefile

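-rx and -ry set the horizontal and vertical resolution in DPI, while -jpegopt quality=70 sets the JPEG quality to 70. -f 1 -singlefile converts only the first page and writes it without a page-number suffix.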

To do it recursively for every single PDF file, run

find . -name '*.pdf' -exec pdftoppm {} {} -jpeg -rx 300 -ry 300 -jpegopt quality=70 -f 1 -singlefile \;

simongray commented Jan 21, 2022

Running the 3 commands below in the root directory converts every PDF file and every TIFF file into a file with the same filename with .jpg appended.

find . -name '*.tif' -exec mogrify -format jpg -quality 70 {} +
find . -name '*.jpg' -exec rename -n 's/(?<!\.tif)\.jpg$/.tif.jpg/' {} +
find . -name '*.pdf' -exec pdftoppm {} {} -jpeg -rx 300 -ry 300 -jpegopt quality=70 -f 1 -singlefile \;

NOTE: in the rename command above the dry-run flag, -n, should be removed when doing the actual renaming.

Once the files have been converted, the originals may be deleted using

find . -name "*.tif" -type f -delete
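
A more cautious variant would delete a TIFF only once its converted counterpart exists. The following is a sketch under that assumption: the function name delete_converted_tifs is hypothetical, it expects the *.tif.jpg naming described above, and it only checks that the JPEG is non-empty, not that it is a valid image.

```shell
# Delete each .tif only if a non-empty <name>.tif.jpg sits next to it;
# report the ones that are kept on stderr.
delete_converted_tifs() {
  find "${1:-.}" -type f -name '*.tif' |
  while IFS= read -r f; do
    if [ -s "$f.jpg" ]; then
      rm -- "$f"
    else
      printf 'keeping %s (no %s)\n' "$f" "$f.jpg" >&2
    fi
  done
}
```

Calling `delete_converted_tifs .` from the root directory then leaves any unconverted scans in place.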

However, deleting the PDFs this way is risky, since some of them are supposedly whole books and only the first page of each is converted, so that part probably needs to be done manually to some extent.
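
To find out which PDFs need that manual attention, the page counts can be listed with pdfinfo, which ships with poppler alongside pdftoppm. This is a rough sketch: the function names pdf_page_count and list_multipage_pdfs are made up for illustration, and it relies on the Pages: line that pdfinfo prints.

```shell
# Print the number of pages of a PDF, taken from the "Pages:" line of pdfinfo.
pdf_page_count() {
  pdfinfo "$1" 2>/dev/null | awk '/^Pages:/ { print $2 }'
}

# List every PDF below a directory (default: current directory) that has
# more than one page, as "<pages><TAB><path>".
list_multipage_pdfs() {
  find "${1:-.}" -type f -name '*.pdf' |
  while IFS= read -r f; do
    pages="$(pdf_page_count "$f")"
    if [ "${pages:-0}" -gt 1 ]; then
      printf '%s\t%s\n' "$pages" "$f"
    fi
  done
}
```

Anything printed by `list_multipage_pdfs .` has pages beyond the first that the conversion command does not cover.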

@simongray

Closing this for now, although the three commands above will eventually have to be made part of some kind of preprocessing step on the production server too.

simongray added a commit that referenced this issue Jan 21, 2022
- bootstrap db on service creation
- fetch JPG versions of both PDF and TIFF files, see #20