Bulk image extraction/conversion #20
TIFF conversion
When it comes to conversion from TIFF, ImageMagick is definitely the way to go. The TIFF scans are currently 25-30 MB in size. They can easily be converted into a different format using mogrify, and it is reasonably fast, at least for JPEG:
mogrify -format jpg *.tif
Quality
By lowering the quality of the JPEG conversion, the size of the JPEG output can be reduced significantly compared to the default, while very little visual clarity is lost. Using a quality of … Putting quality all the way down to … I will need to sample some more output files, but for now I'm going with
mogrify -format jpg -quality 70 *.tif
Dimensions
The dimensions of the images are not being adjusted, but I am also not sure if the input size will always be uniform anyway. It doesn't matter much, since modern browsers scale images fine.
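To sample output sizes at a few quality settings before settling on one, something like the following could be used. This is only a sketch: `quality_sizes` is a hypothetical helper name, and it assumes ImageMagick's `convert` command is on the PATH (newer installations may call it `magick` instead).

```shell
#!/bin/sh
# Hypothetical helper: print "quality<TAB>bytes" for one source image
# at each requested JPEG quality. Assumes ImageMagick's `convert` is installed.
quality_sizes() {
    src=$1
    shift
    tmp=$(mktemp -d) || return 1
    for q; do
        out="$tmp/q$q.jpg"
        # Write a JPEG at quality $q into the scratch directory.
        convert "$src" -quality "$q" "$out" || return 1
        # wc -c pads its output on some systems, so strip the spaces.
        printf '%s\t%s\n' "$q" "$(wc -c < "$out" | tr -d ' ')"
    done
    rm -rf "$tmp"
}
```

Calling `quality_sizes sample.tif 50 70 90` (with `sample.tif` standing in for one of the scans) prints one line per quality setting, which makes it easier to compare sizes across several files.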
The command can be performed recursively into subdirectories by running
find . -name '*.tif' -exec mogrify -format jpg -quality 70 {} +
This works fine, but doesn't allow us to set the output filename (we want to append with …
find . -name '*.jpg' -exec rename -n 's/(?<!\.tif)\.jpg$/.tif.jpg/' {} +
This will idempotently rename the filenames of every …
Note that -n makes rename only print what it would do; drop it to actually rename the files.
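Where the Perl `rename` tool is not available, the same idempotent rename can be sketched with `find` and plain `sh`. The function name `rename_converted` is made up for this example. Like the command above, it assumes that every `.jpg` in the tree came from a TIFF.

```shell
#!/bin/sh
# Hypothetical helper: rename every foo.jpg below directory $1 to foo.tif.jpg.
# Files already ending in .tif.jpg are excluded, so running it twice is harmless.
rename_converted() {
    find "$1" -type f -name '*.jpg' ! -name '*.tif.jpg' -exec sh -c '
        for f; do
            # Strip the trailing .jpg and append .tif.jpg instead.
            mv -- "$f" "${f%.jpg}.tif.jpg"
        done
    ' sh {} +
}
```

Running `rename_converted .` from the root directory would cover the same files as the `find ... rename` command, without the dry-run step.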
PDF extraction
I installed …
# single PDF example, results in 3 files
pdfimages -l 1 -j -png /Users/rqf595/Desktop/Data-FINAL/Udgivelser-mangler/Faksimile\,pdf/TCLC_5_0.pdf TCLC_5_0
Now, I either need to …
For now I find it most likely that I will simply do 3.
PDF conversion
Also using the poppler package, although with a different command. Converting a single file:
pdftoppm /Users/rqf595/Desktop/Data-FINAL/Udgivelser-mangler/Faksimile\,pdf/TCLC_5_0.pdf john -jpeg -rx 300 -ry 300 -jpegopt quality=70 -f 1 -singlefile
To do it recursively for every single PDF file, run
find . -name '*.pdf' -exec pdftoppm {} {} -jpeg -rx 300 -ry 300 -jpegopt quality=70 -f 1 -singlefile \;
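If this ends up being re-run as a preprocessing step, it may be worth skipping PDFs that already have an output file. A minimal sketch, assuming the same pdftoppm options as above; `convert_pdfs` is a hypothetical name, and the skip-if-exists check is an addition, not part of the original command. Because the PDF path is also passed as the output prefix, `-singlefile` writes the result next to the PDF as `<name>.pdf.jpg`.

```shell
#!/bin/sh
# Hypothetical helper: convert the first page of every PDF below directory $1
# to <name>.pdf.jpg, skipping PDFs whose JPEG already exists.
# Note: paths containing newlines are not handled by this read loop.
convert_pdfs() {
    find "$1" -type f -name '*.pdf' | while IFS= read -r pdf; do
        # Already converted on an earlier run: leave it alone.
        [ -e "$pdf.jpg" ] && continue
        pdftoppm -jpeg -rx 300 -ry 300 -jpegopt quality=70 -f 1 -singlefile "$pdf" "$pdf"
    done
}
```

`convert_pdfs .` would then behave like the `find ... pdftoppm` command on the first run and do nothing for already converted files on later runs.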
The final result from running the 3 commands in the root directory will be that every PDF file and every TIFF file has been converted into a file with the same filename appended with …
find . -name '*.tif' -exec mogrify -format jpg -quality 70 {} +
find . -name '*.jpg' -exec rename -n 's/(?<!\.tif)\.jpg$/.tif.jpg/' {} +
find . -name '*.pdf' -exec pdftoppm {} {} -jpeg -rx 300 -ry 300 -jpegopt quality=70 -f 1 -singlefile \;
Once the files have been converted, the originals may be deleted using
find . -name "*.tif" -type f -delete
However, doing this is probably risky for the PDFs, since some of them are supposedly whole books, so that part probably needs to be done manually to some extent.
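A more cautious variant of the TIFF cleanup would only delete an original once its converted counterpart is actually there. The sketch below assumes the converted file is named `<original>.jpg` (for example `scan.tif.jpg`, as produced by the rename step once it has really been run); `delete_converted_tifs` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: delete each .tif below directory $1 only if a
# non-empty <name>.tif.jpg exists next to it; report the ones that are kept.
delete_converted_tifs() {
    find "$1" -type f -name '*.tif' -exec sh -c '
        for f; do
            if [ -s "$f.jpg" ]; then
                rm -- "$f"
            else
                printf "kept (no JPEG found): %s\n" "$f" >&2
            fi
        done
    ' sh {} +
}
```

This does not address the PDF concern above, since only the first page of each PDF is converted; it just guards against deleting a TIFF whose conversion failed or was skipped.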
Closing this for now, although the three commands above will eventually have to be made part of some kind of preprocessing step on the production server too.
- bootstrap db on service creation - fetch JPG versions of both PDF and TIFF files, see #20
Convert TIFFs to a better format