
A tool that maps a transcription file onto an existing ALTO file of the same scan to create OCR-like output with the original transcription.


KBNLresearch/transcription-ocr-mapping


----- not a real readme yet... TODO

To create the ground truth do: -- python main.py txt_parser input/transcriptions/gort004mei_03_01.txt -o output/tr_output181.csv 181

  • or for page 180: -- python main.py txt_parser input/transcriptions/gort004mei_03_01.txt -o output/tr_output180.csv 180
  • with check_numeric on: -- python main.py txt_parser input/transcriptions/gort004mei_03_01.txt -o output/tr_output180.csv 180 -ch
  • with printing on: -- python main.py txt_parser input/transcriptions/gort004mei_03_01.txt -o output/tr_output180.csv 180 -p
  • with timing on (not yet implemented): -- python main.py txt_parser input/transcriptions/gort004mei_03_01.txt -o output/tr_output180.csv 180 -p -t

To create the ocr csv file:

  • For Transkribus ALTO output do: -- python main.py -c ocr_parser input/OCR/gort004mei_03_01_1-20_175-190/alto/ -o output/ocr_output.csv 180
  • For Tesseract ALTO output, where one image is used as input, do: -- python main.py ocr_file_parser input/OCR/out5_12_p180_jpg.xml -o output/ocr_output_file.csv
  • For Tesseract ALTO output, where input was a TIFF file with multiple pages, do: -- python main.py -c ocr_parser_page input/OCR/out9_12_p1-20_175-190_tiff.xml -o output/ocr_output_p180_tiff_9_12.csv 180
  • or: -- python main.py ocr_parser_page input/OCR/gort004mei_03_01_full_16_12.xml -o output/ocr_output_p180_tiff_9_12.csv 180

To perform the mapping on a range of pages from a document:

  • python main.py all_pages_mapping input/transcriptions/gort004mei_03_01.txt input/OCR/gort004mei_03_01_full_16_12.xml 180 180

  • with output:

  • python main.py all_pages_mapping -o final -dir output/matching_output -fn gort004mei_03_01 input/transcriptions/gort004mei_03_01.txt input/OCR/gort004mei_03_01_full_16_12.xml 180 180

  • general form: -- python main.py all_pages_mapping [TXT INPUT] [ALTO INPUT] [START PAGE] [END PAGE]

  • with output options, directly after 'all_pages_mapping':
    -- -o [output option]
    -- -dir [OUTPUT DIRECTORY]
    -- -fn [OUTPUT FILENAME]
    -- -sf

  • for the three above, the following options can be used:
    -- to change the page number in a specific range: -c
    -- to add a column with check_numeric booleans: -ch
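
The command layout above can be assembled programmatically, for example when running the mapping for several page ranges. The following is a minimal sketch, not part of the tool: the function name build_mapping_command and its parameter names are hypothetical, and only the -o, -dir and -fn options listed above are used.

```python
def build_mapping_command(txt_input, alto_input, start_page, end_page,
                          output_option=None, output_dir=None,
                          output_filename=None):
    """Build the argument list for an all_pages_mapping run.

    Hypothetical helper: it only mirrors the layout documented above,
    with the output options placed directly after 'all_pages_mapping'.
    """
    cmd = ["python", "main.py", "all_pages_mapping"]
    if output_option is not None:
        cmd += ["-o", output_option]
    if output_dir is not None:
        cmd += ["-dir", output_dir]
    if output_filename is not None:
        cmd += ["-fn", output_filename]
    # Positional arguments: [TXT INPUT] [ALTO INPUT] [START PAGE] [END PAGE]
    cmd += [txt_input, alto_input, str(start_page), str(end_page)]
    return cmd


if __name__ == "__main__":
    cmd = build_mapping_command(
        "input/transcriptions/gort004mei_03_01.txt",
        "input/OCR/gort004mei_03_01_full_16_12.xml",
        180, 180,
        output_option="final",
        output_dir="output/matching_output",
        output_filename="gort004mei_03_01",
    )
    # Print the command; it could also be passed to subprocess.run(cmd, check=True).
    print(" ".join(cmd))
```

Building the command as a list rather than a single string avoids quoting problems when paths contain spaces.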

To perform the matching with previously created csv files as input: -- python main.py csv_parser output/ocr_output.csv output/tr_output.csv

  • The following options can be used; type these directly after main.py:
    -- to specify sets of characters that should be considered the same character: --- -sise input/configs/basic_sim_sets.txt
    -- to ignore differences in capitalization when matching: --- -icap
    -- to make matching between two numeric 'words' stricter: --- -ch
    -- to allow n words from the transcription to be joined into one word in the ocr, or vice versa (larger splits/joins): --- -cnw 8
    -- number of matchings to execute: --- -ma [NUM MATCH] (def.=1)
    -- to also include "bad matches" (those with a distance above a given float between 0 and 1) in any further matchings: --- -inma 0.4
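
To illustrate how similarity sets, case-insensitive matching and a 0-1 distance threshold can interact, here is a minimal sketch. It is not the tool's implementation: it assumes a normalized Levenshtein distance, and the function names and the way similarity sets are passed (as strings of interchangeable characters) are hypothetical. The actual format of basic_sim_sets.txt is not described here.

```python
def build_canon(sim_sets):
    """Map every character in a similarity set to that set's first character."""
    canon = {}
    for group in sim_sets:
        for ch in group:
            canon[ch] = group[0]
    return canon


def normalise(word, canon, ignore_case):
    """Optionally lowercase, then replace similar characters by one representative."""
    if ignore_case:
        word = word.lower()
    return "".join(canon.get(ch, ch) for ch in word)


def levenshtein(a, b):
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def distance(tr_word, ocr_word, sim_sets=(), ignore_case=False):
    """Edit distance divided by the longer word's length, giving a value in 0-1."""
    canon = build_canon(sim_sets)
    a = normalise(tr_word, canon, ignore_case)
    b = normalise(ocr_word, canon, ignore_case)
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


if __name__ == "__main__":
    print(distance("vnde", "unde"))                   # differs in one character
    print(distance("vnde", "unde", sim_sets=["uv"]))  # u and v treated as the same
    print(distance("Huis", "huis", ignore_case=True))
```

Under this assumption, a pair whose distance exceeds the value given to -inma (0.4 in the example above) would count as a "bad match".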

For all:

  • printing with [-p [choice]]*. If nothing is entered, only 'updates' is turned on. Choices are:
    -- 'zen': to not print anything
    -- 'updates': to print updates on the execution of the program
    -- 'tr_wl': to print the transcription wordlist when created or retrieved
    -- 'ocr_wl': to print the ocr wordlist when created or retrieved
    -- 'interm_matching': to print results of all the intermediate matchings
    -- 'leftover_ind': to print indices of the transcription and ocr words that are not matched well enough yet
    -- 'final': to print the final matching result
    -- 'stats': to print some statistics about each matching
    -- 'all': to print all previously mentioned
  • timing with -t
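
The printing choices above can be read as a small set of switches. The sketch below shows one way such a -p option could be parsed with argparse; it is an illustration only, not the code in main.py. The helper resolve_print_choices is hypothetical, and it assumes that 'zen' overrides everything else and that 'all' expands to every other print category.

```python
import argparse

# Print categories taken from the list above ('zen' and 'all' are handled separately).
PRINT_CHOICES = ("updates", "tr_wl", "ocr_wl", "interm_matching",
                 "leftover_ind", "final", "stats")


def resolve_print_choices(values):
    """Turn the raw -p values into the set of enabled print categories."""
    if not values:
        # Nothing entered: only 'updates' is turned on.
        return {"updates"}
    unknown = set(values) - set(PRINT_CHOICES) - {"zen", "all"}
    if unknown:
        raise ValueError(f"unknown print choice(s): {sorted(unknown)}")
    if "zen" in values:
        return set()               # assumption: 'zen' silences everything
    if "all" in values:
        return set(PRINT_CHOICES)  # assumption: 'all' enables every category
    return set(values)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # nargs="*" accepts zero or more choices after -p
    parser.add_argument("-p", nargs="*", default=None)
    args = parser.parse_args(["-p", "final", "stats"])
    print(sorted(resolve_print_choices(args.p)))
```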

(old) To do the full process all in one: -- full_parser
