----- not a real readme yet... TODO
To create the ground truth do: -- python main.py txt_parser input/transcriptions/gort004mei_03_01.txt -o output/tr_output181.csv 181
- or for page 180: -- python main.py txt_parser input/transcriptions/gort004mei_03_01.txt -o output/tr_output180.csv 180
- with check_numeric on: -- python main.py txt_parser input/transcriptions/gort004mei_03_01.txt -o output/tr_output180.csv 180 -ch
- with printing on: -- python main.py txt_parser input/transcriptions/gort004mei_03_01.txt -o output/tr_output180.csv 180 -p
- with timing on (not yet implemented): -- python main.py txt_parser input/transcriptions/gort004mei_03_01.txt -o output/tr_output180.csv 180 -p -t
To create the ocr csv file:
- For Transkribus ALTO output do: -- python main.py -c ocr_parser input/OCR/gort004mei_03_01_1-20_175-190/alto/ -o output/ocr_output.csv 180
- For Tesseract ALTO output, where one image is used as input, do: -- python main.py ocr_file_parser input/OCR/out5_12_p180_jpg.xml -o output/ocr_output_file.csv
- For Tesseract ALTO output, where input was a TIFF file with multiple pages, do: -- python main.py -c ocr_parser_page input/OCR/out9_12_p1-20_175-190_tiff.xml -o output/ocr_output_p180_tiff_9_12.csv 180
- or: -- python main.py ocr_parser_page input/OCR/gort004mei_03_01_full_16_12.xml -o output/ocr_output_p180_tiff_9_12.csv 180
To perform the mapping on a range of pages from a document:
-
python main.py all_pages_mapping input/transcriptions/gort004mei_03_01.txt input/OCR/gort004mei_03_01_full_16_12.xml 180 180
-
with output:
-
python main.py all_pages_mapping -o final -dir output/matching_output -fn gort004mei_03_01 input/transcriptions/gort004mei_03_01.txt input/OCR/gort004mei_03_01_full_16_12.xml 180 180
general form: -- python main.py all_pages_mapping [TXT INPUT] [ALTO INPUT] [START PAGE] [END PAGE]
-
with output options, directly after 'all_pages_mapping':
-- -o [output option]
-- -dir [OUTPUT DIRECTORY]
-- -fn [OUTPUT FILENAME]
-- -sf
-
for the three above, the following options can be used:
-- to change the page number in a specific range: -c
-- to add a column with check_numeric booleans: -ch
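
The ordering rule above (output options go directly after 'all_pages_mapping', before the input files and page range) is easy to get wrong when typing long commands by hand. A minimal sketch of a helper that assembles the command; build_mapping_cmd is a hypothetical name and not part of main.py, and -sf is left out because its meaning is not described here:

```python
import shlex


def build_mapping_cmd(txt_input, alto_input, start_page, end_page,
                      output_option=None, out_dir=None, filename=None):
    """Assemble an all_pages_mapping call as an argv list (hypothetical helper)."""
    cmd = ["python", "main.py", "all_pages_mapping"]
    # Output options go directly after the subcommand.
    if output_option is not None:
        cmd += ["-o", output_option]
    if out_dir is not None:
        cmd += ["-dir", out_dir]
    if filename is not None:
        cmd += ["-fn", filename]
    # Positional arguments: TXT INPUT, ALTO INPUT, START PAGE, END PAGE.
    cmd += [txt_input, alto_input, str(start_page), str(end_page)]
    return cmd


if __name__ == "__main__":
    cmd = build_mapping_cmd(
        "input/transcriptions/gort004mei_03_01.txt",
        "input/OCR/gort004mei_03_01_full_16_12.xml",
        180, 180,
        output_option="final",
        out_dir="output/matching_output",
        filename="gort004mei_03_01",
    )
    # Print the command; pass cmd to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```

Returning a list rather than a string keeps paths with spaces intact if the command is later passed to subprocess.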
To perform the matching with previously created csv files as input: -- python main.py csv_parser output/ocr_output.csv output/tr_output.csv
- The following options can be used; type these directly after main.py:
-- to specify sets of characters that should be considered the same character: -sise input/configs/basic_sim_sets.txt
-- to ignore differences in capitalization when matching: -icap
-- to make matching between two numeric 'words' more restrictive: -ch
-- to allow n words from the transcription to be joined into one word in the ocr, or vice versa (larger splits/joins): -cnw 8
-- number of matchings to execute: -ma [NUM MATCH] (def.=1)
-- to also include "bad matches" in any further matchings, with a distance above a float from 0-1: -inma 0.4
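
To give a feel for how similarity sets, -icap and the 0-1 distance threshold of -inma could interact, here is a small illustration. It is not the code in main.py: the distance metric is assumed to be a length-normalized Levenshtein distance, the similarity sets are passed in directly (the format of basic_sim_sets.txt is not described here), and all function names are hypothetical.

```python
def normalize(word, sim_sets=(), ignore_caps=False):
    """Replace each character by one representative of its similarity set."""
    if ignore_caps:
        word = word.lower()
        sim_sets = [{c.lower() for c in s} for s in sim_sets]
    representative = {c: min(s) for s in sim_sets for c in s}
    return "".join(representative.get(c, c) for c in word)


def levenshtein(a, b):
    """Classic edit distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def distance(a, b, sim_sets=(), ignore_caps=False):
    """Edit distance scaled to 0-1 by the longer word (assumed metric)."""
    a = normalize(a, sim_sets, ignore_caps)
    b = normalize(b, sim_sets, ignore_caps)
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))


def is_bad_match(a, b, threshold=0.4, **options):
    """A match counts as 'bad' when its distance is above the threshold."""
    return distance(a, b, **options) > threshold


if __name__ == "__main__":
    print(distance("Vader", "uader", sim_sets=[{"u", "v"}], ignore_caps=True))
    print(is_bad_match("huis", "hius"))
```

With the similarity set {u, v} and capitalization ignored, "Vader" and "uader" normalize to the same string, so their distance is 0.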
For all:
- printing with [-p [choice]]*. If nothing is entered, only 'updates' is turned on. Choices are:
-- 'zen': to not print anything
-- 'updates': to print updates on the execution of the program
-- 'tr_wl': to print the transcription wordlist when created or retrieved
-- 'ocr_wl': to print the ocr wordlist when created or retrieved
-- 'interm_matching': to print results of all the intermediate matchings
-- 'leftover_ind': to print indices of the transcription and ocr words that are not matched well enough yet
-- 'final': to print the final matching result
-- 'stats': to print some statistics about each matching
-- 'all': to print all previously mentioned
- timing with -t
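
The [-p [choice]]* notation means -p may be repeated, each time with an optional choice. A sketch of how such a flag can be declared with argparse; this assumes the tool uses argparse and that a bare -p means 'updates', neither of which is confirmed here, and make_parser and printing_modes are hypothetical names:

```python
import argparse

PRINT_CHOICES = ["zen", "updates", "tr_wl", "ocr_wl", "interm_matching",
                 "leftover_ind", "final", "stats", "all"]


def make_parser():
    """Parser covering only the -p and -t flags (illustrative, not main.py)."""
    parser = argparse.ArgumentParser(prog="main.py")
    # nargs="?" makes the choice optional; const is used for a bare -p;
    # action="append" collects every occurrence of -p into one list.
    parser.add_argument("-p", dest="printing", action="append", nargs="?",
                        const="updates", choices=PRINT_CHOICES)
    parser.add_argument("-t", dest="timing", action="store_true")
    return parser


def printing_modes(argv):
    """Return the list of requested print modes, empty if -p was not given."""
    args = make_parser().parse_args(argv)
    return args.printing or []


if __name__ == "__main__":
    print(printing_modes(["-p"]))
    print(printing_modes(["-p", "final", "-p", "stats", "-t"]))
```

One caveat of an optional-value flag like this: a bare -p should come last or be followed by another flag, otherwise the next positional argument is read as its choice.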
(old) To do the full process all in one: -- full_parser