KBNLresearch/transcription-ocr-mapping: A tool that maps a transcription file onto an existing ALTO file of the same scan to create OCR-like output with the original transcription.

----- not a real readme yet... TODO

To create the ground truth do: -- python main.py txt_parser input/transcriptions/gort004mei_03_01.txt -o output/tr_output181.csv 181

or for page 180: -- python main.py txt_parser input/transcriptions/gort004mei_03_01.txt -o output/tr_output180.csv 180
with check_numeric on: -- python main.py txt_parser input/transcriptions/gort004mei_03_01.txt -o output/tr_output180.csv 180 -ch
with printing on: -- python main.py txt_parser input/transcriptions/gort004mei_03_01.txt -o output/tr_output180.csv 180 -p
with timing on (not yet implemented): -- python main.py txt_parser input/transcriptions/gort004mei_03_01.txt -o output/tr_output180.csv 180 -p -t
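
When ground truth is needed for several pages, the commands above can be generated rather than typed by hand. The following is a minimal sketch, not part of the repository: the helper names `txt_parser_command` and `run_pages` are hypothetical, and it assumes the output naming pattern `output/tr_output<page>.csv` used in the examples above. Only the `-o`, `-ch` and `-p` flags shown above are used.

```python
import subprocess
import sys


def txt_parser_command(transcription, page, output_dir="output",
                       check_numeric=False, printing=False):
    """Build the argument list for one txt_parser run (hypothetical helper)."""
    cmd = [
        sys.executable, "main.py", "txt_parser", transcription,
        "-o", f"{output_dir}/tr_output{page}.csv",  # assumed naming pattern
        str(page),
    ]
    if check_numeric:
        cmd.append("-ch")  # adds the check_numeric column
    if printing:
        cmd.append("-p")   # turns printing on
    return cmd


def run_pages(transcription, pages, dry_run=True, **options):
    """Build one command per page; print them, or execute them if dry_run is False."""
    commands = [txt_parser_command(transcription, p, **options) for p in pages]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # must be run from the repository root
    return commands


if __name__ == "__main__":
    run_pages("input/transcriptions/gort004mei_03_01.txt", [180, 181])
```

With `dry_run=True` (the default) the commands are only printed, so they can be checked before anything is executed.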

To create the ocr csv file:

For Transkribus ALTO output do: -- python main.py -c ocr_parser input/OCR/gort004mei_03_01_1-20_175-190/alto/ -o output/ocr_output.csv 180
For Tesseract ALTO output, where one image is used as input, do: -- python main.py ocr_file_parser input/OCR/out5_12_p180_jpg.xml -o output/ocr_output_file.csv
For Tesseract ALTO output, where input was a TIFF file with multiple pages, do: -- python main.py -c ocr_parser_page input/OCR/out9_12_p1-20_175-190_tiff.xml -o output/ocr_output_p180_tiff_9_12.csv 180
or: -- python main.py ocr_parser_page input/OCR/gort004mei_03_01_full_16_12.xml -o output/ocr_output_p180_tiff_9_12.csv 180

To perform the mapping on a range of pages from a document:

python main.py all_pages_mapping input/transcriptions/gort004mei_03_01.txt input/OCR/gort004mei_03_01_full_16_12.xml 180 180
with output:
python main.py all_pages_mapping -o final -dir output/matching_output -fn gort004mei_03_01 input/transcriptions/gort004mei_03_01.txt input/OCR/gort004mei_03_01_full_16_12.xml 180 180

General form:
-- python main.py all_pages_mapping [TXT INPUT] [ALTO INPUT] [START PAGE] [END PAGE]
with output options, directly after 'all_pages_mapping':
-- -o [output option]
-- -dir [OUTPUT DIRECTORY]
-- -fn [OUTPUT FILENAME]
-- -sf
For the three above, the following options can be used:
-- to change the page number in a specific range: -c
-- to add a column with check_numeric booleans: -ch

To perform the matching with previously created csv files as input: -- python main.py csv_parser output/ocr_output.csv output/tr_output.csv

The following options can be used; type these directly after main.py:
-- to specify sets of characters that should be considered the same character: -sise input/configs/basic_sim_sets.txt
-- to ignore differences in capitalization when matching: -icap
-- to make matching between two numeric 'words' more restrictive: -ch
-- to allow n words from the transcription to be joined into one word in the OCR, or vice versa (larger splits/joins): -cnw 8
-- number of matchings to execute: -ma [NUM MATCH] (default: 1)
-- to also include "bad matches", i.e. matches with a distance above a given threshold (a float between 0 and 1), in any further matchings: -inma 0.4
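
To make the matching options more concrete, here is a minimal sketch of how similarity sets, case-insensitive comparison and a 0-1 distance threshold could interact. It is an illustration only, not the code in `matching.py`: it assumes a normalized Levenshtein distance, which this README does not confirm, and it passes the similarity sets as plain strings because the format of `basic_sim_sets.txt` is not described here. All function names are hypothetical.

```python
def build_sim_map(sim_sets):
    """Map every character in a similarity set to the set's first character."""
    sim_map = {}
    for chars in sim_sets:
        for ch in chars:
            sim_map[ch] = chars[0]
    return sim_map


def normalize(word, sim_map=None, ignore_caps=False):
    """Apply the (assumed) effect of -icap and -sise before comparing."""
    if ignore_caps:
        word = word.lower()
    if sim_map:
        word = "".join(sim_map.get(ch, ch) for ch in word)
    return word


def levenshtein(a, b):
    """Classic edit distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def distance(tr_word, ocr_word, sim_map=None, ignore_caps=False):
    """Edit distance scaled to the range 0-1 (0 means identical)."""
    a = normalize(tr_word, sim_map, ignore_caps)
    b = normalize(ocr_word, sim_map, ignore_caps)
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def is_bad_match(dist, threshold=0.4):
    """Assumed reading of -inma: matches above the threshold are retried later."""
    return dist > threshold


sim_map = build_sim_map(["sſ", "uv"])          # example sets, chosen for illustration
print(distance("huis", "huiſ"))                # long s counts as a mismatch
print(distance("huis", "huiſ", sim_map))       # long s treated as a plain s
print(distance("Vrouw", "vrouw", ignore_caps=True))
```

Normalizing both words before measuring the distance keeps the distance function itself unaware of the options, which is one simple way to combine them.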

For all:

Printing with [-p [choice]]*. If nothing is entered, only 'updates' is turned on. Choices are:
-- 'zen': print nothing
-- 'updates': print updates on the execution of the program
-- 'tr_wl': print the transcription word list when it is created or retrieved
-- 'ocr_wl': print the OCR word list when it is created or retrieved
-- 'interm_matching': print the results of all intermediate matchings
-- 'leftover_ind': print the indices of the transcription and OCR words that are not matched well enough yet
-- 'final': print the final matching result
-- 'stats': print some statistics about each matching
-- 'all': print all of the above
Timing with -t
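
The printing choices can be modelled with `argparse`. The following is a sketch under stated assumptions, not the parser in `commandline_helper.py`: it treats both a missing `-p` and a bare `-p` as "only 'updates'", lets 'zen' override everything else, and expands 'all' to every other choice. The names `build_parser` and `resolve_printing` are hypothetical.

```python
import argparse

PRINT_CHOICES = ["zen", "updates", "tr_wl", "ocr_wl", "interm_matching",
                 "leftover_ind", "final", "stats", "all"]


def build_parser():
    """Parser for the shared -p and -t options only."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("-p", dest="printing", nargs="*",
                        choices=PRINT_CHOICES, default=None)
    parser.add_argument("-t", dest="timing", action="store_true")
    return parser


def resolve_printing(choices):
    """Turn the raw -p values into the set of things to print."""
    if not choices:                      # no -p at all, or -p without values
        return {"updates"}
    selected = set(choices)
    if "zen" in selected:                # assumption: 'zen' silences everything
        return set()
    if "all" in selected:                # assumption: 'all' excludes 'zen'
        return set(PRINT_CHOICES) - {"zen", "all"}
    return selected


if __name__ == "__main__":
    args = build_parser().parse_args(["-p", "final", "stats"])
    print(sorted(resolve_printing(args.printing)))
```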

(old) To do the full process all in one: -- full_parser
