A proposal for post-OCR spelling correction using Language Models

Fine-tuned models: HuggingFace
Library for creating synthetic OCR errors: NoisOCR

Abstract:

This work explores the use of Language Models (LMs) to correct residual errors in texts extracted by OCR and HTR (Handwritten Text Recognition) systems. We propose a general approach and use images of Brazilian handwritten essays from the BRESSAY dataset as a use case. Two standard LMs (BART and ByT5) and two LLMs (Llama 1 and Llama 2) were evaluated in this context. The results indicate that the smaller LMs outperformed the LLMs in reducing the character error rate (CER) and word error rate (WER). Traditional correction methods, such as SymSpell and Norvig, were effective in some cases but fell short of the results obtained by the LMs. ByT5, with its byte-level tokenization, improved both CER and WER and performed well on highly noisy text. These results suggest that smaller LMs, after fine-tuning, are more efficient and cheaper for post-OCR correction. We identify and propose promising future studies involving correction at broader levels of context, such as paragraphs.
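Since the abstract reports results as CER and WER, a small illustration of how these metrics are commonly computed may help. The sketch below assumes the standard definitions (Levenshtein edit distance divided by the length of the reference, counted in characters for CER and in whitespace-separated words for WER). The function names `edit_distance`, `cer`, and `wer` are illustrative and are not taken from this repository's scripts, which may normalize text or aggregate scores differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    ground_truth = "a casa azul"   # hypothetical ground-truth line
    ocr_output = "a cosa azul"     # hypothetical OCR output with one wrong character
    print(f"CER: {cer(ground_truth, ocr_output):.3f}")
    print(f"WER: {wer(ground_truth, ocr_output):.3f}")
```

Under these definitions, a correction model reduces the error rate when the CER or WER of its output against the ground truth is lower than that of the raw OCR output.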

Methodology:

Results:

Citation:

@inproceedings{
  araujo2024a,
  title={A proposal for post-{OCR} spelling correction using Language Models},
  author={S{\'a}vio Santos de Ara{\'u}jo and Byron Leite Dantas Bezerra and Arthur Flor de Sousa Neto and Cleber Zanchettin},
  booktitle={Latinx in AI @ NeurIPS 2024},
  year={2024},
  url={https://openreview.net/forum?id=p5P9R9AKr5}
}

Folders and files:

corrections
finetuned_models
metrics
scripts
test_data
train_data
LICENSE
README.md
requirements.txt
test_ground_truth.csv


savi8sant8s/ptbr-post-ocr-sc-llm
