About

hOCR is a format for representing OCR output, including layout information, character confidences, bounding boxes, and style information. It embeds this information invisibly in standard HTML. By building on standard HTML, it automatically inherits well-defined support for most scripts, languages, and common layout options. Furthermore, unlike previous OCR formats, the recognized text and the OCR-related information co-exist in the same file and survive editing and manipulation. hOCR markup is independent of the presentation.
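As a rough illustration of how this information sits inside ordinary HTML, the sketch below reads word boxes and confidences out of a small hOCR fragment using only the Python standard library. The fragment, `parse_title`, `WordCollector`, and `extract_words` are hypothetical names made up for this example and are not part of hocr-tools. It assumes the usual hOCR convention of class names such as `ocrx_word` and semicolon-separated properties such as `bbox` and `x_wconf` in the `title` attribute, and it assumes words contain no nested elements.

```python
from html.parser import HTMLParser

# Hypothetical hOCR fragment, written for illustration only.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 800 600'>
  <span class='ocr_line' title='bbox 10 20 300 40'>
    <span class='ocrx_word' title='bbox 10 20 90 40; x_wconf 96'>Hello</span>
    <span class='ocrx_word' title='bbox 100 20 180 40; x_wconf 88'>world</span>
  </span>
</div>
"""


def parse_title(title):
    """Split an hOCR title attribute into a {property: [values]} dict."""
    props = {}
    for part in title.split(";"):
        fields = part.split()
        if fields:
            props[fields[0]] = fields[1:]
    return props


class WordCollector(HTMLParser):
    """Collect text, bounding box, and confidence for each ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            props = parse_title(attrs.get("title") or "")
            conf = props.get("x_wconf")
            self._current = {
                "text": "",
                "bbox": tuple(int(v) for v in props.get("bbox", [])),
                "conf": int(conf[0]) if conf else None,
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        # Assumes a word holds no nested elements, so the first closing
        # span after a word start ends that word.
        if self._current is not None and tag == "span":
            self.words.append(self._current)
            self._current = None


def extract_words(html):
    collector = WordCollector()
    collector.feed(html)
    return collector.words


if __name__ == "__main__":
    for word in extract_words(SAMPLE):
        print(word["text"], word["bbox"], word["conf"])
```

Because the markup is plain HTML, a browser shows only the recognized text, while the layout data stays available to tools that look at the `class` and `title` attributes.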

There is a Public Specification for the hOCR Format.

Available Programs

Included command line programs:

hocr-check -- check the hOCR file for errors
hocr-combine -- combine pages in multiple hOCR files into a single document
hocr-eval -- compute number of segmentation and OCR errors
hocr-eval-geom -- compute over, under, and mis-segmentations
hocr-eval-lines -- compute OCR errors of hOCR output relative to text ground truth
hocr-split -- split an hOCR file into individual pages
hocr-merge-dc -- merge Dublin Core metadata into the hOCR HTML header
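To give a feel for what counting OCR errors against text ground truth can involve, here is a minimal sketch based on Levenshtein edit distance. It is an assumption that this measure resembles what the evaluation programs compute; it is not taken from their source. `edit_distance` and `count_line_errors` are hypothetical helper names, and the lines are assumed to be already paired up in order.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def count_line_errors(ocr_lines, truth_lines):
    """Sum character errors over lines paired in order (pairing is assumed)."""
    return sum(edit_distance(o, t) for o, t in zip(ocr_lines, truth_lines))


if __name__ == "__main__":
    ocr = ["he1lo", "world"]
    truth = ["hello", "world"]
    print(count_line_errors(ocr, truth))  # prints 1
```

A real evaluation also has to decide how to align OCR lines with ground-truth lines and how to normalize whitespace, which this sketch leaves out.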

grassit/hocr-tools
