Machine based reading order integration #140

Open
wants to merge 72 commits into
base: main
5fdc6d4
integration of machine based reading order detection
vahidrezanezhad Oct 14, 2023
49c9314
machine based reading order inference with a variable batch size
vahidrezanezhad Oct 20, 2023
59c0d90
machine based reading order inference & optimized algorithm
vahidrezanezhad Oct 20, 2023
941d873
machine based reading order & works for not full layout case
vahidrezanezhad Oct 20, 2023
eac18c5
machine based reading order as an argument
vahidrezanezhad Dec 13, 2023
5144668
ocr engine first integration
vahidrezanezhad Jul 17, 2024
a62ae37
new full layout model and early layout for 1&2 column images are inte…
vahidrezanezhad Aug 7, 2024
be144db
updating 1&2 columns images + full layout
vahidrezanezhad Aug 7, 2024
00bf2b6
1&2 column images only printspace
vahidrezanezhad Aug 7, 2024
e976778
testing pyproject.toml
vahidrezanezhad Aug 14, 2024
53fd5fb
resolving #106 for pyproject.toml test
vahidrezanezhad Aug 14, 2024
4c50479
pyproject.toml may work for ocrd
vahidrezanezhad Aug 14, 2024
74eac4d
dtype = object in the case of length 1 arise error
vahidrezanezhad Aug 15, 2024
6f4205b
update pyproject.toml
vahidrezanezhad Aug 15, 2024
4f8210d
update Makefile model location
cneud Aug 15, 2024
c10a525
inference with batch size bigger than 1
vahidrezanezhad Aug 23, 2024
04e7900
making light version faster for 1 and 2 columns images
vahidrezanezhad Aug 24, 2024
7ae6a87
ignoring dpi check by light version
vahidrezanezhad Aug 26, 2024
9300595
inference batch size debugged
vahidrezanezhad Aug 27, 2024
0f87974
writing drop capitals in xml output + and may resolve issue #110
vahidrezanezhad Sep 2, 2024
c3a4a1b
resolving issue #110 in a better way
vahidrezanezhad Sep 3, 2024
f0b4907
adding option for textline detection in printspace
vahidrezanezhad Sep 3, 2024
2c93904
avoiding double binarization
vahidrezanezhad Sep 12, 2024
1b18ae8
passing number of columns as an argument
vahidrezanezhad Sep 12, 2024
21380fc
scaling contours without dilation
vahidrezanezhad Sep 17, 2024
a1f1f98
updating scaling contours
vahidrezanezhad Sep 17, 2024
5a07cd9
the most effective version of contours dilation without opencv and al…
vahidrezanezhad Sep 19, 2024
2d18739
postprocessing of textline contour dilation + skip layout and reading…
vahidrezanezhad Sep 20, 2024
b9e8959
update of light versions
vahidrezanezhad Sep 20, 2024
5d68013
updating light version
vahidrezanezhad Sep 20, 2024
7f08458
dilation of text regions without opencv
vahidrezanezhad Sep 21, 2024
62f8ae4
updating dilation of textlines and text regions
vahidrezanezhad Sep 23, 2024
6626dc6
updating textline dilation parameters
vahidrezanezhad Sep 23, 2024
b33739a
parametriyation in the case of textline contours dilation is accompli…
vahidrezanezhad Sep 24, 2024
95effe5
updating textregions dilation
vahidrezanezhad Sep 25, 2024
1330911
dilation of textregions and marginals are accomplished
vahidrezanezhad Sep 27, 2024
ad32316
updating light version
vahidrezanezhad Sep 27, 2024
1774076
updating light version. Remove textlines or textregion contours insid…
vahidrezanezhad Sep 30, 2024
ab63d5b
updating light version features
vahidrezanezhad Sep 30, 2024
543ed4b
-light version need -tll to be enabled otherwise the process will be …
vahidrezanezhad Oct 2, 2024
1da4b7f
updating light version
vahidrezanezhad Oct 7, 2024
3ef4eac
textlines of textregions are extracted in a faster way + early layout…
vahidrezanezhad Oct 17, 2024
f93fa12
doing more multiprocessing in order to make the process faster
vahidrezanezhad Oct 18, 2024
70772d4
binarization as a standalone command
vahidrezanezhad Oct 21, 2024
328d33e
Temporary commit – textline prediction without patches
vahidrezanezhad Oct 23, 2024
82281bd
fixing a bug occuring with reading order + Slro option with no patch …
vahidrezanezhad Oct 25, 2024
5037e98
Merge branch 'machine_based_reading_order_integration' of https://git…
vahidrezanezhad Oct 25, 2024
90ee2d6
textline segmentation is masked with drop capitals
vahidrezanezhad Oct 28, 2024
438df52
updating
vahidrezanezhad Oct 29, 2024
e796a99
updating inference for early layout in the case of documents with num…
vahidrezanezhad Oct 30, 2024
751b010
updating early layout inference for light version
vahidrezanezhad Nov 5, 2024
f7e5fb9
resolving merge conflict of machine based reading order and extractin…
vahidrezanezhad Nov 5, 2024
bceeeb5
Merge pull request #138 from qurator-spk/extracting_images_only
vahidrezanezhad Nov 5, 2024
6aee70d
Resolve merge conflict of main and machine based reading order branch
vahidrezanezhad Nov 5, 2024
0914b5f
resolve merge conflict of main branch with machine based reading ord…
vahidrezanezhad Nov 5, 2024
8409de0
sbb_binarization is integrated into eynollah works in framework of oc…
vahidrezanezhad Nov 10, 2024
1ae77e6
Update requirements.txt
cneud Nov 11, 2024
22b0b07
drop capital and marginals extraction is updated
vahidrezanezhad Nov 11, 2024
f43c49c
textlines of drop capitals are connected to corresponding textline if…
vahidrezanezhad Nov 13, 2024
ce5b611
tests are passed - new models by the way should be uploaded
vahidrezanezhad Nov 14, 2024
5fa8ca4
updating requirements
vahidrezanezhad Nov 14, 2024
d9f79c3
fixing IndexError by reading order detection
vahidrezanezhad Nov 18, 2024
b622494
new table detection model is integrated
vahidrezanezhad Nov 21, 2024
1746920
Update Makefile
vahidrezanezhad Nov 21, 2024
3000255
Update Makefile
vahidrezanezhad Nov 22, 2024
8014a9e
Update Makefile
vahidrezanezhad Nov 22, 2024
1083d1c
gha: try to free disk space
kba Nov 25, 2024
6aad006
filter textregions without textline
vahidrezanezhad Dec 2, 2024
871d7bf
fixed: machine based reading order cause tuple index out of range err…
vahidrezanezhad Dec 4, 2024
fbeef79
adding scatter_nd inference
vahidrezanezhad Dec 16, 2024
92bfac4
Provide OCR as an option to process a directory of XML files, incorpo…
vahidrezanezhad Dec 20, 2024
33fda2f
changing cnn ocr model name
vahidrezanezhad Dec 26, 2024
6 changes: 6 additions & 0 deletions .github/workflows/test-eynollah.yml
@@ -14,6 +14,12 @@ jobs:
        python-version: ['3.8', '3.9', '3.10', '3.11']

    steps:
      - name: clean up
        run: |
          sudo rm -rf /usr/share/dotnet
          sudo rm -rf /opt/ghc
          sudo rm -rf "/usr/local/share/boost"
          sudo rm -rf "$AGENT_TOOLSDIRECTORY"
      - uses: actions/checkout@v4
      - uses: actions/cache@v4
        id: model_cache
6 changes: 3 additions & 3 deletions Makefile
@@ -32,9 +32,9 @@ models_eynollah: models_eynollah.tar.gz
models_eynollah.tar.gz:
	# wget 'https://qurator-data.de/eynollah/2021-04-25/models_eynollah.tar.gz'
	# wget 'https://qurator-data.de/eynollah/2022-04-05/models_eynollah_renamed.tar.gz'
	# wget 'https://qurator-data.de/eynollah/2022-04-05/models_eynollah_renamed_savedmodel.tar.gz'
	wget 'https://qurator-data.de/eynollah/2022-04-05/models_eynollah.tar.gz'
	# wget 'https://github.com/qurator-spk/eynollah/releases/download/v0.3.0/models_eynollah.tar.gz'
	wget 'https://github.com/qurator-spk/eynollah/releases/download/v0.3.1/models_eynollah.tar.gz'
	# wget 'https://github.com/qurator-spk/eynollah/releases/download/v0.3.1/models_eynollah.tar.gz'

# Install with pip
install:
@@ -45,7 +45,7 @@ install-dev:
	pip install -e .

smoke-test:
	eynollah -i tests/resources/kant_aufklaerung_1784_0020.tif -o . -m $(PWD)/models_eynollah
	eynollah layout -i tests/resources/kant_aufklaerung_1784_0020.tif -o . -m $(PWD)/models_eynollah

# Run unit tests
test:
1 change: 1 addition & 0 deletions pyproject.toml
@@ -28,6 +28,7 @@ classifiers = [
[project.scripts]
eynollah = "eynollah.cli:main"
ocrd-eynollah-segment = "eynollah.ocrd_cli:main"
ocrd-sbb-binarize = "eynollah.ocrd_cli_binarization:cli"

[project.urls]
Homepage = "https://github.com/qurator-spk/eynollah"
5 changes: 4 additions & 1 deletion requirements.txt
@@ -2,7 +2,10 @@
ocrd >= 2.23.3
numpy <1.24.0
scikit-learn >= 0.23.2
tensorflow == 2.12.1
tensorflow < 2.13
imutils >= 0.5.3
matplotlib
setuptools >= 50
transformers <= 4.30.2
torch <= 2.0.1
numba <= 0.58.1
201 changes: 173 additions & 28 deletions src/eynollah/cli.py
@@ -1,16 +1,95 @@
import os
import sys
import click
from ocrd_utils import initLogging, setOverrideLogLevel
from eynollah.eynollah import Eynollah
from eynollah.eynollah import Eynollah, Eynollah_ocr
from eynollah.sbb_binarize import SbbBinarizer

@click.group()
def main():
    pass

@click.command()
@main.command()
@click.option(
"--dir_xml",
"-dx",
help="directory of GT page-xml files",
type=click.Path(exists=True, file_okay=False),
)

@click.option(
"--dir_out_modal_image",
"-domi",
help="directory where ground truth images would be written",
type=click.Path(exists=True, file_okay=False),
)

@click.option(
"--dir_out_classes",
"-docl",
help="directory where ground truth classes would be written",
type=click.Path(exists=True, file_okay=False),
)

@click.option(
"--input_height",
"-ih",
help="input height",
)
@click.option(
"--input_width",
"-iw",
help="input width",
)
@click.option(
"--min_area_size",
"-min",
help="min area size of regions considered for reading order training.",
)

def machine_based_reading_order(dir_xml, dir_out_modal_image, dir_out_classes, input_height, input_width, min_area_size):
    xml_files_ind = os.listdir(dir_xml)

@main.command()
@click.option('--patches/--no-patches', default=True, help='by enabling this parameter you let the model to see the image in patches.')

@click.option('--model_dir', '-m', type=click.Path(exists=True, file_okay=False), required=True, help='directory containing models for prediction')

@click.argument('input_image')

@click.argument('output_image')
@click.option(
"--dir_in",
"-di",
help="directory of images",
type=click.Path(exists=True, file_okay=False),
)
@click.option(
"--dir_out",
"-do",
help="directory where the binarized images will be written",
type=click.Path(exists=True, file_okay=False),
)

def binarization(patches, model_dir, input_image, output_image, dir_in, dir_out):
    if not dir_out and dir_in:
        print("Error: You used -di but did not set -do")
        sys.exit(1)
    elif dir_out and not dir_in:
        print("Error: You used -do to write out binarized images but have not set -di")
        sys.exit(1)
    SbbBinarizer(model_dir).run(image_path=input_image, use_patches=patches, save=output_image, dir_in=dir_in, dir_out=dir_out)
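
The `-di`/`-do` pairing check in `binarization` could be isolated into a small pure function so it can be tested without invoking the CLI. This is a minimal sketch, not code from this PR: the helper name `paired_dir_error` is hypothetical, and only the two messages are copied from the diff.

```python
from typing import Optional


def paired_dir_error(dir_in: Optional[str], dir_out: Optional[str]) -> Optional[str]:
    """Return an error message if only one of the two directories is set.

    Hypothetical helper (not part of the PR); it mirrors the two checks
    at the top of `binarization` and returns None when the options agree.
    """
    if dir_in and not dir_out:
        return "Error: You used -di but did not set -do"
    if dir_out and not dir_in:
        return "Error: You used -do to write out binarized images but have not set -di"
    return None
```

A caller would print the returned message and call `sys.exit(1)` when it is not `None`, which keeps the exit behaviour in one place.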




@main.command()
@click.option(
"--image",
"-i",
help="image filename",
type=click.Path(exists=True, dir_okay=False),
)

@click.option(
"--out",
"-o",
@@ -140,36 +219,41 @@
help="if this parameter set to true, this tool would ignore page extraction",
)
@click.option(
"--log-level",
"--reading_order_machine_based/--heuristic_reading_order",
"-romb/-hro",
is_flag=True,
help="if this parameter set to true, this tool would apply machine based reading order detection",
)
@click.option(
"--do_ocr",
"-ocr/-noocr",
is_flag=True,
help="if this parameter set to true, this tool will try to do ocr",
)
@click.option(
"--num_col_upper",
"-ncu",
help="upper limit of columns in document image",
)
@click.option(
"--num_col_lower",
"-ncl",
help="lower limit of columns in document image",
)
@click.option(
"--skip_layout_and_reading_order",
"-slro/-noslro",
is_flag=True,
help="if this parameter set to true, this tool will ignore layout detection and reading order. It means that textline detection will be done within printspace and contours of textline will be written in xml output file.",
)
@click.option(
"--log_level",
"-l",
type=click.Choice(['OFF', 'DEBUG', 'INFO', 'WARN', 'ERROR']),
help="Override log level globally to this",
)
def main(
image,
out,
dir_in,
model,
save_images,
save_layout,
save_deskewed,
save_all,
extract_only_images,
save_page,
enable_plotting,
allow_enhancement,
curved_line,
textline_light,
full_layout,
tables,
right2left,
input_binary,
allow_scaling,
headers_off,
light_version,
ignore_page_extraction,
log_level
):

def layout(image, out, dir_in, model, save_images, save_layout, save_deskewed, save_all, extract_only_images, save_page, enable_plotting, allow_enhancement, curved_line, textline_light, full_layout, tables, right2left, input_binary, allow_scaling, headers_off, light_version, reading_order_machine_based, do_ocr, num_col_upper, num_col_lower, skip_layout_and_reading_order, ignore_page_extraction, log_level):
    if log_level:
        setOverrideLogLevel(log_level)
    initLogging()
@@ -182,6 +266,8 @@ def main(
    if textline_light and not light_version:
        print('Error: You used -tll to enable light textline detection but -light is not enabled')
        sys.exit(1)
    if light_version and not textline_light:
        print('Error: You used -light without -tll. Light version need light textline to be enabled.')
        # Abort here as well, like the other invalid flag combinations
        sys.exit(1)
    if extract_only_images and (allow_enhancement or allow_scaling or light_version or curved_line or textline_light or full_layout or tables or right2left or headers_off) :
        print('Error: You used -eoi which can not be enabled alongside light_version -light or allow_scaling -as or allow_enhancement -ae or curved_line -cl or textline_light -tll or full_layout -fl or tables -tab or right2left -r2l or headers_off -ho')
        sys.exit(1)
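
The flag checks in this hunk (`-tll` needs `-light`, `-light` needs `-tll`, and `-eoi` excludes a list of other flags) could also be collected in one function that reports every conflict at once. The following is only a sketch under assumptions: `layout_flag_errors` and `EOI_INCOMPATIBLE` are hypothetical names, the shortened messages are illustrative, and the flag names are taken from the `layout` signature shown in the diff.

```python
from typing import Dict, List

# Flags that the diff lists as incompatible with -eoi (extract_only_images).
EOI_INCOMPATIBLE = (
    "allow_enhancement", "allow_scaling", "light_version", "curved_line",
    "textline_light", "full_layout", "tables", "right2left", "headers_off",
)


def layout_flag_errors(flags: Dict[str, bool]) -> List[str]:
    """Collect every flag conflict instead of exiting at the first one.

    Hypothetical helper (not part of the PR). Missing keys count as False.
    """
    errors: List[str] = []
    if flags.get("textline_light") and not flags.get("light_version"):
        errors.append("-tll requires -light")
    if flags.get("light_version") and not flags.get("textline_light"):
        errors.append("-light requires -tll")
    if flags.get("extract_only_images"):
        clashing = [name for name in EOI_INCOMPATIBLE if flags.get(name)]
        if clashing:
            errors.append("-eoi cannot be combined with: " + ", ".join(clashing))
    return errors
```

Returning a list rather than exiting immediately lets the command print all problems before a single `sys.exit(1)`.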
@@ -208,12 +294,71 @@ def main(
        headers_off=headers_off,
        light_version=light_version,
        ignore_page_extraction=ignore_page_extraction,
        reading_order_machine_based=reading_order_machine_based,
        do_ocr=do_ocr,
        num_col_upper=num_col_upper,
        num_col_lower=num_col_lower,
        skip_layout_and_reading_order=skip_layout_and_reading_order,
    )
    if dir_in:
        eynollah.run()
    else:
        pcgts = eynollah.run()
        eynollah.writer.write_pagexml(pcgts)


@main.command()
@click.option(
"--dir_in",
"-di",
help="directory of images",
type=click.Path(exists=True, file_okay=False),
)
@click.option(
"--out",
"-o",
help="directory to write output xml data",
type=click.Path(exists=True, file_okay=False),
required=True,
)
@click.option(
"--dir_xmls",
"-dx",
help="directory of xmls",
type=click.Path(exists=True, file_okay=False),
)
@click.option(
"--model",
"-m",
help="directory of models",
type=click.Path(exists=True, file_okay=False),
required=True,
)
@click.option(
"--tr_ocr",
"-trocr/-notrocr",
is_flag=True,
help="if this parameter set to true, transformer ocr will be applied, otherwise cnn_rnn model.",
)
@click.option(
"--log_level",
"-l",
type=click.Choice(['OFF', 'DEBUG', 'INFO', 'WARN', 'ERROR']),
help="Override log level globally to this",
)

def ocr(dir_in, out, dir_xmls, model, tr_ocr, log_level):
    if log_level:
        setOverrideLogLevel(log_level)
    initLogging()
    eynollah_ocr = Eynollah_ocr(
        dir_xmls=dir_xmls,
        dir_in=dir_in,
        dir_out=out,
        dir_models=model,
        tr_ocr=tr_ocr,
    )
    eynollah_ocr.run()

if __name__ == "__main__":
    main()
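
With `main` now a `click.group()`, the decorated functions become subcommands such as `layout`, `binarization` and `ocr`, which is why the smoke test changes to `eynollah layout ...`. The dispatch pattern can be illustrated with the standard library's `argparse` sub-parsers as a stand-in for click. This is a sketch under assumptions, not the PR's implementation: `build_parser` is a hypothetical name, and only a few of the options visible in the diff are reproduced.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one sub-parser per subcommand.

    Hypothetical argparse analogue of the click group in cli.py; the real
    commands take many more options than the handful shown here.
    """
    parser = argparse.ArgumentParser(prog="eynollah")
    sub = parser.add_subparsers(dest="command", required=True)

    # Options used by the Makefile smoke test.
    layout = sub.add_parser("layout")
    layout.add_argument("-i", "--image")
    layout.add_argument("-o", "--out")
    layout.add_argument("-m", "--model")

    # Positional input/output images plus the model directory.
    binarization = sub.add_parser("binarization")
    binarization.add_argument("input_image")
    binarization.add_argument("output_image")
    binarization.add_argument("--model_dir", required=True)

    # Directory-based OCR over existing XML files.
    ocr = sub.add_parser("ocr")
    ocr.add_argument("--dir_in")
    ocr.add_argument("--dir_xmls")
    ocr.add_argument("--out", required=True)
    ocr.add_argument("--model", required=True)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["layout", "-i", "page.tif", "-o", ".", "-m", "models_eynollah"]
    )
    print(args.command, args.image)
```

The parsed `command` attribute plays the role that click's group dispatch plays in the PR: it selects which function receives the remaining options.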