Add Japanese Vertical Support Branch for Tesseract and Ocrmypdf OCR #2505

Merged
merged 18 commits into from
Apr 16, 2024
Changes from 1 commit
Commits
bd062da
Add Japanese Vertical Support Branch
tenpai-git Feb 20, 2024
ea870f9
Merge remote-tracking branch 'eikek/current-docs' into add-japanese-v…
tenpai-git Mar 17, 2024
e6bba49
Merge remote-tracking branch 'upstream/master' into add-japanese-vert…
tenpai-git Mar 17, 2024
145c44b
Fixing translation of JpnSenk to JpnVert
tenpai-git Mar 20, 2024
915bd34
Post-sbt-fix Run of JpnVert
tenpai-git Mar 20, 2024
3c67443
Removing extraneous tailwindcss binary.
tenpai-git Mar 20, 2024
53b1de3
Removing npm dependencies from testing
tenpai-git Mar 20, 2024
db02b0f
More NPM cleanup and comment fixes.
tenpai-git Mar 21, 2024
949aa51
Adds Japanese Vertical mappings to default configuration.
tenpai-git Mar 24, 2024
56167c2
Variable Renaming, minor bug fix.
tenpai-git Mar 24, 2024
02dfcf2
Shows language selection in custom mappings, makes Eikek a happy main…
tenpai-git Mar 29, 2024
196a678
Shows language selection in custom mappings, makes Eikek a happy main…
tenpai-git Mar 29, 2024
2a94966
Merge remote-tracking branch 'refs/remotes/origin/add-japanese-vertic…
tenpai-git Apr 1, 2024
01e01cc
Merge branch 'master' of https://github.com/eikek/docspell into add-j…
tenpai-git Apr 1, 2024
af1ab9a
Merge branch 'eikek:master' into add-japanese-vertical-support
tenpai-git Apr 4, 2024
7c1a22f
Fixing client dropdown.
tenpai-git Apr 16, 2024
26c92a6
Merge branch 'add-japanese-vertical-support' of https://github.com/te…
tenpai-git Apr 16, 2024
97fed73
Merge branch 'master' of https://github.com/eikek/docspell into add-j…
tenpai-git Apr 16, 2024
2 changes: 1 addition & 1 deletion docker/dockerfiles/joex.dockerfile
@@ -77,7 +77,7 @@ RUN \
wget https://github.com/tesseract-ocr/tessdata/raw/main/khm.traineddata && \
mv khm.traineddata /usr/share/tessdata

- # Using these data files for japanese, because they work better. See #973
+ # Using these data files for japanese, because they work better. Includes vertical data. See #973 and #2445.
RUN \
wget https://raw.githubusercontent.com/tesseract-ocr/tessdata_fast/master/jpn_vert.traineddata && \
wget https://raw.githubusercontent.com/tesseract-ocr/tessdata_fast/master/jpn.traineddata && \

@@ -125,6 +125,7 @@ object DateFind {
case Language.Dutch => dmy.or(ymd).or(mdy)
case Language.Latvian => dmy.or(lavLong).or(ymd)
case Language.Japanese => ymd
case Language.JpnVert => ymd
case Language.Hebrew => dmy
case Language.Lithuanian => ymd
case Language.Polish => dmy

@@ -54,6 +54,8 @@ object MonthName {
latvian
case Language.Japanese =>
japanese
case Language.JpnVert =>
japanese
case Language.Hebrew =>
hebrew
case Language.Lithuanian =>

@@ -143,6 +143,7 @@ class DateFindTest extends FunSuite {
)
}

/* Vertical OCR still produces ordinary horizontal text, so its output can be tested the same way; a separate test for vertical Japanese is not necessary. */
test("find japanese dates") {
assertEquals(
DateFind
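The comment above argues that OCR output is plain horizontal text whichever traineddata produced it, so one date test covers both. A minimal sketch of that idea, assuming a kanji-delimited year-month-day pattern; `JapaneseDateSketch` and `findDates` are hypothetical names for illustration and are not docspell's `DateFind` API:

```scala
import java.time.LocalDate
import scala.util.Try

// Hypothetical helper, not part of docspell: finds dates written as
// year-month-day with kanji markers, e.g. 2021年3月5日.
object JapaneseDateSketch {
  private val Ymd = raw"(\d{4})年(\d{1,2})月(\d{1,2})日".r

  // Returns every valid date in the text; impossible dates (month 13, day 40)
  // are dropped because LocalDate.of throws and Try turns that into None.
  def findDates(text: String): List[LocalDate] =
    Ymd.findAllMatchIn(text).toList.flatMap { m =>
      Try(LocalDate.of(m.group(1).toInt, m.group(2).toInt, m.group(3).toInt)).toOption.toList
    }
}
```

The function only sees a string, so text recognised with `jpn` and text recognised with `jpn_vert` take the same path through it.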
6 changes: 6 additions & 0 deletions modules/common/src/main/scala/docspell/common/Language.scala
@@ -123,6 +123,11 @@ object Language {
val iso3 = "jpn"
}

/* Not an ISO value, but it must be unique, and tesseract needs jpn_vert for its scan, per the config in /etc/docspell-joex/docspell-joex.conf. */
case object JpnVert extends Language {
val iso2 = "ja_vert"
val iso3 = "jpn_vert"
}
case object Hebrew extends Language {
val iso2 = "he"
val iso3 = "heb"
@@ -172,6 +177,7 @@ object Language {
Romanian,
Latvian,
Japanese,
JpnVert,
Hebrew,
Lithuanian,
Polish,
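To illustrate why the `iso3` value is `jpn_vert` rather than a real ISO code, here is a rough sketch of how a language value could end up as tesseract's `-l` argument. It assumes the OCR step passes `iso3` straight to the CLI; `Lang` and `TesseractArgs` are trimmed-down, hypothetical stand-ins and not docspell's actual types or command template:

```scala
// Hypothetical, reduced model of the Language type shown in the diff.
sealed trait Lang {
  def iso2: String
  def iso3: String
}

object Lang {
  case object Japanese extends Lang { val iso2 = "ja"; val iso3 = "jpn" }
  // Not ISO codes: chosen to be unique and to match jpn_vert.traineddata.
  case object JpnVert extends Lang { val iso2 = "ja_vert"; val iso3 = "jpn_vert" }

  val all: List[Lang] = List(Japanese, JpnVert)

  // Look a language up by either code, case-insensitively.
  def fromString(s: String): Either[String, Lang] = {
    val key = s.trim.toLowerCase
    all.find(l => l.iso2 == key || l.iso3 == key).toRight(s"Unknown language: $s")
  }
}

object TesseractArgs {
  // Argument list in the form: tesseract <image> <outputbase> -l <lang>
  def apply(image: String, outBase: String, lang: Lang): List[String] =
    List("tesseract", image, outBase, "-l", lang.iso3)
}
```

With this shape, selecting `JpnVert` yields `-l jpn_vert`, which tesseract resolves to the `jpn_vert.traineddata` file installed in the joex image.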

@@ -201,6 +201,7 @@ object FtsRepository extends DoobieMeta {
case Language.Czech => "simple"
case Language.Latvian => "simple"
case Language.Japanese => "simple"
case Language.JpnVert => "simple"
case Language.Hebrew => "simple"
case Language.Lithuanian => "simple"
case Language.Polish => "simple"
8 changes: 8 additions & 0 deletions modules/webapp/src/main/elm/Data/Language.elm
@@ -30,6 +30,7 @@ type Language
| Dutch
| Latvian
| Japanese
| JpnVert
| Hebrew
| Hungarian
| Lithuanian
@@ -90,6 +91,9 @@ fromString str =
else if str == "jpn" || str == "ja" || str == "japanese" then
Just Japanese

else if str == "jpn_vert" || str == "ja_vert" || str == "jpnvert" then
Just JpnVert

else if str == "heb" || str == "he" || str == "hebrew" then
Just Hebrew

@@ -169,6 +173,9 @@ toIso3 lang =
Japanese ->
"jpn"

JpnVert ->
"jpn_vert"

Hebrew ->
"heb"

@@ -212,6 +219,7 @@ all =
, Romanian
, Latvian
, Japanese
, JpnVert
, Hebrew
, Hungarian
, Lithuanian
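Because `fromString`, `toIso3` and `all` have to stay in sync by hand, a round-trip check can catch a branch that parses a code into the wrong constructor. The sketch below mirrors the Elm functions in Scala for two languages only; `RoundTripSketch` and `brokenRoundTrips` are hypothetical names, and nothing like this check is shown in the PR itself:

```scala
// Hypothetical Scala mirror of the Elm Data.Language functions, reduced to
// the two Japanese variants.
object RoundTripSketch {
  sealed trait Lang
  case object Japanese extends Lang
  case object JpnVert extends Lang

  val all: List[Lang] = List(Japanese, JpnVert)

  def toIso3(lang: Lang): String = lang match {
    case Japanese => "jpn"
    case JpnVert  => "jpn_vert"
  }

  def fromString(str: String): Option[Lang] = str match {
    case "jpn" | "ja" | "japanese"          => Some(Japanese)
    case "jpn_vert" | "ja_vert" | "jpnvert" => Some(JpnVert)
    case _                                  => None
  }

  // Languages whose ISO3 code does not parse back to the same value.
  def brokenRoundTrips: List[Lang] =
    all.filter(l => !fromString(toIso3(l)).contains(l))
}
```

An empty `brokenRoundTrips` means every listed language survives `toIso3` followed by `fromString`; a vertical-Japanese branch that returned `Japanese` would show up here as a single offending entry.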
9 changes: 9 additions & 0 deletions modules/webapp/src/main/elm/Messages/Data/Language.elm
@@ -65,6 +65,9 @@ gb lang =
Japanese ->
"Japanese"

JpnVert ->
"JpnVert"

Hebrew ->
"Hebrew"

@@ -141,6 +144,9 @@ de lang =
Japanese ->
"Japanisch"

JpnVert ->
"JpnSenk"

Hebrew ->
"Hebräisch"

@@ -217,6 +223,9 @@ fr lang =
Japanese ->
"Japonnais"

JpnVert ->
"JpnVert"

Hebrew ->
"Hébreu"
