4.0 bugs on MAC OS X and a step by step for reference #1453
Comments
Thank you for step by step info. This should probably be added to wiki.
One correction:
When doing fine-tune training, ONLY traineddata files from tessdata_best
can be used as the base traineddata to continue from.
Models from tessdata_fast, as well as tessdata, will NOT work.
…On Sun 8 Apr, 2018, 3:16 PM FernandoGOT, ***@***.***> wrote:
This is the step-by-step procedure I used to install Tesseract 4.0 on Mac OS X,
together with the fixes and workarounds I needed to make it work.
I'm sharing this "guide" with the intention of helping other people who
may have the same problems I had.
Special thanks to Shree, who helped me on the Google group.
Project and more details: https://github.com/tesseract-ocr/tesseract
Where to get help?
Google group: https://groups.google.com/forum/#!forum/tesseract-ocr
GitHub issues: https://github.com/tesseract-ocr/tesseract/issues
Platform: MAC OS X 10.13.3
Tesseract: 4.0.0-beta.1-69-g10f4
leptonica-1.75.3
libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
Found AVX2
Found AVX
Found SSE
Compiling Tesseract - tesseract 4.0
Reference: https://github.com/tesseract-ocr/tesseract/wiki/Compiling#macos
Warning: Don't install tesseract using brew, since you can't generate the
ScrollView.jar from it! (At least I wasn't able to generate it)
Steps
1 - Install these libs
brew install automake autoconf autoconf-archive libtool
brew install pkgconfig
brew install icu4c
brew install leptonica
brew install gcc
2 - Run the code
ln -hfs /usr/local/Cellar/icu4c/60.2 /usr/local/opt/icu4c
Note: text2image is set to use icu4c/60.2, but the version actually
installed is icu4c/61.1
3 - Clone tesseract repo
git clone https://github.com/tesseract-ocr/tesseract/
4 - Enter in the folder
cd tesseract
5 - Run the script
./autogen.sh
6 - Run the code, and copy the CPPFLAGS and LDFLAGS
brew info icu4c
7 - Update the CPPFLAGS and LDFLAGS and execute the code
./configure \
CPPFLAGS=-I/usr/local/opt/icu4c/include \
LDFLAGS=-L/usr/local/opt/icu4c/lib
8 - Run the code
make -j
9 - Run the code
sudo make install
10 - Run the code
sudo update_dyld_shared_cache
Note: this is the Mac OS X equivalent of sudo ldconfig
11 - Run the code
make training
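Steps 6 and 7 above copy the icu4c flags by hand from the output of brew info. A minimal sketch of deriving them instead, assuming Homebrew's `brew --prefix icu4c` reports the same location as the path used in step 7. ICU_PREFIX is a hypothetical variable name introduced only for this sketch, and the configure command is printed rather than executed so it can be reviewed first:

```shell
# Default to the Homebrew location used in step 7. ICU_PREFIX is a
# hypothetical variable name, not something the Tesseract build reads.
ICU_PREFIX=${ICU_PREFIX:-/usr/local/opt/icu4c}

# If Homebrew is available, prefer its answer over the hard-coded default.
if command -v brew >/dev/null 2>&1; then
  ICU_PREFIX=$(brew --prefix icu4c)
fi

CPPFLAGS="-I${ICU_PREFIX}/include"
LDFLAGS="-L${ICU_PREFIX}/lib"

# Print the configure command instead of running it.
echo "./configure CPPFLAGS=${CPPFLAGS} LDFLAGS=${LDFLAGS}"
```

If the printed paths match what brew info icu4c shows, the command can be pasted into the tesseract folder as in step 7.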
Creating ScrollView.jar - tesseract 4.0
Reference:
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#lstmtraining-command-line
https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging
Important: Use JDK 8 to build; otherwise the build returns an error
Steps
1 - Download the files piccolo2d-core-3.0.jar and piccolo2d-extras-3.0.jar
http://search.maven.org/remotecontent?filepath=org/piccolo2d/piccolo2d-core/3.0/piccolo2d-core-3.0.jar
http://search.maven.org/remotecontent?filepath=org/piccolo2d/piccolo2d-extras/3.0/piccolo2d-extras-3.0.jar
2 - Move the files piccolo2d-core-3.0.jar and piccolo2d-extras-3.0.jar to
tesseract/java
3 - Enter the tesseract/java folder
cd java
4 - Set the var SCROLLVIEW_PATH to your tesseract/java folder and run the
code
SCROLLVIEW_PATH=~/projects/tesseract/java make ScrollView.jar
Training Font - tesseract 4.0
Reference:
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#user-content-using-tesstrain
Steps
1 - Clone the langdata dir from git
git clone https://github.com/tesseract-ocr/langdata
2 - Enter the tesseract folder
cd ..
3 - Execute this code and select one font from the list (I recommend
"Verdana")
text2image --list_available_fonts --fonts_dir=/Library/Fonts
Font directories on Mac OS X can be:
~/Library/Fonts
/Library/Fonts/
/Network/Library/Fonts/
/System/Library/Fonts/
/System Folder/Fonts/
More details here: https://support.apple.com/en-us/HT201722
4 - Replace line 195 of the file tesseract/training/tesstrain_utils.sh
- export FONT_CONFIG_CACHE=$(mktemp -d --tmpdir font_tmp.XXXXXXXXXX)
+ export FONT_CONFIG_CACHE=$(mktemp -d -t font_tmp.XXXXXXXXXX)
Note: this is a fix for the error:
mktemp: illegal option -- -
usage: mktemp [-d] [-q] [-t prefix] [-u] template ...
mktemp [-d] [-q] [-u] -t prefix
/Users/username/projects/tesseract/training/tesstrain_utils.sh: line 197: /sample_text.txt: Permission denied
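The error above comes from the BSD mktemp on Mac OS X rejecting the GNU-only --tmpdir option. As an alternative to the -t form, here is a minimal sketch that passes a full-path template, which both BSD and GNU mktemp accept. The helper name make_font_cache_dir is hypothetical and is not part of tesstrain_utils.sh:

```shell
# Hypothetical helper (not part of tesstrain_utils.sh): create a temporary
# directory using only the template form of mktemp, which both BSD (macOS)
# and GNU mktemp accept. Falls back to /tmp when TMPDIR is unset.
make_font_cache_dir() {
  mktemp -d "${TMPDIR:-/tmp}/font_tmp.XXXXXXXXXX"
}

FONT_CONFIG_CACHE=$(make_font_cache_dir)
export FONT_CONFIG_CACHE
echo "font cache dir: ${FONT_CONFIG_CACHE}"
```

The trailing X characters are replaced by mktemp with a unique suffix, so repeated runs produce distinct directories.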
5 - Clone the tessdata repo from git (I recommend "tessdata_best"
since it is the most accurate; "tessdata_fast" is just faster)
git clone https://github.com/tesseract-ocr/tessdata_best
or
git clone https://github.com/tesseract-ocr/tessdata_fast
6 - Copy tessdata_best/eng.traineddata (for English training) from
the tessdata repo you just cloned and paste it into tesseract/tessdata/
7 - Create the training data
PANGOCAIRO_BACKEND=fc \
~/projects/tesseract/training/tesstrain.sh \
--fonts_dir /Library/Fonts \
--lang eng \
--linedata_only \
--noextract_font_properties \
--exposures "0" \
--langdata_dir ~/projects/langdata \
--tessdata_dir ~/projects/tesseract/tessdata \
--fontlist "Verdana" \
--output_dir ~/tesstutorial/engtrain
Add the prefix PANGOCAIRO_BACKEND=fc if using MAC OSX
8 - Create a second set of training data using another font, for comparison
PANGOCAIRO_BACKEND=fc \
~/projects/tesseract/training/tesstrain.sh \
--fonts_dir /Library/Fonts \
--lang eng \
--linedata_only \
--noextract_font_properties \
--exposures "0" \
--langdata_dir ~/projects/langdata \
--tessdata_dir ~/projects/tesseract/tessdata \
--fontlist "Times New Roman," \
--output_dir ~/tesstutorial/engeval
Add the prefix PANGOCAIRO_BACKEND=fc if using MAC OSX
9 - Create the needed folder
mkdir -p ~/tesstutorial/engoutput
10 - Start the training
SCROLLVIEW_PATH=~/projects/tesseract/java \
~/projects/tesseract/training/lstmtraining \
--debug_interval 100 \
--traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
--net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
--model_output ~/tesstutorial/engoutput/base \
--learning_rate 20e-4 \
--train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \
--max_iterations 5000 &>~/tesstutorial/engoutput/basetrain.log
If you failed to build ScrollView.jar, set debug_interval to -1:
--debug_interval -1
11 - Monitor the log on another console
tail -f ~/tesstutorial/engoutput/basetrain.log
12 - Test Accuracy with other font
~/projects/tesseract/training/lstmeval \
--model ~/tesstutorial/engoutput/base_checkpoint \
--traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
13 - Test Accuracy with best traindata
~/projects/tesseract/training/lstmeval \
--model ~/projects/tessdata_best/eng.traineddata \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
14 - Test Accuracy with actual traindata (in this case the same as step 13)
~/projects/tesseract/training/lstmeval \
--model ~/projects/tesseract/tessdata/eng.traineddata \
--eval_listfile ~/tesstutorial/engtrain/eng.training_files.txt
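Step 10 above says to use --debug_interval -1 when ScrollView.jar could not be built. A minimal sketch of making that choice automatically, assuming the jar would sit at $SCROLLVIEW_PATH/ScrollView.jar (the java folder from the ScrollView section). DEBUG_INTERVAL is a hypothetical variable name introduced for this sketch:

```shell
# Location where the ScrollView section builds the jar (assumed default).
SCROLLVIEW_PATH=${SCROLLVIEW_PATH:-$HOME/projects/tesseract/java}

# Use the interval from step 10 if the viewer exists, otherwise -1
# as recommended in the note to step 10.
if [ -f "${SCROLLVIEW_PATH}/ScrollView.jar" ]; then
  DEBUG_INTERVAL=100
else
  DEBUG_INTERVAL=-1
fi

echo "--debug_interval ${DEBUG_INTERVAL}"
```

The printed flag can then replace the hard-coded --debug_interval 100 line in the lstmtraining command.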
Fine tuning - tesseract 4.0
Reference:
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-impact
Steps
1 - Create the necessary folder
mkdir -p ~/tesstutorial/verdana_from_small
2 - Start fine-tuning
~/projects/tesseract/training/lstmtraining \
--model_output ~/tesstutorial/verdana_from_small/verdana \
--continue_from ~/tesstutorial/engoutput/base_checkpoint \
--traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
--train_listfile ~/tesstutorial/engeval/eng.training_files.txt \
--max_iterations 1200
3 - Validate the progress
~/projects/tesseract/training/lstmeval \
--model ~/tesstutorial/verdana_from_small/verdana_checkpoint \
--traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
4 - Create the necessary folder
mkdir -p ~/tesstutorial/verdana_from_full
5 - Extract the LSTM model from the trained data (combine_tessdata -e)
~/projects/tesseract/training/combine_tessdata \
-e ~/projects/tesseract/tessdata/eng.traineddata \
~/tesstutorial/verdana_from_full/eng.lstm
6 - Train merged data
~/projects/tesseract/training/lstmtraining \
--model_output ~/tesstutorial/verdana_from_full/verdana \
--continue_from ~/tesstutorial/verdana_from_full/eng.lstm \
--traineddata ~/projects/tesseract/tessdata/eng.traineddata \
--train_listfile ~/tesstutorial/engeval/eng.training_files.txt \
--max_iterations 400
7 - Validate the results on the main training file
~/projects/tesseract/training/lstmeval \
--model ~/tesstutorial/verdana_from_full/verdana_checkpoint \
--traineddata ~/projects/tesseract/tessdata/eng.traineddata \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
8 - Validate the results on our training file
~/projects/tesseract/training/lstmeval \
--model ~/tesstutorial/verdana_from_full/verdana_checkpoint \
--traineddata ~/projects/tesseract/tessdata/eng.traineddata \
--eval_listfile ~/tesstutorial/engtrain/eng.training_files.txt
Fine tuning add ± character - tesseract 4.0
Reference:
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters
Steps
1 - Modify langdata/eng/eng.training_text and include these lines:
alkoxy of LEAVES ±1.84% by Buying curved RESISTANCE MARKED Your (Vol. SPANIEL
TRAVELED ±85¢ , reliable Events THOUSANDS TRADITIONS. ANTI-US Bedroom Leadership
Inc. with DESIGNS self; ball changed. MANHATTAN Harvey's ±1.31 POPSET Os—C(11)
VOLVO abdomen, ±65°C, AEROMEXICO SUMMONER = (1961) About WASHING Missouri
PATENTSCOPE® # © HOME SECOND HAI Business most COLETTI, ±14¢ Flujo Gilbert
Dresdner Yesterday's Dilated SYSTEMS Your FOUR ±90° Gogol PARTIALLY BOARDS firm
Email ACTUAL QUEENSLAND Carl's Unruly ±8.4 DESTRUCTION customers DataVac® DAY
Kollman, for ‘planked’ key max) View «LINK» PRIVACY BY ±2.96% Ask! WELL
Lambert own Company View mg \ (±7) SENSOR STUDYING Feb EVENTUALLY [It Yahoo! Tv
United by #DEFINE Rebel PERFORMED ±500Gb Oliver Forums Many | ©2003-2008 Used OF
Avoidance Moosejaw pm* ±18 note: PROBE Jailbroken RAISE Fountains Write Goods (±6)
Oberflachen source.” CULTURED CUTTING Home 06-13-2008, § ±44.01189673355 €
netting Bookmark of WE MORE) STRENGTH IDENTICAL ±2? activity PROPERTY MAINTAINED
2 - Generate the training file
PANGOCAIRO_BACKEND=fc \
~/projects/tesseract/training/tesstrain.sh \
--fonts_dir /Library/Fonts \
--lang eng \
--linedata_only \
--noextract_font_properties \
--langdata_dir ~/projects/langdata \
--tessdata_dir ~/projects/tesseract/tessdata \
--fontlist "Times New Roman," \
"Times New Roman, Bold" \
"Times New Roman, Bold Italic" \
"Times New Roman, Italic" \
"Courier New" \
"Courier New Bold" \
"Courier New Bold Italic" \
"Courier New Italic" \
--output_dir ~/tesstutorial/trainplusminus
3 - Generate the eval data
PANGOCAIRO_BACKEND=fc \
~/projects/tesseract/training/tesstrain.sh \
--fonts_dir /Library/Fonts \
--lang eng \
--linedata_only \
--noextract_font_properties \
--langdata_dir ~/projects/langdata \
--tessdata_dir ~/projects/tesseract/tessdata \
--fontlist "Verdana" \
--output_dir ~/tesstutorial/evalplusminus
4 - Extract the LSTM model from the existing trained data (combine_tessdata -e)
~/projects/tesseract/training/combine_tessdata \
-e ~/projects/tesseract/tessdata/eng.traineddata \
~/tesstutorial/trainplusminus/eng.lstm
5 - Fine tuning
~/projects/tesseract/training/lstmtraining \
--model_output ~/tesstutorial/trainplusminus/plusminus \
--continue_from ~/tesstutorial/trainplusminus/eng.lstm \
--traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \
--old_traineddata ~/projects/tesseract/tessdata/eng.traineddata \
--train_listfile ~/tesstutorial/trainplusminus/eng.training_files.txt \
--max_iterations 3600
6 - Test the result on other fonts
~/projects/tesseract/training/lstmeval \
--model ~/tesstutorial/trainplusminus/plusminus_checkpoint \
--traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \
--eval_listfile ~/tesstutorial/trainplusminus/eng.training_files.txt
7 - Test the result on the main font
~/projects/tesseract/training/lstmeval \
--model ~/tesstutorial/trainplusminus/plusminus_checkpoint \
--traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \
--eval_listfile ~/tesstutorial/evalplusminus/eng.training_files.txt
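Before generating the training data in step 2, it may help to confirm that the lines from step 1 actually made it into the training text. A minimal sketch, assuming the langdata clone lives at ~/projects/langdata as in the commands above. The helper count_char_lines and the variable TRAINING_TEXT are hypothetical names introduced here:

```shell
# count_char_lines FILE STRING: print how many lines of FILE contain STRING.
# -F treats STRING as a fixed string, so no regex escaping is needed.
count_char_lines() {
  grep -c -F -- "$2" "$1"
}

# Assumed location of the training text edited in step 1.
TRAINING_TEXT=${TRAINING_TEXT:-$HOME/projects/langdata/eng/eng.training_text}

if [ -f "$TRAINING_TEXT" ]; then
  echo "lines containing ±: $(count_char_lines "$TRAINING_TEXT" '±')"
else
  echo "training text not found at $TRAINING_TEXT"
fi
```

A count of 0 would suggest the edit in step 1 was not saved to the file that tesstrain.sh reads.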
|
@FernandoGOT Thank you. As you know, @Shreeshrii mentioned a problem with the fine-tune training. I hope this page will reflect that soon. Thank you |
This is a great resource! It would be even more amazing if it were in the form of a pull request of changes to the existing documentation so that it could be improved to avoid these problems for other OS X users. |
I followed @FernandoGOT steps but I am getting: |
@kas84 please post results of tesseract -v Version info. Are you using latest source from Github ? |
@Shreeshrii I cloned the repo like so |
tesseract -v |
Yeah, I forgot, sorry!
|
Usually tesseract -v should also show the tesseract version. Is the error only with --list-langs? Are you able to recognize any test images? |
My bad:
It also happens when trying to recognize an image, yes. |
What commands are you using? What tessdata-dir are you using? Eg. Where is eng.traineddata installed? |
What output do you get with the following? Use ./tessdata if you have copied eng.traineddata there.
Page 1 The quick brown dog jumped over the |
The space here confuses the command line options parser. |
Has any one built a dockerfile out of this ? |
No. It was due to wrong command line usage. |
Please use the forum for asking questions. |
Okay, sorry! |
@FernandoGOT Thank you very much for such a detailed explanation but I can't make it work. When I say "make training" it gives me "Need to reconfigure project, so there are no errors" error. Also, I couldn't create ScrollView.jar. Is it possible to update this post? Thank you. |
@ysnnzlcn I'm short on time these days (working too much), but when I get some free time I'm going to write a better step-by-step guide to using tesseract and send a pull request to the docs |
@FernandoGOT That would be great, looking forward to it. Thanks |
Under Training Font -- Tesseract 4.0, Step 7, I get a failure:
I have:
My user is allowed to create files in that directory, and the directory itself is present. Please advise. |
Hi, when I try installing this it breaks here:
I really would like to get this working - I've spent a lot of time getting something running...any help or pointers to instructions would be greatly appreciated.. |
@FernandoGOT @Shreeshrii: can you put the instructions on the wiki? I would like to close this issue (related to the build process). It is too long, and other people have mixed other topics (training) into it. |
I am having this issue too, has this been resolved here or somewhere else?? |
@jamesoneill54 https://stackoverflow.com/questions/33259191/installing-libicu-dev-on-mac/33352241 this worked for me |
I suggest to close this issue. Part of the information given here is no longer up to date. |
@amitdo You can find my edits in the history for the wiki page.
That's not something I have time to tackle.
@stweil I suggested exactly that back in Oct 2018, so obviously agree. :) If people run into new problems, they can open new issues (or just update the wiki with the necessary corrections). |
Did anyone manage to overcome the following error:
And if so how? |
|
@stweil How do I diagnose which requirements are missing and why |
nvm,
|
Obviously you found the answer yourself: |
I am getting an error when running 'text2image --list_available_fonts --fonts_dir=/Library/Fonts'. Error: 'text2image: not found'. Can you please suggest how I can tackle this issue? macOS: 10.14.6 |
@khalajink, I suggest to ask for help at the user forum. |
@khalajink Did you install the training tools (including text2image)? If so, where are they? Make sure you've included them on your $PATH. |
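A minimal sketch of the check suggested above, assuming the training tools are the four binaries invoked in the guide (text2image, lstmtraining, lstmeval, combine_tessdata). The function name check_training_tools is hypothetical, and the directory in the final hint is only the location used in the guide's commands, which may differ on your build:

```shell
# Report which of the training tools named in this thread are on $PATH.
# Returns the number of missing tools (0 means all were found).
check_training_tools() {
  missing=0
  for tool in text2image lstmtraining lstmeval combine_tessdata; do
    if tool_path=$(command -v "$tool" 2>/dev/null); then
      echo "found:   $tool -> $tool_path"
    else
      echo "missing: $tool"
      missing=$((missing + 1))
    fi
  done
  return "$missing"
}

# The export line below is a hint printed for the user, not executed.
check_training_tools || echo 'hint: export PATH="$HOME/projects/tesseract/training:$PATH"'
```

Adding the export line to ~/.bash_profile (or ~/.zshrc) would make the change persist across terminal sessions.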
@jtlz2 I have followed @FernandoGOT's comment. I do not see an installation step for text2image there; I suppose it comes along with icu4c. How do I include it in $PATH? The error occurs when I try to run 'text2image --list_available_fonts --fonts_dir=/Library/Fonts'. Also, I see that you had an issue related to the pango version 3 days ago; I am facing this too, although I already have pango 1.44.6 installed. How did you solve it? |
Solved the pango issue by following https://stackoverflow.com/questions/55361379/osx-compiling-training-tools-for-tesseract-4-0-pango-libraries-not-found
|
@khalajink Yes, see my answer in that SO thread https://stackoverflow.com/a/57968945/1021819 |
@jtlz2 Yes, I followed your answer and got the pango issue fixed, but the text2image issue still exists. Any idea about it?
|
Thanks for the answer. The commands you shared didn't work for me but the instruction on how to diagnose the issue helped a lot. It turns out that I do not have |
I have a different but slightly similar problem in 2020 still. I've successfully installed the latest Tesseract (master branch) on the latest OSX (11.1 Big Sur).
However, my training tools (even though they have been installed) could not find the actual files. For example, if I call a text2image I see the following error message
If I enable Debug for the bash script I see the following problem
Basically, none of the training tools can find their actual executable files, which are located under `tesseract/.libs/`. Did I miss something during the configuration? |
@nnnikolay, I am sorry, that was my fault. It is now fixed with commit 421ebf0. |
Builds which were configured with --enable-shared did install the wrong files. Using libtool fixes that. Add also other flags which are used by the automake default install. Signed-off-by: Stefan Weil <sw@weilnetz.de>
wow, @stweil thank you for your swift reaction. it seems that this step works now! |
You can see the error detail in |