Format of train_listfile #841
Sample attached. The file is created by the tesstrain.sh process. If on Unix, you can try the following to create the file:
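A minimal sketch, assuming a Unix shell and that the generated .lstmf files are in the current directory:
ls -1 "$PWD"/*.lstmf > eng.training_files.txt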
You may need to give the path before *.lstmf or the next step will not find the files. |
The line break must be a Unix-style line ending (LF): http://stackoverflow.com/questions/8195839/choose-newline-character-in-notepad |
Thanks @Shreeshrii, thanks @amitdo! This raises a new question: how do I generate .lstmf files? I'm trying to train Tesseract on New York city directories (https://digitalcollections.nypl.org/items/b42866fb-b877-e4fc-e040-e00a1806275e); I have box files and TIFs. (Another question: can I already use WordStr box files? Some parts of the documentation say I can, others say I can't.) ZIP file with one TIF and box file I'm trying to use: Wilson1852_0.zip (https://github.com/tesseract-ocr/tesseract/files/958420/Wilson1852_0.zip). Out of the box, Tesseract already performs pretty well, but 150 years ago house numbers in New York sometimes included ½, so I have to include this character in the training data. |
WordStr box files are not yet supported (AFAIK).
If you have box files in 3.0 format, you can use jTessBoxEditor to add the end-of-line tab character and use them.
When I want to test using box/tiff pairs, I copy the files to the training directory by modifying tesstrain.sh:
mkdir -p ${TRAINING_DIR}
tlog "\n=== Starting training for language '${LANG_CODE}'"
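# Copy your box/tiff pairs into the training directory here (the commented-out paths below are examples):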
#cp /home/shree/tesstutorial/larmbig/*.tif "${TRAINING_DIR}/"
#cp /home/shree/tesstutorial/larmbig/*.box "${TRAINING_DIR}/"
Then use a command similar to the following (adjusted for the location of your files), and use just one font similar to the one used in your box/tiff pairs. You may need to modify tesstrain_utils.sh to make sure that all your box/tiff pairs are selected (based on their naming).
training/tesstrain.sh \
--fonts_dir /mnt/c/Windows/Fonts \
--training_text ../langdata/eng/eng.training_text \
--langdata_dir ../langdata \
--tessdata_dir ./tessdata \
--lang eng \
--linedata_only \
--noextract_font_properties \
--exposures "0" \
--fontlist "Arial" \
--output_dir ~/tesstutorial/engtest
|
What's the output for the 388½ in this example and in other places in this book? |
@amitdo 388½ becomes 3884. |
@Shreeshrii: but I have no |
I'm afraid this way of training is not well documented right now. I have not yet tried training with 4.00. |
@amitdo Is it not well documented, or not yet possible at all? |
@Shreeshrii Do you have examples of this process? |
Your box file is in WordStr format. That cannot be used with the existing process. If you had a box file in the older 3.04 format, then the hacked version of the script would work.
|
I can provide a box file in 3.04 format tomorrow, I'll post the file here. |
As said, the WordStr format is not really supported right now. You can still train with the regular box format plus tab lines to signal line breaks. Training from 'real' images, as opposed to synthetic ones (generated with text2image), is what's not well documented. |
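For illustration, a rough sketch of that format (the coordinates here are made up, and <tab> stands for a literal tab character): each box line is "character left bottom right top page", and a line whose character field is a tab marks the end of a text line.
3 52 910 70 934 0
8 72 910 90 934 0
8 94 910 112 934 0
½ 116 910 138 934 0
<tab> 52 910 138 934 0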
@bertspaan the desired_characters file is not directly used for training. It is used at Google for building the large training text required for LSTM training. I couldn't find any font which renders 1/2 the way it is printed here, so it may be difficult to create a synthetic image for it. |
Update: The LSTM training process has been modified since this post was written. These scripts will not work as-is; you can use them as a reference. Here are the modified scripts:
You cannot use the finetune process because 1/2 is not included in the unicharset of the current LSTM traineddata for English. @theraysmith, will this change with your next update?
The following commands outline the process you may need to follow to do the LSTM training with a replaced top layer.
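For orientation, a minimal sketch of the replace-top-layer commands along the lines of the TrainingTesseract-4.00 wiki; all paths, the iteration count, and the '[Lfx256 O1c111]' output size are assumptions (the output size must match the number of characters in your unicharset).
# Extract the LSTM model from an existing traineddata file.
combine_tessdata -e ./tessdata/eng.traineddata ~/tesstutorial/eng.lstm
# Retrain with the top (output) layer replaced, so characters such as ½ can be added.
lstmtraining \
  --continue_from ~/tesstutorial/eng.lstm \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --append_index 5 --net_spec '[Lfx256 O1c111]' \
  --model_output ~/tesstutorial/eng_half/base \
  --train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \
  --max_iterations 3000
# Package the resulting checkpoint into a usable traineddata file.
lstmtraining --stop_training \
  --continue_from ~/tesstutorial/eng_half/base_checkpoint \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --model_output ~/tesstutorial/eng_half/eng.traineddata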
|
Thanks so much, I will try all this next week! |
@Shreeshrii: ha, that's my repository! |
I see that you have trained models for ocropy. Is there anything you want to share about ocropy vs. Tesseract 4.00, accuracy wise, with your dataset? |
@bertspaan :-) Since you already have an OCR process working, I suggest you wait for Ray to update the code for training from scanned images and improve the traineddata to support 1/2. My hacked training is only a proof of concept (I trained to only about 2% accuracy), so while it recognizes 1/2 as %, other letters may not be as accurate as the traineddata from the repo. |
@amitdo: yes, we've trained ocropy on a very small number of sentences, and already the results are pretty good.
@Shreeshrii: OK, I'll try some of the commands you've posted here, but I'm not going to spend much time on trying to train Tesseract; I'll wait until training from scanned images is improved. We are also building dictionaries of possible names, streets and professions, so we should be able to fix many OCR errors afterwards. Thank you both so much for your help! |
@Shreeshrii |
Please see the wiki page on training; there have been changes made to the LSTM training process. |
"combine_lang_model which takes as input an input_unicharset and script_dir (script_dir points to the langdata directory) and optional word list files..." I have got the input_unicharset, but I don't know how I can get script_dir. |
https://github.com/tesseract-ocr/langdata
is the script_dir.
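A minimal sketch of how that fits together, assuming langdata is cloned into the current directory and you already have eng.unicharset (the word list and output directory names are assumptions):
git clone https://github.com/tesseract-ocr/langdata
combine_lang_model \
  --input_unicharset ./eng.unicharset \
  --script_dir ./langdata \
  --words ./langdata/eng/eng.wordlist \
  --output_dir ./output \
  --lang eng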
|
@CoCa520 Also see #590 (comment) |
@Shreeshrii Thank you! But I really can't understand how I can create the lstm files. Can you show me the code? I have tried:
tesseract eng.font.exp0.tif eng.font.exp0.box.lstm.train
But it gives "Error during processing." followed by three ObjectCache "WARNING! LEAK!" messages for the punc-dawg, word-dawg and number-dawg components of eng.traineddata. |
tesseract eng.font.exp0.tif eng.font.exp0.box lstm.train
You need a space after 'box' to give the name of the config file. The best method is to follow the training tutorial. If you want more pages, change max_page in tesstrain_utils.sh.
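To make that concrete, a sketch using the file names from this thread (the output file name is my reading of how the output base works, not something stated above):
# With a space before lstm.train, "eng.font.exp0.box" is the output base and
# lstm.train is the config, so this should write eng.font.exp0.box.lstmf.
tesseract eng.font.exp0.tif eng.font.exp0.box lstm.train
# List the generated .lstmf file(s) with full paths for use as the train_listfile.
ls -1 "$PWD"/*.lstmf > eng.training_files.txt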
|
Training tutorial? Do you mean https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00? But I just have tif/box pairs, so I came here for more information. |
4.0 training with tif/box pairs is not yet supported.
|
@Shreeshrii I want to use tesseract 4.00 to recognize the models of machines. The model information is a combination of characters and numbers located somewhere on a nameplate, so I have collected lots of pictures containing the various nameplates of each machine. After a series of processing steps, I have got lots of pictures of the models, for example:
https://user-images.githubusercontent.com/22894599/29905526-d8207088-8e41-11e7-8a94-60661df186c8.png
https://user-images.githubusercontent.com/22894599/29905561-ff00a722-8e41-11e7-934d-8e87c61433df.png
I then put all the model pictures into tesseract for recognition, but the accuracy is not so good, so I am trying to train tesseract 4.00 with the model pictures. The tesseract 4.0 training tutorial says that there are two ways to create training data, and I used the first option: each line in the box file matches a 'character' (glyph) in the tiff image. If 4.0 training with tif/box pairs is not yet supported, then what can I do to raise the accuracy? |
Others who have done license plate recognition may be able to give you better tips.
For your use case, I think using an older version of tesseract, especially one which supports the 'digits' config file for limiting output to numbers, may be a better choice than using 4.0alpha.
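Purely to illustrate that suggestion, a sketch with the 3.x engine; the image name is made up, and whether the stock 'digits' config or a whitelist fits depends on the characters that actually appear in your model strings:
# Limit output to digits using the stock 'digits' config (tesseract 3.x).
tesseract nameplate.tif output digits
# Or whitelist the exact characters used in the model numbers.
tesseract nameplate.tif output -c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ-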
|
also see https://github.com/openalpr/openalpr which uses tesseract-ocr
|
@minly, hello, I have the same problems as you; have you resolved them? |
@CoCa520, hello, did you finally manage to generate the lstm files? I want to know how to generate lstm files from *.tif and *.box files. |
Bumping this. I tried running the steps mentioned here: #841 (comment). I'm getting this error:
It does not create a
I'll try to compare what edits @Shreeshrii put in place, but any guidance would be appreciated.
Edit: I've diffed the three files against a version from April, grabbed the "intent" of @Shreeshrii's edits, and applied them to the newest versions of the three files.
While editing, I saw that @Shreeshrii was creating a folder named "${LANG_DATA_DIR}/GT" and copying TIF/BOX files in and out of it. I tried placing my TIF/BOX pairs in the "GT" folder, and to be honest I have no clarity on what the |
GT was my groundtruth folder, in which I also copied the box/tiff pairs for future reference. They are NOT used in further training, as LSTM training now uses the generated lstmf files and starter traineddata. |
@zdenop Please close this issue. |
The documentation does not seem to specify the format of the text files required by the train_listfile option. Are there examples available of eng.training_files.txt?
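Pulling together the answers above: the file given to train_listfile is plain text, one path to an .lstmf file per line (full paths are safest), with Unix (LF) line endings. A hypothetical eng.training_files.txt might look like:
/home/user/tesstutorial/engtrain/eng.Arial.exp0.lstmf
/home/user/tesstutorial/engtrain/eng.Arial.exp1.lstmf
/home/user/tesstutorial/engtrain/eng.Arial.exp2.lstmf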