
Format of train_listfile #841

Closed
bertspaan opened this issue Apr 25, 2017 · 38 comments

Comments

@bertspaan

The documentation does not seem to specify the format of the text files required by the train_listfile option. Are there examples available of eng.training_files.txt?

@Shreeshrii
Collaborator

eng.training_files.txt

Sample attached. The file is created by the tesstrain.sh process. On Unix you can try the following to create the file:

ls -1 *.lstmf > lang.training_files.txt

You may need to include the path before *.lstmf, or the next step will not find the files.
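
If the next step will be run from a different directory, listing the files with absolute paths avoids that problem; a minimal sketch (the list file name follows the example above):

ls -1 "$PWD"/*.lstmf > lang.training_files.txt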

@amitdo
Collaborator

amitdo commented Apr 26, 2017

The line break must be \n. This is what is inserted automatically when you hit the Enter key on the keyboard in Linux/macOS. On Windows the default is \r\n, which will confuse Tesseract.

http://stackoverflow.com/questions/8195839/choose-newline-character-in-notepad
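
If the list file was created on Windows, the CRLF line endings can be converted with standard tools; for example (GNU sed shown, on macOS use sed -i ''):

dos2unix lang.training_files.txt
# or equivalently:
sed -i 's/\r$//' lang.training_files.txt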

@bertspaan
Author

bertspaan commented Apr 26, 2017

Thanks @Shreeshrii, thanks @amitdo!

This raises a new question: how do I generate .lstmf files? I'm trying to train Tesseract on New York City directories; I have box files and TIFs. (Another question: can I already use WordStr box files? Some parts of the documentation say I can, others say I can't.)

ZIP file with one TIF and box file I'm trying to use: Wilson1852_0.zip. Out of the box, Tesseract already performs pretty well, but 150 years ago house numbers in New York sometimes included ½, so I have to include this character in the desired_characters file:

[image]

@Shreeshrii
Collaborator

Shreeshrii commented Apr 26, 2017 via email

@amitdo
Collaborator

amitdo commented Apr 26, 2017

What's the output for the 388½ in this example and in other places in this book?

@bertspaan
Author

@amitdo 388½ becomes 3884.

@bertspaan
Author

@Shreeshrii: but I have no fonts_dir, fontlist, etc., since I am only training from images.

@amitdo
Collaborator

amitdo commented Apr 26, 2017

I'm afraid this way of training is not well documented right now.

I have not yet tried training with 4.00.

@bertspaan
Author

@amitdo Is it not well documented, or not yet possible at all?

@bertspaan
Author

@Shreeshrii Do you have examples of this process?

@Shreeshrii
Collaborator

Shreeshrii commented Apr 26, 2017 via email

@bertspaan
Author

I can provide a box file in 3.04 format tomorrow, I'll post the file here.

@amitdo
Collaborator

amitdo commented Apr 26, 2017

As said, the WordStr format is not really supported right now.

You can still train with the regular box format + tab lines to signal line breaks.
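
For LSTM training, a box file has one line per glyph in the form <symbol> <left> <bottom> <right> <top> <page>, and the end of each text line is marked by a line whose symbol is a literal tab character. A rough sketch with made-up coordinates (where <tab> stands for a literal tab character):

3 210 1804 230 1836 0
8 232 1804 252 1836 0
8 254 1804 274 1836 0
½ 276 1804 300 1836 0
<tab> 300 1804 300 1836 0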

Training from 'real' images, as opposed to synthetic ones (made with text2image), is what's not well documented.

@Shreeshrii
Collaborator

Out of the box, Tesseract already performs pretty well, but 150 years ago, house numbers in New York sometimes included ½, so I have to include this character in the desired_characters file:

@bertspaan the desired_characters file is not directly used for training. It is used at Google for building the large training text required for LSTM training.

I couldn't find any font which has ½ the way it is printed here, so it may be difficult to create a synthetic image for it.

@Shreeshrii
Collaborator

Shreeshrii commented Apr 27, 2017

Update: The LSTM training process has been modified since this post was written. These scripts will not work as-is; you can use them as a reference.

Here are the modified scripts:
boxtrain.zip
You will need to copy your box/tiff pairs to the ../langdata/eng/ directory for them to be used.

You cannot use the fine-tuning process because ½ is not included in the unicharset of the current LSTM traineddata for English. @theraysmith, will this change with your next update?

The following commands outline the process you may need to follow to do the LSTM training (replacing the top layer).


training/boxtrain.sh \
  --fonts_dir  /mnt/c/Windows/Fonts \
  --training_text ../langdata/eng/nyd.training_text \
  --langdata_dir ../langdata \
  --tessdata_dir ./tessdata \
  --lang eng  \
  --exposures "-2 -1 0" \
  --fontlist "Century Schoolbook" "Dejavu Serif" "Garamond" "Liberation Serif" "Times New Roman," "FreeSerif" "Georgia" \
  --output_dir ~/tesstutorial/nydlegacy
  
cp ~/tesstutorial/nydlegacy/eng.traineddata ./tessdata/nydlegacy.traineddata

training/boxtrain.sh \
  --fonts_dir  /mnt/c/Windows/Fonts \
  --training_text ../langdata/eng/nyd.training_text \
  --langdata_dir ../langdata \
  --tessdata_dir ./tessdata \
  --lang eng \
  --linedata_only \
  --noextract_font_properties \
  --exposures "-2 -1" \
  --fontlist "Bookman Old Style Semi-Light"  \
  --output_dir ~/tesstutorial/nyd
  
rm -rf ~/tesstutorial/eng_from_nyd
mkdir -p ~/tesstutorial/eng_from_nyd

combine_tessdata -e ../tessdata/eng.traineddata \
   ~/tesstutorial/eng_from_nyd/eng.lstm

lstmtraining  \
   -U ~/tesstutorial/nyd/eng.unicharset \
  --train_listfile ~/tesstutorial/nyd/eng.training_files.txt \
  --script_dir ../langdata   \
  --append_index 5 --net_spec '[Lfx256 O1c105]' \
  --continue_from ~/tesstutorial/eng_from_nyd/eng.lstm \
  --model_output ~/tesstutorial/eng_from_nyd/nyd \
  --debug_interval -1 \
  --target_error_rate 0.01
   
lstmtraining \
  --continue_from ~/tesstutorial/eng_from_nyd/nyd_checkpoint \
  --model_output ~/tesstutorial/eng_from_nyd/nyd.lstm \
  --stop_training

cp ../tessdata/eng.traineddata ~/tesstutorial/eng_from_nyd/nyd.traineddata
   
combine_tessdata -o ~/tesstutorial/eng_from_nyd/nyd.traineddata \
  ~/tesstutorial/eng_from_nyd/nyd.lstm \
  ~/tesstutorial/nyd/eng.lstm-number-dawg \
  ~/tesstutorial/nyd/eng.lstm-punc-dawg \
  ~/tesstutorial/nyd/eng.lstm-word-dawg 
 
cp ~/tesstutorial/eng_from_nyd/nyd.traineddata ./tessdata/nyd.traineddata
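
Once nyd.traineddata is in the tessdata directory, it can be used like any other language; for example (the image name is just illustrative):

tesseract Wilson1852_0.tif output --tessdata-dir ./tessdata -l nyd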

@bertspaan
Author

Thanks so much, I will try all this next week!

@Shreeshrii
Collaborator

@bertspaan
Author

@Shreeshrii: ha, that's my repository!

@amitdo
Collaborator

amitdo commented Apr 30, 2017

@bertspaan

I see that you have trained models for ocropy.

Is there anything you want to share about ocropy vs. Tesseract 4.00, accuracy wise, with your dataset?

@Shreeshrii
Collaborator

@bertspaan :-)

Since you already have an OCR process working, I suggest you wait for Ray to update the code for training from scanned images and improve the traineddata to support ½.

My hacked training is only a proof of concept (I trained till about 2% only), so while it recognizes ½ as %, other letters may not be as accurate as with the traineddata from the repo.

@bertspaan
Author

@amitdo: yes, we've trained ocropy on a very small number of sentences, and the results are already pretty good. See 1854-55.lines.ndjson.zip; this file contains all bounding boxes with ocropy output. However, ocropy sometimes crashes and its documentation is not very good, which is why we started experimenting with Tesseract 4 last week. I haven't compared the out-of-the-box output of Tesseract 4 with our trained ocropy model in detail.

@Shreeshrii: OK, I'll try some of the commands you've posted here, but I'm not going to spend much time trying to train Tesseract; I'll wait until training from scanned images is improved.

We are also building dictionaries of possible names, streets and professions, so we should be able to fix many OCR errors afterwards.

Thank you both so much for your help!

@minly

minly commented Aug 15, 2017

@Shreeshrii
I am also trying to fine-tune Tesseract 4.0 with images. I am confused by several of the parameters below.
First, what is the training_text (nyd.training_text) file? Do I need to create it? If yes, how do I create it?
Second, do I just need to specify --training_text and --output_dir while leaving the other parameters unchanged?

[image]

@Shreeshrii
Collaborator

Please see the wiki page on training; there have been changes to the LSTM training process.

@xiaochenFang

combine_lang_model, which takes as input an input_unicharset and script_dir (script_dir points to the langdata directory) and optional word list files...
I have the input_unicharset, but I don't know how to get script_dir.
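
For reference, script_dir normally just points at a local clone of the langdata repository; a minimal sketch (the paths and lang name are illustrative):

git clone https://github.com/tesseract-ocr/langdata.git ~/langdata
combine_lang_model \
  --input_unicharset my.unicharset \
  --script_dir ~/langdata \
  --output_dir ~/my_output \
  --lang eng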

@Shreeshrii
Collaborator

Shreeshrii commented Aug 16, 2017 via email

@Shreeshrii
Collaborator

@CoCa520 Also see #590 (comment)

@xiaochenFang

@Shreeshrii Thank you!
But I really can't understand how to create the lstmf files.
Can you show me the command?

I have tried:
tesseract eng.font.exp0.tif eng.font.exp0.box.lstm.train
But it gives:
Error during processing.
ObjectCache(0x7f098f0849a0)::~ObjectCache(): WARNING! LEAK! object 0x29173e0 still has count 1 (id /usr/local/tesseract/share/tessdata/eng.traineddatapunc-dawg)
ObjectCache(0x7f098f0849a0)::~ObjectCache(): WARNING! LEAK! object 0x2916420 still has count 1 (id /usr/local/tesseract/share/tessdata/eng.traineddataword-dawg)
ObjectCache(0x7f098f0849a0)::~ObjectCache(): WARNING! LEAK! object 0x2916240 still has count 1 (id /usr/local/tesseract/share/tessdata/eng.traineddatanumber-dawg)
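
For comparison, the training wiki generates .lstmf files by giving an output base name without extensions plus the lstm.train config; a minimal sketch using the file names above (the --psm value may need adjusting for your images):

tesseract eng.font.exp0.tif eng.font.exp0 --psm 6 lstm.train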

@Shreeshrii
Collaborator

Shreeshrii commented Aug 30, 2017 via email

@xiaochenFang

Training tutorial?
Do you mean https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 ?
But I just have tif/box pairs, so I came here for more information.

@Shreeshrii
Collaborator

Shreeshrii commented Aug 30, 2017 via email

@xiaochenFang

@Shreeshrii
I want to use Tesseract 4.00 to recognize machine model numbers. The model information is a combination of letters and digits located somewhere on a nameplate, so I have collected many pictures containing the nameplates of various machines.
After a series of processing steps, I have many pictures of model numbers, such as the following:
[image]
[image]
I then fed all the model pictures into Tesseract for recognition, but the accuracy is not very good, so I am trying to train Tesseract 4.00 with these model pictures.
The Tesseract 4.0 training tutorial says there are two ways to create training data, and I used the first option: each line in the box file matches a 'character' (glyph) in the tiff image.

If 4.0 training with tif/box pairs is not yet supported, what can I do to raise the accuracy?

@Shreeshrii
Collaborator

Shreeshrii commented Aug 31, 2017 via email

@Shreeshrii
Collaborator

Shreeshrii commented Aug 31, 2017 via email

@694376965

@minly, hello, I have the same problems as you. Have you resolved them?

@694376965

@CoCa520, hello, did you manage to generate the lstmf files in the end? I want to know how to generate lstmf files from *.tif and *.box files.

@dzjin

dzjin commented Nov 13, 2017

Bumping this..

I tried running the steps mentioned here: #841 (comment)

I'm getting this error:

ERROR: Non-existent flag -D
ERROR: /var/folders/vz/yqbfrgj91hqdj76mmpl2vjmw0000gn/T/tmp.W8q07ZtQ/eng/unicharset does not exist or is not readable

It does not create a .traineddata file, which is what I expected from the --linedata_only parameter.

I'll try to compare what edits @Shreeshrii put in place, but any guidance would be appreciated.

Edit:

boxtrain.zip

I've diffed the three files against the April versions, grabbed the "intent" of @Shreeshrii's edits, and applied it to the newest versions of the three files.

boxtrain/boxtrain.sh \
  --fonts_dir ~/Library/Fonts/ \
  --training_text ../langdata/eng/eng.training_text \
  --langdata_dir ../langdata \
  --tessdata_dir ./tessdata/ \
  --lang eng \
  --fontlist "Calibri" \
  --output_dir ./lstm1

While editing, I saw that @Shreeshrii was creating a folder named "${LANG_DATA_DIR}/GT" and copying TIF/BOX pairs in and out of it. I tried placing my TIF/BOX pairs in the "GT" folder, and to be honest I have no clarity on what the resulting .traineddata contains.

@Shreeshrii
Collaborator

@Shreeshrii was creating a folder named "${LANG_DATA_DIR}/GT", and copying TIF/BOX in/out from it. I tried placing my TIF/BOX pairs in the "GT" folder

GT was my ground-truth folder, into which I also copied the box/tiff pairs for future reference.

They are NOT used in further training, as LSTM training now uses the generated lstmf files and starter traineddata.
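
To see what a given traineddata file actually contains, combine_tessdata can list its components; a minimal sketch (the path is illustrative):

combine_tessdata -d ./tessdata/nyd.traineddata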

@Shreeshrii
Collaborator

@zdenop Please close this issue.

zdenop closed this as completed on Feb 25, 2019