
Format of train_listfile #841

Closed
bertspaan opened this issue Apr 25, 2017 · 38 comments

Comments

@bertspaan

The documentation does not seem to specify the format of the text files required by the train_listfile option. Are there examples available of eng.training_files.txt?

@Shreeshrii
Collaborator

eng.training_files.txt

Sample attached. The file is created by the tesstrain.sh process. On Unix you can try the following to create the file:

ls -1 *.lstmf > lang.training_files.txt

You may need to include the path before *.lstmf, or the next step will not find the files.
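
If the next step will be run from a different directory, listing the files with absolute paths avoids that problem; a minimal sketch (the list file name follows the example above):

ls -1 "$PWD"/*.lstmf > lang.training_files.txt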

@amitdo
Collaborator

amitdo commented Apr 26, 2017

The line break must be \n. This is what is inserted automatically when you hit the Enter key on the keyboard in Linux/macOS. On Windows the default is \r\n, which will confuse Tesseract.

http://stackoverflow.com/questions/8195839/choose-newline-character-in-notepad
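
If the list file was created on Windows, the CRLF line endings can be converted with standard tools; for example (GNU sed shown, on macOS use sed -i ''):

dos2unix lang.training_files.txt
# or equivalently:
sed -i 's/\r$//' lang.training_files.txt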

@bertspaan
Author

bertspaan commented Apr 26, 2017

Thanks @Shreeshrii, thanks @amitdo!

This raises a new question: how do I generate .lstmf files? I'm trying to train Tesseract on New York City directories; I have box files and TIFs. (Another question: can I already use WordStr box files? Some parts of the documentation say I can, others say I can't.)

ZIP file with one TIF and box file I'm trying to use: Wilson1852_0.zip. Out of the box, Tesseract already performs pretty well, but 150 years ago house numbers in New York sometimes included ½, so I have to include this character in the desired_characters file:

[image]

@Shreeshrii
Collaborator

Shreeshrii commented Apr 26, 2017 via email

@amitdo
Collaborator

amitdo commented Apr 26, 2017

What's the output for the 388½ in this example and in other places in this book?

@bertspaan
Author

@amitdo 388½ becomes 3884.

@bertspaan
Author

@Shreeshrii: but I have no fonts_dir, fontlist, etc., since I am only training from images.

@amitdo
Collaborator

amitdo commented Apr 26, 2017

I'm afraid this way of training is not well documented right now.

I have not yet tried training with 4.00.

@bertspaan
Author

@amitdo Is it not well documented, or not yet possible at all?

@bertspaan
Author

@Shreeshrii Do you have examples of this process?

@Shreeshrii
Collaborator

Shreeshrii commented Apr 26, 2017 via email

@bertspaan
Author

I can provide a box file in 3.04 format tomorrow, I'll post the file here.

@amitdo
Collaborator

amitdo commented Apr 26, 2017

As said, the WordStr format is not really supported right now.

You can still train with the regular box format + tab lines to signal line breaks.
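
For LSTM training, a box file has one line per glyph in the form <symbol> <left> <bottom> <right> <top> <page>, and the end of each text line is marked by a line whose symbol is a literal tab character. A rough sketch with made-up coordinates (where <tab> stands for a literal tab character):

3 210 1804 230 1836 0
8 232 1804 252 1836 0
8 254 1804 274 1836 0
½ 276 1804 300 1836 0
<tab> 300 1804 300 1836 0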

Training from 'real' images, as opposed to synthetic ones (made with text2image), is what's not well documented.

@Shreeshrii
Collaborator

Out of the box, Tesseract already performs pretty well, but 150 years ago, house numbers in New York sometimes included ½, so I have to include this character in the desired_characters file:

@bertspaan the desired_characters file is not directly used for training. It is used at Google for building the large training text required for LSTM training.

I couldn't find any font which has ½ the way it is printed here, so it may be difficult to create a synthetic image for it.

@Shreeshrii
Collaborator

Shreeshrii commented Apr 27, 2017

Update: The LSTM training process has been modified since this post was written. These scripts will not work as-is; you can use them as a reference.

Here are the modified scripts:
boxtrain.zip
You will need to copy your box/tiff pairs to the ../langdata/eng/ directory for them to be used.

You cannot use the fine-tuning process because ½ is not included in the unicharset of the current LSTM traineddata for English. @theraysmith, will this change with your next update?

The following commands outline the process you may need to follow to do the LSTM training (replacing the top layer).


training/boxtrain.sh \
  --fonts_dir  /mnt/c/Windows/Fonts \
  --training_text ../langdata/eng/nyd.training_text \
  --langdata_dir ../langdata \
  --tessdata_dir ./tessdata \
  --lang eng  \
  --exposures "-2 -1 0" \
  --fontlist "Century Schoolbook" "Dejavu Serif" "Garamond" "Liberation Serif" "Times New Roman," "FreeSerif" "Georgia" \
  --output_dir ~/tesstutorial/nydlegacy
  
cp ~/tesstutorial/nydlegacy/eng.traineddata ./tessdata/nydlegacy.traineddata

training/boxtrain.sh \
  --fonts_dir  /mnt/c/Windows/Fonts \
  --training_text ../langdata/eng/nyd.training_text \
  --langdata_dir ../langdata \
  --tessdata_dir ./tessdata \
  --lang eng \
  --linedata_only \
  --noextract_font_properties \
  --exposures "-2 -1" \
  --fontlist "Bookman Old Style Semi-Light"  \
  --output_dir ~/tesstutorial/nyd
  
rm -rf ~/tesstutorial/eng_from_nyd
mkdir -p ~/tesstutorial/eng_from_nyd

combine_tessdata -e ../tessdata/eng.traineddata \
   ~/tesstutorial/eng_from_nyd/eng.lstm

lstmtraining  \
   -U ~/tesstutorial/nyd/eng.unicharset \
  --train_listfile ~/tesstutorial/nyd/eng.training_files.txt \
  --script_dir ../langdata   \
  --append_index 5 --net_spec '[Lfx256 O1c105]' \
  --continue_from ~/tesstutorial/eng_from_nyd/eng.lstm \
  --model_output ~/tesstutorial/eng_from_nyd/nyd \
  --debug_interval -1 \
  --target_error_rate 0.01
   
lstmtraining \
  --continue_from ~/tesstutorial/eng_from_nyd/nyd_checkpoint \
  --model_output ~/tesstutorial/eng_from_nyd/nyd.lstm \
  --stop_training

cp ../tessdata/eng.traineddata ~/tesstutorial/eng_from_nyd/nyd.traineddata
   
combine_tessdata -o ~/tesstutorial/eng_from_nyd/nyd.traineddata \
  ~/tesstutorial/eng_from_nyd/nyd.lstm \
  ~/tesstutorial/nyd/eng.lstm-number-dawg \
  ~/tesstutorial/nyd/eng.lstm-punc-dawg \
  ~/tesstutorial/nyd/eng.lstm-word-dawg 
 
cp ~/tesstutorial/eng_from_nyd/nyd.traineddata ./tessdata/nyd.traineddata
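
Once nyd.traineddata is in the tessdata directory, it can be used like any other language; for example (the image name is just illustrative):

tesseract Wilson1852_0.tif output --tessdata-dir ./tessdata -l nyd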

@bertspaan
Author

Thanks so much, I will try all this next week!

@Shreeshrii
Collaborator

@bertspaan
Author

@Shreeshrii: ha, that's my repository!

@amitdo
Collaborator

amitdo commented Apr 30, 2017

@bertspaan

I see that you have trained models for ocropy.

Is there anything you want to share about ocropy vs. Tesseract 4.00, accuracy wise, with your dataset?

@Shreeshrii
Collaborator

@bertspaan :-)

Since you already have an OCR process working, I suggest you wait for Ray to update the code for training from scanned images and improve the traineddata to support ½.

My hacked training is only a proof of concept (I trained till about 2% only), so while it recognizes ½ as %, other letters may not be as accurate as with the traineddata from the repo.

@bertspaan
Author

@amitdo: yes, we've trained ocropy on a very small number of sentences, and the results are already pretty good. See 1854-55.lines.ndjson.zip; this file contains all bounding boxes with ocropy output. However, ocropy sometimes crashes and its documentation is not very good, which is why we started experimenting with Tesseract 4 last week. I haven't compared the out-of-the-box output of Tesseract 4 with our trained ocropy model in detail.

@Shreeshrii: OK, I'll try some of the commands you've posted here, but I'm not going to spend much time trying to train Tesseract; I'll wait until training from scanned images is improved.

We are also building dictionaries of possible names, streets and professions, so we should be able to fix many OCR errors afterwards.

Thank you both so much for your help!

@minly

minly commented Aug 15, 2017

@Shreeshrii
I am also trying to fine-tune Tesseract 4.0 with images. I am confused by several of the parameters below.
First, what is the training_text (nyd.training_text) file? Do I need to create it? If yes, how do I create it?
Second, do I just need to specify --training_text and --output_dir while leaving the other parameters unchanged?

[image]

@Shreeshrii
Collaborator

Please see the wiki page on training; there have been changes to the LSTM training process.

@xiaochenFang

combine_lang_model, which takes as input an input_unicharset and script_dir (script_dir points to the langdata directory) and optional word list files...
I have the input_unicharset, but I don't know how to get script_dir.
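
For reference, script_dir normally just points at a local clone of the langdata repository; a minimal sketch (the paths and lang name are illustrative):

git clone https://github.com/tesseract-ocr/langdata.git ~/langdata
combine_lang_model \
  --input_unicharset my.unicharset \
  --script_dir ~/langdata \
  --output_dir ~/my_output \
  --lang eng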

@Shreeshrii
Collaborator

Shreeshrii commented Aug 16, 2017 via email

@Shreeshrii
Collaborator

@CoCa520 Also see #590 (comment)

@xiaochenFang

@Shreeshrii Thank you!
But I really can't understand how to create the lstmf files.
Can you show me the command?

I have tried:
tesseract eng.font.exp0.tif eng.font.exp0.box.lstm.train
But it gives:
Error during processing.
ObjectCache(0x7f098f0849a0)::~ObjectCache(): WARNING! LEAK! object 0x29173e0 still has count 1 (id /usr/local/tesseract/share/tessdata/eng.traineddatapunc-dawg)
ObjectCache(0x7f098f0849a0)::~ObjectCache(): WARNING! LEAK! object 0x2916420 still has count 1 (id /usr/local/tesseract/share/tessdata/eng.traineddataword-dawg)
ObjectCache(0x7f098f0849a0)::~ObjectCache(): WARNING! LEAK! object 0x2916240 still has count 1 (id /usr/local/tesseract/share/tessdata/eng.traineddatanumber-dawg)
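
For comparison, the training wiki generates .lstmf files by giving an output base name without extensions plus the lstm.train config; a minimal sketch using the file names above (the --psm value may need adjusting for your images):

tesseract eng.font.exp0.tif eng.font.exp0 --psm 6 lstm.train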

@Shreeshrii
Collaborator

Shreeshrii commented Aug 30, 2017 via email

@xiaochenFang

Training tutorial?
Do you mean https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 ?
But I just have tif/box pairs, so I came here for more information.

@Shreeshrii
Collaborator

Shreeshrii commented Aug 30, 2017 via email

@xiaochenFang

@Shreeshrii
I want to use Tesseract 4.00 to recognize machine model numbers. The model information is a combination of letters and digits located somewhere on a nameplate, so I have collected many pictures containing the nameplates of various machines.
After a series of processing steps, I have many pictures of model numbers, such as the following:
[image]
[image]
I then fed all the model pictures into Tesseract for recognition, but the accuracy is not very good, so I am trying to train Tesseract 4.00 with these model pictures.
The Tesseract 4.0 training tutorial says there are two ways to create training data, and I used the first option: each line in the box file matches a 'character' (glyph) in the tiff image.

If 4.0 training with tif/box pairs is not yet supported, what can I do to raise the accuracy?

@Shreeshrii
Collaborator

Shreeshrii commented Aug 31, 2017 via email

@Shreeshrii
Collaborator

Shreeshrii commented Aug 31, 2017 via email

@694376965

@minly, hello, I have the same problems as you. Have you resolved them?

@694376965

@CoCa520, hello, did you manage to generate the lstmf files in the end? I want to know how to generate lstmf files from *.tif and *.box files.

@dzjin

dzjin commented Nov 13, 2017

Bumping this..

I tried running the steps mentioned here: #841 (comment)

I'm getting this error:

ERROR: Non-existent flag -D
ERROR: /var/folders/vz/yqbfrgj91hqdj76mmpl2vjmw0000gn/T/tmp.W8q07ZtQ/eng/unicharset does not exist or is not readable

It does not create a .traineddata file, which is what I expected from the --linedata_only parameter.

I'll try to compare what edits @Shreeshrii put in place, but any guidance would be appreciated.

Edit:

boxtrain.zip

I've diffed the three files against the April versions, grabbed the "intent" of @Shreeshrii's edits, and applied it to the newest versions of the three files.

boxtrain/boxtrain.sh \
  --fonts_dir ~/Library/Fonts/ \
  --training_text ../langdata/eng/eng.training_text \
  --langdata_dir ../langdata \
  --tessdata_dir ./tessdata/ \
  --lang eng \
  --fontlist "Calibri" \
  --output_dir ./lstm1

While editing, I saw that @Shreeshrii was creating a folder named "${LANG_DATA_DIR}/GT" and copying TIF/BOX pairs in and out of it. I tried placing my TIF/BOX pairs in the "GT" folder, and to be honest I have no clarity on what the resulting .traineddata contains.

@Shreeshrii
Collaborator

@Shreeshrii was creating a folder named "${LANG_DATA_DIR}/GT", and copying TIF/BOX in/out from it. I tried placing my TIF/BOX pairs in the "GT" folder

GT was my ground-truth folder, into which I also copied the box/tiff pairs for future reference.

They are NOT used in further training, as LSTM training now uses the generated lstmf files and starter traineddata.
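
To see what a given traineddata file actually contains, combine_tessdata can list its components; a minimal sketch (the path is illustrative):

combine_tessdata -d ./tessdata/nyd.traineddata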

@Shreeshrii
Collaborator

@zdenop Please close this issue.

zdenop closed this as completed on Feb 25, 2019