Adding more languages for recognition models by default #563

Closed
Tracked by #791
felixdittrich92 opened this issue Oct 30, 2021 · 8 comments
Labels: module: models, topic: text recognition, type: enhancement

@felixdittrich92
Contributor

🚀 The feature

Currently only French is supported by default. It would be great to be able to choose additional languages directly, for example:
model = ocr_predictor(det_arch='db_resnet50', reco_arch='crnn_vgg16_bn', pretrained=True, language='en')
or
reco_arch = crnn_mobilenet_v3_large(pretrained=True, language='de')
What do you think?

Motivation, pitch

In most cases you need recognition for a specific language. You can get this by training a model yourself, but it would be much easier if some commonly used languages were available without any training of your own.

Alternatives

Some other ideas:

  • auto-detect the language and choose the right model, if one is provided
  • a multilingual model
  • or allow choosing multiple models, like languages=['en', 'fr', 'de']
    Adding ViTSTR #513 (I think I will finish this on my side by the end of the year, which will then provide de, en, es, fr in one model, but there are currently no benchmarks)

Additional context

If you want, I can train all currently existing models in PyTorch for English and German.

@fg-mindee fg-mindee self-assigned this Oct 30, 2021
@fg-mindee fg-mindee added the topic: text recognition Related to the task of text recognition label Oct 30, 2021
@fg-mindee
Contributor

Hi @felixdittrich92,

Just to be clear, we support the French "vocab", not the French language. For now, the library does not include any semantic understanding. We chose the French vocab because the accented characters it includes usually cover most European characters.

Typically, there is no character in the English vocab that is not included in our "French" vocab.
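To illustrate the vocab point above, here is a minimal sketch of a coverage check between two vocabs. The two vocab strings are simplified assumptions made for this example, not the library's exact definitions, and `missing_chars` is a hypothetical helper.

```python
import string

# Illustrative, simplified vocabs (assumption: NOT the library's exact definitions).
ENGLISH_VOCAB = string.ascii_letters + string.digits + string.punctuation
FRENCH_VOCAB = ENGLISH_VOCAB + "àâéèêëîïôùûüçÀÂÉÈÊËÎÏÔÙÛÜÇ"


def missing_chars(candidate: str, reference: str) -> set[str]:
    """Return the characters of `candidate` that are absent from `reference`."""
    return set(candidate) - set(reference)


# Every character of the English vocab is also in the "French" vocab.
print(missing_chars(ENGLISH_VOCAB, FRENCH_VOCAB))  # set()

# A German word can still contain characters outside the "French" vocab.
print(sorted(missing_chars("Straße", FRENCH_VOCAB)))  # ['ß']
```

The same check could be run on a dataset's labels before training, to see whether an existing vocab is sufficient.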

Now, before switching to other vocabs, we first have to stabilize on this one. Eventually, the text recognition part will select the appropriate vocab & checkpoint depending on what the user wants. Here are a few options:

  • using a text recognition model with a vocab that includes all characters of all languages. It makes it universal, but it's certainly harder to train, may perform worse on common vocabs, and it will be slower/heavier.
  • carefully designing the available vocabs, proposing checkpoints for popular vocabs, and letting users submit theirs for alternative vocabs

I would argue the second option looks much better 😅 If the text recognition model encounters characters it has never seen, it will yield a very low confidence, which can easily be handled downstream.
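As a rough sketch of how low-confidence predictions could be handled, the snippet below splits recognized words by a confidence threshold. The `Word` type, the sample values, and the 0.5 threshold are all assumptions for illustration; they are not taken from the library's output format.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """Hypothetical container for one recognized word."""
    value: str
    confidence: float


def split_by_confidence(words, threshold=0.5):
    """Separate words into (trusted, flagged) using an arbitrary threshold."""
    trusted = [w for w in words if w.confidence >= threshold]
    flagged = [w for w in words if w.confidence < threshold]
    return trusted, flagged


# Made-up predictions: the middle word contains a character unknown to the model.
words = [Word("Hello", 0.98), Word("Stra?e", 0.21), Word("world", 0.95)]
trusted, flagged = split_by_confidence(words)
print([w.value for w in trusted])  # ['Hello', 'world']
print([w.value for w in flagged])  # ['Stra?e']
```

Flagged words could then be sent for manual review or to a model trained on a different vocab.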

In any case, this won't be handled before the 0.6.0 release or later as rotation & handwritten text are more critical to handle for now! (in the meantime, we will draft a process for contributors to submit their trained model on a given vocab though)!

I hope that answers your question :)

@felixdittrich92
Contributor Author

@fg-mindee
closed with #576

@fg-mindee
Contributor

I may be mixing things up, but I don't feel like #576 is related to this issue? 🤔

@felixdittrich92
Contributor Author

@fg-mindee
Hm, yes, after rethinking it is better to keep both issues. The idea was more like this:
add pretrained weights for all existing models before taking care of different vocabs. But you are right, those are two different problems.

@fg-mindee fg-mindee added this to the 0.6.0 milestone Dec 10, 2021
@fg-mindee fg-mindee added the module: models Related to doctr.models label Dec 10, 2021
@felixdittrich92
Contributor Author

@frgfm I think the Hugging Face integration (sharing models) is enough, and it is maybe better to improve the word generator instead of pursuing this issue/idea. Wdyt?

@frgfm
Collaborator

frgfm commented May 7, 2022

Fair point, but I would argue these are different topics:

  • the word generator is currently generating random character strings
  • it could be used to generate very specific words
  • being able to generate the data and then having someone train models in dozens of languages is tricky
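For context, the difference between random character strings and specific words can be sketched as follows. This is a simplified stand-in, not the library's word generator; the function names, parameters, and sample lexicon are hypothetical.

```python
import random


def random_words(vocab, num_words, min_chars=1, max_chars=8, seed=None):
    """Generate words made of characters drawn uniformly at random from `vocab`."""
    rng = random.Random(seed)
    return [
        "".join(rng.choices(vocab, k=rng.randint(min_chars, max_chars)))
        for _ in range(num_words)
    ]


def lexicon_words(lexicon, num_words, seed=None):
    """Sample real words from a user-provided lexicon instead of random strings."""
    rng = random.Random(seed)
    return [rng.choice(lexicon) for _ in range(num_words)]


# Random strings over a tiny vocab versus words from a made-up German lexicon.
print(random_words("abcdeé", num_words=3, min_chars=2, max_chars=5, seed=0))
print(lexicon_words(["Straße", "Haus", "Brücke"], num_words=3, seed=0))
```

Random strings cover the vocab evenly but carry no language statistics, whereas a lexicon reflects real words at the cost of needing one per language.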

To close this issue, I think we should decide whether it's a feature design issue (how multiple vocab models should be accessed as pretrained models) or a wider question. Now that we can use HF hub models, I think we'll only have to change the model name to switch to another language. So if that's not about the design part, I'd argue this has been addressed :)

@felixdittrich92
Contributor Author

@frgfm
I think the point would be to provide some data (via the word generator, once it is more robust and useful) so that users can train on different vocabs and share their models via the HF hub. We could do something like a pinned community call: grab a model, provide some information on what to do and how, and in the end add the models to the list that was added with #896 (once we are stable on some other decisions like sar/master onnx 😅).
But generally it is enough to be able to grab a trained model from the list because, as you said, it is not possible to cover every use case. That is why I would rather walk back the language='xyz' proposal.
In my opinion we can close this and maybe open another issue to iterate on the word generator.
Wdyt? :)

@frgfm
Collaborator

frgfm commented May 16, 2022

I agree :)

But we could easily provide some HF Hub contribution guidelines for language-specific models (i.e. how you should name your model so that people can find and use it). For instance, "mindee/crnn_vgg16_bn_french" could easily be on the hub.
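One possible reading of such a naming guideline, sketched under the assumption that a model name is simply `<architecture>_<vocab>` under a namespace. Both helpers are hypothetical and only mirror the "mindee/crnn_vgg16_bn_french" example above; no such convention is confirmed in this thread.

```python
def hub_repo_id(arch: str, vocab: str, namespace: str = "mindee") -> str:
    """Build a repo id following the assumed '<namespace>/<arch>_<vocab>' pattern."""
    return f"{namespace}/{arch}_{vocab}"


def parse_repo_id(repo_id: str, known_archs):
    """Split a repo id back into (namespace, arch, vocab), given known architectures."""
    namespace, name = repo_id.split("/", 1)
    # Try longer architecture names first so a shorter prefix cannot match by accident.
    for arch in sorted(known_archs, key=len, reverse=True):
        if name.startswith(arch + "_"):
            return namespace, arch, name[len(arch) + 1:]
    raise ValueError(f"no known architecture matches {repo_id!r}")


print(hub_repo_id("crnn_vgg16_bn", "french"))  # mindee/crnn_vgg16_bn_french
print(parse_repo_id("mindee/crnn_vgg16_bn_french", ["crnn_vgg16_bn", "crnn_mobilenet_v3_large"]))
```

With a predictable pattern like this, switching language would only mean changing the vocab part of the model name.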
