
Build n-gram language model for DeepSpeech2, and add inference interfaces insertable to CTC decoder. #2229

Closed
xinghai-sun opened this issue May 22, 2017 · 3 comments
xinghai-sun commented May 22, 2017

  • Train an English language model (Kneser-Ney smoothed 5-gram, with pruning) with the KenLM toolkit, on cleaned text from the Common Crawl Repository. For detailed requirements, please refer to the DS2 paper.
  • Add the training script into the DS2 trainer script.
  • Add inference interfaces for this n-gram language model, insertable to CTC-LM-beam-search for decoding.
  • Keep in mind that the interfaces should be compatible with both English (word-based LM) and Mandarin (character-based LM).
  • Please work closely with the "Add CTC-LM-beam-search decoder" task.
  • Refer to the DS2 design doc and update it when necessary.
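The interface requirement above (a scorer that plugs into the CTC beam-search decoder and works for both word-based and character-based LMs) might be sketched like this. Note this is an illustrative design, not code from the DS2 repo; the names `LMScorer` and `UniformScorer` are hypothetical:

```python
import math
from abc import ABC, abstractmethod


class LMScorer(ABC):
    """Hypothetical interface for plugging an n-gram LM into CTC beam search.

    The scorer operates on a prefix of tokens, so a word-based English LM
    and a character-based Mandarin LM can share the same decoder code: the
    decoder only decides how to tokenize (split on spaces vs. characters).
    """

    @abstractmethod
    def score(self, prefix):
        """Log-probability of the last token in `prefix` given its history."""


class UniformScorer(LMScorer):
    """Trivial stand-in: every token gets the same log-probability.

    Useful as a no-op baseline when wiring up the beam-search decoder
    before the real KenLM-backed scorer is ready.
    """

    def __init__(self, vocab_size):
        self.logp = -math.log(vocab_size)

    def score(self, prefix):
        return self.logp
```

A real implementation would wrap a loaded KenLM model behind the same `score` method, so the decoder never needs to know which LM is behind it.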

pkuyym commented May 31, 2017

@cxwangyi @kuke @xinghai-sun
Hi, as mentioned in the paper, a language model has to be trained to improve the generating results and the LM is a critical component to ensure the performance. The language model is trained on texts crawled from commoncrawl.org using KenLM toolkit. However, we need more details to train such a language model. Any possible to get the trained language model or text dataset trained on?

@xinghai-sun
Contributor Author

  1. There should be plenty of English corpora available; we need not restrict ourselves to the corpus mentioned in the paper. We can start experimenting with a small corpus, e.g. PTB.
  2. For n-gram LM training, prefer the KenLM toolkit. If another tool is used, make sure the smoothing method is aligned with KenLM's, or is otherwise reasonable.
  3. Focus on the design of the model-loading and inference interfaces, and coordinate integration testing with the beam search decoder.
  4. Ask the NLP or SVAIL teams whether a powerful off-the-shelf LM model already exists, for both English and Chinese. Please ask @lcy-seso to assist.
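The beam-search integration mentioned in point 3 amounts to shallow fusion: rescoring each candidate transcription with a weighted sum of the acoustic score, the LM score, and a length bonus, as in the DS2 decoding objective Q(y) = log P_ctc(y|x) + alpha * log P_lm(y) + beta * word_count(y). A minimal sketch (alpha/beta values here are placeholders; DS2 tunes them on a dev set):

```python
def fuse_scores(candidates, lm_logprob, alpha=0.5, beta=0.3):
    """Pick the best candidate under the shallow-fusion objective.

    candidates: list of (token_sequence, acoustic_log_prob) pairs
                produced by the CTC beam search.
    lm_logprob: callable returning the total LM log-probability of a
                token sequence (e.g. backed by a KenLM model).
    """
    scored = [
        (tokens, am_logp + alpha * lm_logprob(tokens) + beta * len(tokens))
        for tokens, am_logp in candidates
    ]
    return max(scored, key=lambda t: t[1])
```

With a good LM, a candidate with a slightly worse acoustic score but a much better LM score (e.g. "hello world" vs. "helo world") wins the rescoring, which is exactly the effect the task above is after.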

heavengate pushed a commit to heavengate/Paddle that referenced this issue Aug 16, 2021
* update faster modelzoo and config, test=dygraph

* update model link, test=dygraph
wwfcnu commented Apr 30, 2024

  1. There should be plenty of English corpora available; we need not restrict ourselves to the corpus mentioned in the paper. We can start experimenting with a small corpus, e.g. PTB.
  2. For n-gram LM training, prefer the KenLM toolkit. If another tool is used, make sure the smoothing method is aligned with KenLM's, or is otherwise reasonable.
  3. Focus on the design of the model-loading and inference interfaces, and coordinate integration testing with the beam search decoder.
  4. Ask the NLP or SVAIL teams whether a powerful off-the-shelf LM model already exists, for both English and Chinese. Please ask @lcy-seso to assist.

@lcy-seso Is there a ready-to-use LM model available?
