Language Model Training 🇧🇷

🦊 Pre-trained models on GitLab (LFS): https://gitlab.com/fb-resources/lm-br

Scripts to train n-gram language models in ARPA format, currently using only SRILM. Based on Kaldi scripts for LibriSpeech (local/lm/train_lm.sh).
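For readers unfamiliar with SRILM, the sketch below shows roughly what training an ARPA model with `ngram-count` looks like. The flags are real SRILM options, but the function name `train_arpa`, the file names, and the choice of smoothing are assumptions made for illustration; the options actually used by this repo's scripts may differ.

```shell
# Sketch only: not this repo's actual training command.
# train_arpa CORPUS OUT_LM [ORDER]
#   CORPUS: normalised text, one sentence per line (hypothetical file)
#   OUT_LM: path of the ARPA file to write (hypothetical file)
#   ORDER:  n-gram order, defaults to 3
train_arpa() {
  local corpus="$1" out_lm="$2" order="${3:-3}"
  # -kndiscount -interpolate: interpolated modified Kneser-Ney smoothing
  # -unk: keep an unknown-word token, i.e. build an open-vocabulary model
  ngram-count \
    -order "$order" \
    -text "$corpus" \
    -kndiscount -interpolate \
    -unk \
    -lm "$out_lm"
}

# Example call (commented out so sourcing this has no side effects):
# train_arpa corpus.txt lm.3gram.arpa 3
```

The resulting file is plain text, so its `\data\` header can be inspected with `head` to confirm how many n-grams of each order were estimated.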

It also generates a list of the top-N most frequent words, estimated from OSCAR.
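As a rough illustration of how such a frequency list can be produced, the sketch below counts tokens with standard Unix tools. It assumes the text is already normalised and that tokens are separated by whitespace; the function name `top_n_words` is hypothetical and this is not necessarily how the repo's scripts do it.

```shell
# top_n_words N FILE: print the N most frequent whitespace-separated
# tokens in FILE as "count token" lines, most frequent first.
top_n_words() {
  local n="$1" file="$2"
  tr -s '[:space:]' '\n' < "$file" |   # one token per line
    grep -v '^$' |                     # drop empty lines
    LC_ALL=C sort |                    # group identical tokens together
    uniq -c |                          # count each group
    LC_ALL=C sort -k1,1nr -k2,2 |      # by count (descending), then token
    head -n "$n" |
    awk '{print $1, $2}'               # strip the padding added by uniq
}

# Example call (hypothetical file name):
# top_n_words 1000 corpus.txt > top1000.txt
```

Using `sort` and `uniq` rather than an in-memory hash table keeps memory use bounded on multi-gigabyte corpora, since `sort` spills to temporary files.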

The demonstration uses five files from the first version (2019) of the OSCAR corpus. The raw data totals 8.0 GB; after normalisation, the clean data amounts to 6.25 GB.

Usage

⚠️ Remember that SRILM must be installed and added to $PATH beforehand.

$ git clone https://github.com/falabrasil/lm-br.git
$ cd lm-br
$ ./run.sh
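Since the scripts expect SRILM on $PATH, a quick pre-flight check before calling run.sh can save a failed run. The sketch below is a hypothetical helper, not part of the repo; it assumes the `ngram-count` and `ngram` binaries are the ones that matter.

```shell
# check_srilm: succeed only if the SRILM tools are reachable via $PATH.
check_srilm() {
  local tool
  for tool in ngram-count ngram; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "error: '$tool' not found in \$PATH; install SRILM first" >&2
      return 1
    fi
  done
  echo "SRILM found: $(command -v ngram-count)"
}

# Example: only start training when the check passes.
# check_srilm && ./run.sh
```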

Docker 🐳

⚠️ We won't be pushing this image to Docker Hub because SRILM's license might not allow it (??). Make sure to download SRILM yourself and place the *.tar.gz file under this repo's directory.
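Because the image build depends on that tarball being present, it may help to verify it before running `docker build`. The helper below is a hypothetical sketch: it only checks that some `*.tar.gz` file exists in the given directory and does not verify that it is really SRILM.

```shell
# has_srilm_tarball [DIR]: succeed if DIR (default: current directory)
# holds at least one *.tar.gz file.
has_srilm_tarball() {
  local dir="${1:-.}" f
  for f in "$dir"/*.tar.gz; do
    # An unmatched glob stays literal, so test that the file really exists.
    if [ -f "$f" ]; then
      echo "found: $f"
      return 0
    fi
  done
  echo "error: no *.tar.gz under $dir; download SRILM and place it there" >&2
  return 1
}

# Example, run from the repo root before building the image:
# has_srilm_tarball . || exit 1
```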

To build the Debian-slim-based image and check for SRILM's version, do the following:

$ cd lm-br  # from git clone
$ docker build -t falabrasil/lm-br:latest -f docker/Dockerfile .
$ docker run --rm -it falabrasil/lm-br:latest bash
$ ngram -version

The output should be as follows:

SRILM release 1.7.3 (with third-party contributions)
Built with GCC 11.1.0
and options -g -O3 

Program version @(#)$Id: ngram-count.cc,v 1.81 2019/09/09 23:13:13 stolcke Exp $

Support for compressed files is included.
Using OpenMP version 201511.

This software is subject to the SRILM Community Research License Version
1.0 (the "License"); you may not use this software except in compliance
with the License.  A copy of the License is included in the SRILM root
directory in the "License" file.  Software distributed under the License
is distributed on an "AS IS" basis, WITHOUT WARRANTY OF ANY KIND, either
express or implied.  See the License for the specific language governing
rights and limitations under the License.

This software is Copyright (c) 1995-2019 SRI International.  All rights
reserved.

Portions of this software are
Copyright (c) 2002-2005 Jeff Bilmes
Copyright (c) 2009-2013 Tanel Alumae
Copyright (c) 2011-2019 Andreas Stolcke
Copyright (c) 2012-2019 Microsoft Corp.

SRILM also includes open-source software as listed in the
ACKNOWLEDGEMENTS file in the SRILM root directory.

If this software was obtained under a commercial license agreement with
SRI then the provisions therein govern the use of the software and the
above notice does not apply.

Notes 📝

Beware: you'll probably need at least 32 GB of RAM on your machine. Setting aside some swap space is also advised.
Evaluation is performed on the Portuguese portion of Mozilla's Common Voice dataset.
On Debian-based systems, pkg-config, libaspell-dev, and libicu-dev are dependencies of some Python modules. Check the Dockerfile for the packages installed via apt-get.
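The memory recommendation in the notes above can be checked up front on Linux. The sketch below reads the `MemTotal` field from `/proc/meminfo`; the function names and the optional file argument are hypothetical, added here only to make the check easy to try out.

```shell
# mem_total_gib [MEMINFO]: print total RAM in whole GiB, read from a
# /proc/meminfo-style file (Linux only). MemTotal is reported in kB.
mem_total_gib() {
  awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' "${1:-/proc/meminfo}"
}

# check_ram [MIN_GIB] [MEMINFO]: warn if total RAM is below MIN_GIB
# (default 32, following the note above).
check_ram() {
  local min="${1:-32}" total
  total="$(mem_total_gib "${2:-/proc/meminfo}")"
  if [ "$total" -lt "$min" ]; then
    echo "warning: ${total} GiB RAM detected, at least ${min} GiB advised" >&2
    return 1
  fi
  echo "ok: ${total} GiB RAM"
}

# Example:
# check_ram 32 || echo "consider adding swap space"
```

Note that the kernel reports slightly less than the installed amount, so a machine sold as 32 GB may show up as 31 GiB here; lowering the threshold by one or two is a reasonable adjustment.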

Grupo FalaBrasil (2021) - https://ufpafalabrasil.gitlab.io/
Universidade Federal do Pará (UFPA) - https://portal.ufpa.br/
Cassio Batista - https://cassota.gitlab.io/
