📦 Pre-trained models on GitLab (LFS): https://gitlab.com/fb-resources/lm-br
Scripts to train n-gram language models in ARPA format, currently using only SRILM. Based on Kaldi scripts for LibriSpeech (local/lm/train_lm.sh).
It also generates a list of the top-N most frequent words, estimated from OSCAR.
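As an illustration of what a top-N word list involves, here is a minimal sketch in Python. It is not the repository's own implementation: the function name `top_n_words` and the whitespace tokenisation are assumptions made for the example, and the input is assumed to be already-normalised text, one sentence per line.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_n_words(lines: Iterable[str], n: int) -> List[Tuple[str, int]]:
    """Return the n most frequent whitespace-separated tokens with their counts.

    Hypothetical helper for illustration only; not part of lm-br.
    """
    counts: Counter = Counter()
    for line in lines:
        # Assumption: text is already normalised, so splitting on whitespace is enough.
        counts.update(line.split())
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]


if __name__ == "__main__":
    sample = ["o gato viu o rato", "o rato fugiu"]
    for word, count in top_n_words(sample, 2):
        print(word, count)
```

For a corpus of several gigabytes, passing an open file handle as `lines` keeps memory use proportional to the vocabulary size rather than to the corpus size.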
The demonstration uses five files from the first version (2019) of the OSCAR corpus. The raw data totals 8.0 GB; after normalisation, the clean data is 6.25 GB.
$PATH
beforehand.
$ git clone https://github.com/falabrasil/lm-br.git
$ cd lm-br
$ ./run.sh
*.tar.gz
file under this repo's dir.
To build the Debian-slim-based image and check for SRILM's version, do the following:
$ cd lm-br # from git clone
$ docker build -t falabrasil/lm-br:latest -f docker/Dockerfile .
$ docker run --rm -it falabrasil/lm-br:latest bash
$ ngram -version
The output should be as follows:
SRILM release 1.7.3 (with third-party contributions)
Built with GCC 11.1.0
and options -g -O3
Program version @(#)$Id: ngram-count.cc,v 1.81 2019/09/09 23:13:13 stolcke Exp $
Support for compressed files is included.
Using OpenMP version 201511.
This software is subject to the SRILM Community Research License Version
1.0 (the "License"); you may not use this software except in compliance
with the License. A copy of the License is included in the SRILM root
directory in the "License" file. Software distributed under the License
is distributed on an "AS IS" basis, WITHOUT WARRANTY OF ANY KIND, either
express or implied. See the License for the specific language governing
rights and limitations under the License.
This software is Copyright (c) 1995-2019 SRI International. All rights
reserved.
Portions of this software are
Copyright (c) 2002-2005 Jeff Bilmes
Copyright (c) 2009-2013 Tanel Alumae
Copyright (c) 2011-2019 Andreas Stolcke
Copyright (c) 2012-2019 Microsoft Corp.
SRILM also includes open-source software as listed in the
ACKNOWLEDGEMENTS file in the SRILM root directory.
If this software was obtained under a commercial license agreement with
SRI then the provisions therein govern the use of the software and the
above notice does not apply.
- Beware: you'll probably need at least 32 GB of RAM on your machine. Setting aside some swap space is also advised.
- Evaluation is performed on the Portuguese portion of Mozilla's Common Voice dataset.
- On Debian-based OSes, pkg-config, libaspell-dev, and libicu-dev are dependencies of some Python modules. Check the Dockerfile to see the packages downloaded via apt-get.
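The notes above mention evaluation on Common Voice without naming the metric. Assuming it is perplexity, as reported by SRILM's `ngram -ppl`, the sketch below shows how that figure is derived from the total base-10 log probability. The function name `perplexity` and its parameter names are hypothetical; the formula follows SRILM's documented convention of excluding OOV words and counting one end-of-sentence token per sentence.

```python
def perplexity(logprob10: float, words: int, sentences: int, oovs: int = 0) -> float:
    """Perplexity from a total base-10 log probability, SRILM-style.

    Hypothetical helper for illustration only; not part of lm-br.
    """
    # Scored tokens: in-vocabulary words plus one end-of-sentence token per sentence.
    tokens = words - oovs + sentences
    if tokens <= 0:
        raise ValueError("no scored tokens")
    return 10.0 ** (-logprob10 / tokens)


if __name__ == "__main__":
    # Toy numbers, not results from this repository.
    print(perplexity(logprob10=-6.0, words=2, sentences=1))
```

Lower values mean the model assigns higher probability to the held-out text.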
Grupo FalaBrasil (2021) - https://ufpafalabrasil.gitlab.io/
Universidade Federal do Pará (UFPA) - https://portal.ufpa.br/
Cassio Batista - https://cassota.gitlab.io/