The OCaml ecosystem currently has no language identification package comparable to langid in Python. As two students interested in NLP and translation, we implement langid for OCaml.
Given a string, the library detects the language in which it is written and outputs the model's confidence for the most probable (or top-n most probable) languages. We use this library in a fun command-line game that pits the user's language-guessing abilities against the LangID model.
To stay consistent with existing open-source packages, and to make our package readily usable by those who may be familiar with the popular Python implementation, we model our interface on that library.
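The top-n output described above can be illustrated with a minimal sketch. It assumes the model yields one log-score per language and that confidences are obtained with a softmax; both are assumptions made for illustration, and the names `softmax`, `take`, and `top_n` are hypothetical rather than the package's actual API.

```ocaml
(* Hypothetical sketch: rank languages by confidence from per-language
   log-scores. None of these names are taken from the package itself. *)

(* Convert raw log-scores into probabilities with a numerically stable
   softmax (subtract the maximum before exponentiating). *)
let softmax (scores : float list) : float list =
  let m = List.fold_left max neg_infinity scores in
  let exps = List.map (fun s -> exp (s -. m)) scores in
  let total = List.fold_left ( +. ) 0.0 exps in
  List.map (fun e -> e /. total) exps

(* Return the first n elements of a list, or all of them if it is shorter. *)
let rec take n = function
  | x :: rest when n > 0 -> x :: take (n - 1) rest
  | _ -> []

(* Pair each language code with its confidence, sort by descending
   confidence, and keep the n most probable. *)
let top_n (n : int) (langs : string list) (log_scores : float list)
    : (string * float) list =
  List.combine langs (softmax log_scores)
  |> List.sort (fun (_, a) (_, b) -> compare b a)
  |> take n

let () =
  (* Made-up scores for three languages, purely for illustration. *)
  top_n 2 [ "en"; "fr"; "de" ] [ -1.0; -3.0; -2.0 ]
  |> List.iter (fun (lang, p) -> Printf.printf "%s %.3f\n" lang p)
```

Sorting the full list keeps the sketch short; for a large label set, a partial selection would avoid sorting every language.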
models/fst_feature_model_info.json
models/nb_ptc.npy
models/nb_pc.npy
Run:
dune build
opam install .
NOTE FOR MAC USERS: You might need to run brew install ssl and brew install open-blas. Then follow the export instructions to get the dependencies to install.
- Replace the variable working_dir_path in tests/tests.ml with your (absolute) working path. Ex:
  let working_dir_path = "/Users/daniel/Documents/ocaml-langid"
- Run dune test
dune build will build the executable langid.exe at _build/default/src/langid.exe. This can be run either for simple text evaluation of a string or for our langid game.
./_build/default/src/langid.exe -mode <string> -top_n <int> -input <string> [-h;-help]
Arguments:
- -mode: either game or eval. Defaults to eval if the flag is not provided.
- -top_n: number of predicted languages to output. Defaults to 3 if not provided.
- -input: in "eval" mode, the string whose language is predicted. Defaults to "" if not provided.
./_build/default/src/langid.exe -mode eval -top_n 3 -input "Earth is beautiful with a bright blue sky and green trees"
./_build/default/src/langid.exe -mode game
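For readers curious how flags like these can be handled, here is a minimal sketch using OCaml's standard Arg module. The flag names and defaults come from the argument list above; the `parse` helper, the help strings, and the error handling are hypothetical and may differ from what langid.exe actually does.

```ocaml
(* Hypothetical sketch of the CLI flags using the standard Arg module.
   Flag names and defaults follow the README; the rest is illustrative. *)
let mode = ref "eval"
let top_n = ref 3
let input = ref ""

let spec =
  [ ("-mode", Arg.Symbol ([ "game"; "eval" ], (fun s -> mode := s)),
     " run the guessing game or evaluate a string (default: eval)");
    ("-top_n", Arg.Set_int top_n,
     " number of predicted languages to output (default: 3)");
    ("-input", Arg.Set_string input,
     " string whose language is predicted in eval mode (default: \"\")") ]

let usage = "langid.exe -mode <string> -top_n <int> -input <string>"

(* Parse an explicit argv so the function can be exercised without a shell.
   Defaults are reset on every call; positional arguments are rejected. *)
let parse (argv : string array) : string * int * string =
  mode := "eval"; top_n := 3; input := "";
  Arg.parse_argv ~current:(ref 0) argv spec
    (fun anon -> raise (Arg.Bad ("unexpected argument: " ^ anon)))
    usage;
  (!mode, !top_n, !input)

let () =
  match parse Sys.argv with
  | m, n, s -> Printf.printf "mode=%s top_n=%d input=%S\n" m n s
  | exception Arg.Help msg -> print_string msg
  | exception Arg.Bad msg -> prerr_string msg; exit 2
```

Using `Arg.Symbol` for -mode rejects any value other than game or eval at parse time, so the rest of the program does not need to validate it again.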
The sampler that generates sentences for the game draws random sentences from Wikipedia in any of the model's target languages. We would prefer to generate a random English sentence and translate it into a target language, then output the translation and run model inference on it. The Google Translate API requires authentication, which is not ideal for a program of our style, and we encountered issues with other open-source APIs. Generating the random sentence is also a hurdle, as calls to generative models often require authentication as well. We're actively looking for better options.
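As a rough illustration of the game flow, the sketch below plays one round against a small in-memory pool of sentences. The pool stands in for the Wikipedia sampler, and the names `samples`, `draw`, and `score` are hypothetical; the real game reads the user's guess interactively and takes the model's guess from the classifier.

```ocaml
(* Hypothetical sketch of one game round. The sample pool is a small
   in-memory list standing in for sentences drawn from Wikipedia. *)
let samples =
  [ ("en", "The sky is blue.");
    ("fr", "Le ciel est bleu.");
    ("es", "El cielo es azul.") ]

(* Draw one (language, sentence) pair uniformly at random. *)
let draw () = List.nth samples (Random.int (List.length samples))

(* Score a round: one point to each side that names the true language. *)
let score ~truth ~user_guess ~model_guess =
  ( (if user_guess = truth then 1 else 0),
    (if model_guess = truth then 1 else 0) )

let () =
  Random.self_init ();
  let truth, sentence = draw () in
  Printf.printf "Which language is this? %s\n" sentence;
  (* Fixed guesses for illustration only: the user always answers "en"
     and the model is assumed to be correct. *)
  let user_pts, model_pts = score ~truth ~user_guess:"en" ~model_guess:truth in
  Printf.printf "user: %d, model: %d\n" user_pts model_pts
```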