This is a prototype of language detection for short message services (Twitter), with 99.1% accuracy for 17 languages.
- Extract model directory: tar xf models/[select model archive]
- Detect: ldig.py -m [model directory] [text data file]
As input data, each tweet is one line of a text file, in the format below.
[label]\t[some metadata separated '\t']\t[text without '\t']
[label] is a language code such as en, de, fr and so on. Like the metadata, it is optional. (ldig doesn't use the metadata or the label for detection, of course :D)
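The input format described above can be produced with a few lines of Python. This is only a minimal sketch, not part of ldig: `clean` and `format_input_line` are hypothetical helper names, and writing an empty leading field when the label is unknown is an assumption about how an optional label is represented.

```python
import re


def clean(field):
    """Replace tabs and line breaks so a field cannot break the one-line, tab-separated layout."""
    return re.sub(r"[\t\r\n]+", " ", str(field)).strip()


def format_input_line(text, label="", metadata=()):
    """Build one line as [label]\\t[metadata...]\\t[text]."""
    # Assumption: an unknown label is written as an empty first field.
    fields = [clean(label)] + [clean(m) for m in metadata] + [clean(text)]
    return "\t".join(fields)


if __name__ == "__main__":
    # One labeled tweet with a single metadata column (a made-up id).
    print(format_input_line("Hello\tworld\nagain", label="en", metadata=["12345"]))
```

Cleaning the text field matters most, since a raw tab or newline inside a tweet would otherwise shift the columns or split the tweet across two lines.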
The output of ldig is in the format below.
[correct label]\t[detected label]\t[original metadata and text]
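Because each output line carries both the correct and the detected label, accuracy can be checked with a short script. The following is a sketch under assumptions, not part of ldig: `parse_output_line` and `accuracy` are hypothetical names, the output is assumed to be saved as tab-separated lines, and lines with an empty correct label are skipped.

```python
def parse_output_line(line):
    """Split one output line into (correct label, detected label, rest)."""
    parts = line.rstrip("\r\n").split("\t", 2)
    # Pad short lines so the caller always gets three fields.
    parts += [""] * (3 - len(parts))
    return tuple(parts)


def accuracy(lines):
    """Fraction of labeled lines whose detected label matches the correct one."""
    total = 0
    hits = 0
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        correct, detected, _ = parse_output_line(line)
        if not correct:
            continue  # unlabeled input cannot be scored
        total += 1
        if correct == detected:
            hits += 1
    return hits / total if total else 0.0


if __name__ == "__main__":
    # Made-up sample lines in the documented output layout.
    sample = [
        "en\ten\t123\thello world\n",
        "de\tnl\t456\tgoed\n",
        "\tfr\tbonjour\n",
    ]
    print(accuracy(sample))
```

To score a real run, the same function can be given an open file object instead of the sample list.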
ldig has an estimation tool.
./server.py -m [model directory]
Open http://localhost:48000 and enter the target text into the textarea. ldig then outputs the language probabilities and the feature parameters found in the text.
Supported languages:
- cs Czech
- da Danish
- de German
- en English
- es Spanish
- fi Finnish
- fr French
- id Indonesian
- it Italian
- nl Dutch
- no Norwegian
- pl Polish
- pt Portuguese
- ro Romanian
- sv Swedish
- tr Turkish
- vi Vietnamese
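When post-processing results, the codes listed above can be turned into readable names with a small lookup table. This is an illustrative sketch only; `LANGUAGES` and `language_name` are hypothetical names and not part of ldig.

```python
# The 17 language codes from the list above, mapped to their English names.
LANGUAGES = {
    "cs": "Czech",
    "da": "Danish",
    "de": "German",
    "en": "English",
    "es": "Spanish",
    "fi": "Finnish",
    "fr": "French",
    "id": "Indonesian",
    "it": "Italian",
    "nl": "Dutch",
    "no": "Norwegian",
    "pl": "Polish",
    "pt": "Portuguese",
    "ro": "Romanian",
    "sv": "Swedish",
    "tr": "Turkish",
    "vi": "Vietnamese",
}


def language_name(code):
    """Return the English name for a code, or the code itself if it is not listed."""
    return LANGUAGES.get(code, code)


if __name__ == "__main__":
    print(language_name("da"))
```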
- Blog Articles about ldig
- (c)2011-2012 Nakatani Shuyo / Cybozu Labs Inc. All rights reserved.
- All code and resources are available under the MIT License.