Dataset and benchmarks for Language identification and content detection for low-resource languages

bdu-birhanu/LID_TCD

LID_TCD

Dataset for Ethiopian language identification and topic classification

  • This dataset consists of 22,624 texts labeled for two tasks:

    - Language identification: identify the language in which a given text is written.
    - Topic classification: classify a given text by topic according to its content.
    

To run the code from a terminal, use the following commands.

# Load and Pre-process data
python preprocess.py

# Train
python train.py

# Test and results
python test.py
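
The three commands above are meant to run in order, and a later step is only useful if the earlier one succeeded. A minimal sketch of a wrapper that does this is shown below. The script names come from this README; the helper names `build_commands` and `run_pipeline` are hypothetical and not part of the repository, and the sketch assumes the scripts take no arguments and are run from the repository root.

```python
import subprocess
import sys

# Script names as listed in this README, in the order they should run.
STEPS = ["preprocess.py", "train.py", "test.py"]


def build_commands(python=sys.executable, steps=STEPS):
    """Return one command (an argument list) per pipeline step."""
    return [[python, step] for step in steps]


def run_pipeline(commands, runner=subprocess.check_call):
    """Run each command in order.

    subprocess.check_call raises CalledProcessError on a non-zero exit code,
    so a failing step stops the pipeline before the next one starts.
    """
    for cmd in commands:
        print("Running: {}".format(" ".join(cmd)))
        runner(cmd)
```

Calling `run_pipeline(build_commands())` from the repository root would run preprocessing, training, and testing with the current Python interpreter. The code avoids f-strings so that it stays compatible with Python 3.5.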

Some things to know

  1. The code was tested in the following environment:
    • Python 3.5.2
    • Keras 2.3.1
    • tensorflow 2.1.0
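
Because the code was tested against specific versions, it may help to compare the local environment with the list above before running anything. The following is a sketch only: the version numbers are copied from this README, the helper names `installed_versions` and `mismatches` are hypothetical, and it assumes that `keras` and `tensorflow` expose a `__version__` attribute, as both packages do.

```python
import sys

# Versions listed in this README as the tested environment.
EXPECTED = {"python": "3.5.2", "keras": "2.3.1", "tensorflow": "2.1.0"}


def installed_versions():
    """Collect the local Python, Keras, and TensorFlow versions (None if missing)."""
    found = {"python": "{}.{}.{}".format(*sys.version_info[:3])}
    for name in ("keras", "tensorflow"):
        try:
            module = __import__(name)
            found[name] = getattr(module, "__version__", None)
        except ImportError:
            found[name] = None
    return found


def mismatches(found, expected=EXPECTED):
    """Return {name: (found_version, expected_version)} for every difference."""
    return {
        name: (found.get(name), want)
        for name, want in expected.items()
        if found.get(name) != want
    }
```

Calling `mismatches(installed_versions())` returns an empty dictionary when the local environment matches the tested one exactly. A non-empty result does not necessarily mean the code will fail, only that the setup differs from the one reported here.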
