Many countries speak Arabic; however, each country has its own dialect. The aim of this repo is to build a model that predicts the dialect of a given text.

MohamedNabill7/Arabic-Dialect-Prediction

Arabic-Dialect-Prediction

Many countries speak Arabic. However, each country has its own dialect. The aim of this repo is to build a model that predicts the dialect of a given text.

App

[Screenshot of the app]

Introduction

Computationally identifying the language of a given text is a cornerstone of many important NLP applications, such as machine translation and social media analysis.

The dataset consists of tweets written in a wide range of country-level Arabic dialects, covering 18 different countries in the Middle East and North Africa region. It initially has 2 columns: id and dialect.

Dataset Here

At a high level, the project is divided into 4 steps.

  • Data Fetching:

    • Fetches the tweets from an API using the given ids.
  • Data Pre-processing:

    • Cleaning
    • Normalization
    • Tokenization
    • Labelling
  • Modeling

    • In ML: LinearSVC and Multinomial Naive Bayes models, trained on CountVectorizer features transformed with TfidfTransformer.
    • In DL: an LSTM, because it handles text data with long sequences well. It took hours just to finish one epoch.
  • Deployment

    • Deploying the LinearSVC model using Flask as the back end and HTML as the front end.
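The pre-processing and ML modeling steps above can be sketched roughly as follows. This is a minimal illustration, not the repo's actual code: the cleaning and normalization rules in `preprocess`, the helper `build_pipeline`, the toy sentences and the labels `EG` and `LB` are all assumptions made for the example. Only CountVectorizer, TfidfTransformer and LinearSVC come from the description above.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical cleaning rules; the repo's real rules may differ.
URL_OR_MENTION = re.compile(r"https?://\S+|@\w+")
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")   # harakat and tatweel
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")      # anything but Arabic letters


def preprocess(text: str) -> str:
    """Clean and normalize one tweet (illustrative rules only)."""
    text = URL_OR_MENTION.sub(" ", text)        # cleaning: drop links and mentions
    text = DIACRITICS.sub("", text)             # normalization: strip diacritics
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)  # alef variants -> bare alef
    text = text.replace("\u0649", "\u064A")     # alef maqsura -> ya
    text = text.replace("\u0629", "\u0647")     # ta marbuta -> ha
    text = NON_ARABIC.sub(" ", text)            # drop punctuation, digits, Latin
    return " ".join(text.split())               # collapse whitespace


def build_pipeline() -> Pipeline:
    """CountVectorizer counts -> TF-IDF weights -> LinearSVC classifier."""
    return Pipeline([
        # Tokenization is left to CountVectorizer's default word tokenizer.
        ("counts", CountVectorizer(preprocessor=preprocess)),
        ("tfidf", TfidfTransformer()),
        ("clf", LinearSVC()),
    ])


# Toy placeholder data, not taken from the real dataset.
texts = ["ازيك عامل ايه", "عامل ايه النهارده", "كيفك شو الاخبار", "شو عم تعمل"]
labels = ["EG", "EG", "LB", "LB"]

model = build_pipeline().fit(texts, labels)
prediction = model.predict(["عامل ايه"])[0]
print(prediction)
```

Swapping `LinearSVC()` for `MultinomialNB()` from `sklearn.naive_bayes` would give the second ML model under the same features.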

Result

The macro-averaged F1 score (or macro F1 score) is computed by taking the arithmetic mean (aka unweighted mean) of all the per-class F1 scores. This method treats all classes equally regardless of their support values, which makes it a good choice when working with an imbalanced dataset where all classes are equally important.
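As a small worked example of this definition, the sketch below computes the macro F1 score by hand in plain Python. The function name `macro_f1` and the toy labels are made up for illustration; in practice scikit-learn's `f1_score(y_true, y_pred, average="macro")` does the same job.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative
    per_class = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        per_class.append(2 * tp[label] / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Imbalanced toy example: 8 tweets of one class, 2 of another (placeholder labels).
y_true = ["EG"] * 8 + ["LB"] * 2
y_pred = ["EG"] * 8 + ["EG", "LB"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.3f}, macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

Here the majority class has F1 = 16/17 and the minority class has F1 = 2/3, so the macro F1 (about 0.804) is noticeably lower than the accuracy (0.9). That gap is how the metric exposes weak performance on a rare class.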

       +--------------------------------+-----------+----------+
       | Model                          | 5-fold CV | F1 score |
       +--------------------------------+-----------+----------+
       | LinearSVC (dialect)            | with      | 0.883    |
       | MultinomialNB (dialect)        | with      | 0.765    |
       | LinearSVC (dialect)            | without   | 0.841    |
       | MultinomialNB (dialect)        | without   | 0.704    |
       | LinearSVC (region_dialect)     | with      | 0.788    |
       | MultinomialNB (region_dialect) | with      | 0.701    |
       | LinearSVC (region_dialect)     | without   | 0.802    |
       | MultinomialNB (region_dialect) | without   | 0.685    |
       | LSTM                           | ------    | 0.875    |
       +--------------------------------+-----------+----------+
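A 5-fold cross-validated macro F1 score of the kind reported in the table could be obtained along these lines. This is a sketch under assumptions: the ten toy sentences and the labels `EG` and `LB` are placeholders, and the repo's actual evaluation code may be organized differently.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy placeholder corpus: 5 examples per class so that 5 stratified folds exist.
texts = [
    "ازيك عامل ايه", "عامل ايه النهارده", "انا عايز اروح دلوقتي",
    "مش عارف اعمل ايه", "ايه الاخبار عندك",
    "كيفك شو الاخبار", "شو عم تعمل هلق", "بدي روح عالبيت",
    "ما بعرف شو بدي", "كيفك اليوم منيح",
]
labels = ["EG"] * 5 + ["LB"] * 5

model = make_pipeline(CountVectorizer(), TfidfTransformer(), LinearSVC())

# cv=5 uses stratified folds for classifiers; "f1_macro" averages per-class F1.
scores = cross_val_score(model, texts, labels, cv=5, scoring="f1_macro")
print("per-fold macro F1:", scores)
print("mean macro F1:", scores.mean())
```

Because the vectorizer sits inside the pipeline, its vocabulary is refit on each training fold, so no information from the held-out fold leaks into the features.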

How to Run

- Clone this repo to your local machine.
- Due to GitHub's maximum file size, the model could not be uploaded to the repo, but you can download it from the link below.
- Run the following command "pip install -r requirements.txt".
- Run the following command "python predictor_app.py".
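For orientation, a Flask back end like the one described under Deployment might look roughly like the sketch below. The contents of predictor_app.py are not shown here, so everything in this example is an assumption: the `/predict` route, the `text` form field, the JSON response and the stand-in `predict_dialect` function, which in a real app would call the downloaded LinearSVC model.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_dialect(text: str) -> str:
    """Hypothetical stand-in for the trained model; returns a placeholder label."""
    return "EG"


@app.route("/predict", methods=["POST"])
def predict():
    # The HTML front end is assumed to submit a form field named "text".
    text = request.form.get("text", "")
    if not text.strip():
        return jsonify(error="empty text"), 400
    return jsonify(dialect=predict_dialect(text))
```

Calling `app.run()` at the end of such a file would start Flask's development server, which is suitable for local testing only.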

Download Model from Here

Tools Used

  • Python 3.7
  • Jupyter-lab or Google Colab

Conclusions

  • Our model suffered from class imbalance. Dialect classification needs more data to work better.

Next Steps

  • Fine-tuning AraBERT
