Arabic is spoken in many countries, but each country has its own dialect. The aim of this repo is to build a model that predicts the dialect of a given text.
Computationally identifying the language of a given text is a cornerstone of many important NLP applications, such as machine translation and social media analysis.
The dataset is a collection of tweets covering a wide range of country-level Arabic dialects from 18 different countries in the Middle East and North Africa region. It initially consists of 2 columns: id and dialect.
Dataset Here
Data Fetching:
- Fetches the tweets from an API using the given ids.
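A minimal sketch of how the fetching step could be batched is shown below. The batch size, the JSON request/response shape, and the `post_batch` callable are all assumptions made for illustration; the real API call is replaced here by a hypothetical offline stand-in (`fake_post_batch`) so the example runs without network access.

```python
import json
from typing import Callable, Dict, Iterable, List


def chunked(items: List[str], size: int) -> Iterable[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_tweets(ids: List[str],
                 post_batch: Callable[[str], str],
                 batch_size: int = 1000) -> Dict[str, str]:
    """Return {id: text} by sending the ids to the API in batches.

    `post_batch` is a hypothetical callable: it receives a JSON list of ids
    and returns a JSON object mapping each id to its tweet text.
    """
    texts: Dict[str, str] = {}
    for batch in chunked(ids, batch_size):
        response = post_batch(json.dumps(batch))
        texts.update(json.loads(response))
    return texts


def fake_post_batch(payload: str) -> str:
    """Offline stand-in for the real API (assumption, for illustration only)."""
    return json.dumps({tweet_id: f"text for {tweet_id}"
                       for tweet_id in json.loads(payload)})


tweets = fetch_tweets(["1", "2", "3"], fake_post_batch, batch_size=2)
```

In practice `post_batch` would wrap an HTTP POST to the actual endpoint; keeping it injectable makes the batching logic easy to test on its own.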
Data Pre-processing:
- Cleaning
- Normalization
- Tokenization
- Labelling
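The four steps above can be sketched roughly as follows. The specific rules (which characters are stripped, which letter variants are unified, whitespace tokenization, integer label ids) are common choices for Arabic tweets and are assumptions here, not necessarily the exact rules used in this repo.

```python
import re

# Cleaning patterns: URLs and @mentions, diacritics (tashkeel) plus the
# tatweel elongation character, and anything that is not an Arabic letter.
URLS_MENTIONS = re.compile(r"https?://\S+|www\.\S+|@\w+")
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")

# Normalization table (assumed): unify alef variants, alef maqsura, ta marbuta.
NORMALIZE = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0649": "\u064A",  # alef maqsura -> ya
    "\u0629": "\u0647",  # ta marbuta -> ha
})


def clean(text):
    """Drop URLs, mentions, diacritics, and non-Arabic characters."""
    text = URLS_MENTIONS.sub(" ", text)
    text = DIACRITICS.sub("", text)
    return NON_ARABIC.sub(" ", text)


def normalize(text):
    """Map letter variants to a single canonical form."""
    return text.translate(NORMALIZE)


def tokenize(text):
    """Split on whitespace; empty tokens are discarded by str.split()."""
    return text.split()


def preprocess(text):
    return tokenize(normalize(clean(text)))


def encode_labels(labels):
    """Map each dialect name to an integer id, in sorted order."""
    mapping = {name: idx for idx, name in enumerate(sorted(set(labels)))}
    return mapping, [mapping[name] for name in labels]
```

Diacritics are removed before the non-Arabic filter so that a word is not split at the position of a stripped mark.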
Modeling
- In ML: Linear SVC and Multinomial Naive Bayes models, trained on CountVectorizer features transformed with TfidfTransformer.
- In DL: an LSTM, as it deals very well with text data that has long sequences. Training took hours to finish a single epoch.
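For the ML variant, a scikit-learn pipeline along the following lines would chain the three components named above. This is a sketch under assumptions: default hyperparameters are used, and the toy texts and the labels `A`/`B` are placeholders rather than real tweets or dialects.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def build_pipeline():
    """Token counts -> TF-IDF weighting -> linear SVM classifier."""
    return Pipeline([
        ("counts", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("clf", LinearSVC()),
    ])


# Placeholder corpus: five samples per class so that 5-fold CV is possible.
texts = [f"alpha sample{i}" for i in range(5)] + \
        [f"beta sample{i}" for i in range(5)]
labels = ["A"] * 5 + ["B"] * 5

# 5-fold cross-validation scored with macro-averaged F1.
scores = cross_val_score(build_pipeline(), texts, labels,
                         cv=5, scoring="f1_macro")

# Final model trained on all of the data.
model = build_pipeline().fit(texts, labels)
predictions = model.predict(["alpha new", "beta new"])
```

Swapping `LinearSVC()` for `MultinomialNB()` (from `sklearn.naive_bayes`) gives the second model with the same feature steps.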
Deployment
- Deploying the LinearSVC model using Flask as the back end and HTML as the front end.
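As an illustration of the back end, a Flask app could expose the trained model roughly like this. The `/predict` route, the JSON payload with a `text` field, and the `create_app` factory are hypothetical; the actual `predictor_app.py` serves an HTML front end and may be structured differently. `DummyModel` is a stand-in for the downloaded LinearSVC model.

```python
from flask import Flask, jsonify, request


def create_app(model):
    """Build a Flask app around any object with a scikit-learn style predict()."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json(silent=True)
        text = str(payload.get("text", "")) if isinstance(payload, dict) else ""
        if not text.strip():
            # Reject missing or blank input instead of calling the model.
            return jsonify(error="empty text"), 400
        dialect = model.predict([text])[0]
        return jsonify(dialect=str(dialect))

    return app


class DummyModel:
    """Stand-in for the trained model (assumption, for illustration only)."""

    def predict(self, texts):
        return ["EG" for _ in texts]


app = create_app(DummyModel())
# Calling app.run() would start Flask's development server.
```

Passing the model into a factory keeps the route logic independent of how the model file is loaded.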
The macro-averaged F1 score (or macro F1 score) is the arithmetic mean (i.e., the unweighted mean) of all the per-class F1 scores. This method treats all classes equally regardless of their support values, which makes it a good choice for an imbalanced dataset where all classes are equally important.
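To make the definition concrete, here is a small pure-Python sketch of the computation. The function names and the two-class example (labels `EG` and `LB`) are made up for illustration; in practice `sklearn.metrics.f1_score` with `average="macro"` computes the same quantity.

```python
def per_class_f1(y_true, y_pred):
    """Return {class: F1}, using F1 = 2*TP / (2*TP + FP + FN) per class."""
    scores = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Imbalanced toy example: 8 samples of one class, 2 of the other.
y_true = ["EG"] * 8 + ["LB"] * 2
y_pred = ["EG"] * 8 + ["EG", "LB"]
score = macro_f1(y_true, y_pred)
```

In this example the accuracy is 0.90, while the macro F1 is about 0.80, because the single error falls on the minority class and that class counts as much as the majority one.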
| Model | 5-fold CV | F1 Score |
|--------------------------------|-----------|----------|
| LinearSVC (dialect) | with | 0.883 |
| MultinomialNB (dialect) | with | 0.765 |
| LinearSVC (dialect) | without | 0.841 |
| MultinomialNB (dialect) | without | 0.704 |
| LinearSVC (region_dialect) | with | 0.788 |
| MultinomialNB (region_dialect) | with | 0.701 |
| LinearSVC (region_dialect) | without | 0.802 |
| MultinomialNB (region_dialect) | without | 0.685 |
| LSTM | ------ | 0.875 |
- Clone this repo to your local machine.
- Due to GitHub's maximum file size, the model could not be uploaded to the repo, but you can download it from the link below.
- Run the following command "pip install -r requirements.txt".
- Run the following command "python predictor_app.py".
Download Model from Here
- Python 3.7
- Jupyter-lab or Google Colab
- Our model suffered from class imbalance. Dialect classification needs more data to work better.
- Fine-tuning AraBERT