This project performs Vietnamese text classification using the VNews8td dataset. Both the dataset and this repository are designed for document classification tasks.
The VNews8td dataset is a Vietnamese text classification dataset collected from VnExpress from 01/06/2023 to 01/06/2024. It contains articles categorized into eight classes, with each article including a title and a description.
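- Classes:
  - doisong (Đời sống, Lifestyle)
  - giaoduc (Giáo dục, Education)
  - khoahoc (Khoa học, Science)
  - kinhte (Kinh tế, Economy)
  - suckhoe (Sức khỏe, Health)
  - thegioi (Thế giới, World)
  - thethao (Thể thao, Sports)
  - thoisu (Thời sự, Current affairs)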
- Splits:
  - Training Set: 70%
  - Validation Set: 10%
  - Test Set: 20%
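
A 70/10/20 split like the one listed above can be sketched with the Python standard library alone. This is only an illustration, not the repository's actual splitting code: the helper name `split_dataset`, the fixed seed, and the toy records are assumptions, and the real split may well be stratified by class.

```python
import random

def split_dataset(records, train_frac=0.7, val_frac=0.1, seed=42):
    """Shuffle records and cut them into train/validation/test lists.

    Hypothetical helper: whatever remains after the train and validation
    cuts becomes the test set (20% with the default fractions).
    """
    shuffled = list(records)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Toy stand-ins for (title, description, label) rows; not real VNews8td data.
records = [(f"title {i}", f"description {i}", "thethao") for i in range(100)]
train, val, test = split_dataset(records)
print(len(train), len(val), len(test))
```

A plain shuffle keeps the sketch short; with eight classes of unequal size, a stratified split would keep the class proportions the same in all three sets.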
This project implements a text classification pipeline for Vietnamese documents using the VNews8td dataset. The primary steps include:
- Preprocessing Vietnamese text using VnCoreNLP.
- Extracting features with TF-IDF.
- Training models such as Logistic Regression for classification.
- Evaluating model performance with metrics like accuracy and classification reports.
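
The feature extraction and training steps above might look roughly like the following scikit-learn sketch. It is not taken from the repository's notebook: the texts are invented, they are assumed to be already word-segmented (underscores joining the syllables of one word, in the style of VnCoreNLP output), and the `token_pattern` and `max_iter` settings are illustrative choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented, pre-segmented example texts; not rows from VNews8td.
train_texts = [
    "đội_tuyển bóng_đá thắng trận chung_kết",
    "cầu_thủ ghi bàn trong trận đấu",
    "học_sinh thi tốt_nghiệp trung_học",
    "trường đại_học công_bố điểm chuẩn",
]
train_labels = ["thethao", "thethao", "giaoduc", "giaoduc"]

# token_pattern=r"\S+" treats every whitespace-separated chunk as one token,
# so underscore-joined words such as "bóng_đá" stay intact.
model = make_pipeline(
    TfidfVectorizer(token_pattern=r"\S+", lowercase=True),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)

# Predict the class of a new, already segmented text.
predictions = model.predict(["cầu_thủ đá trận chung_kết"])
print(predictions)
```

Wrapping the vectorizer and classifier in one pipeline means the TF-IDF vocabulary is learned from the training texts only and reused unchanged at prediction time.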
- Clone the repository:

      git clone https://github.com/thanghd1112/Vietnamese-Text-Classification.git
      cd Vietnamese-Text-Classification
- Install required packages:

      pip install -r requirements.txt
- Download and set up VnCoreNLP:

      wget https://github.com/vncorenlp/VnCoreNLP/archive/refs/heads/master.zip
      unzip master.zip
      rm -f master.zip
      mv VnCoreNLP-master VnCoreNLP
- Preprocess and prepare the dataset:
  - Load the data from the provided link.
  - Split the data into training, validation, and test sets.
- Train the model:
  - Use the provided Jupyter Notebook (`VN_TextClassification.ipynb`) to preprocess, train, and evaluate your model.
- Evaluate:
  - Check accuracy and other metrics on the test set.
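
To make the evaluation step concrete, here is a dependency-free sketch of what accuracy and a per-class report compute. The function names and the labels are made up for illustration and are not results from VNews8td; libraries such as scikit-learn provide equivalents (`accuracy_score`, `classification_report`).

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every gold or predicted label."""
    tp = Counter()  # predicted label equals gold label
    fp = Counter()  # label was predicted, but gold was something else
    fn = Counter()  # label was gold, but something else was predicted
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = (precision, recall, f1)
    return report

# Invented labels for illustration only.
y_true = ["thethao", "thethao", "giaoduc", "kinhte"]
y_pred = ["thethao", "giaoduc", "giaoduc", "kinhte"]

print(f"accuracy: {accuracy(y_true, y_pred):.2f}")
for label, (p, r, f1) in per_class_report(y_true, y_pred).items():
    print(f"{label:10s} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Per-class precision and recall are worth checking alongside accuracy, since a single overall number can hide a class that the model rarely predicts correctly.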
Contributions are welcome! Feel free to submit a pull request or raise an issue if you encounter problems.