Vietnamese-Text-Classification

A Vietnamese text classification project built on the VNews8td dataset. Both the dataset and the code in this repository target document classification.

Dataset: VNews8td

The VNews8td dataset is a Vietnamese text classification dataset collected from VnExpress from 01/06/2023 to 01/06/2024. It contains articles categorized into eight classes, with each article including a title and a description.

Dataset Details:

Classes:
- doisong (Đời sống)
- giaoduc (Giáo dục)
- khoahoc (Khoa học)
- kinhte (Kinh tế)
- suckhoe (Sức khỏe)
- thegioi (Thế giới)
- thethao (Thể thao)
- thoisu (Thời sự)

Splits:
- Training Set: 70%
- Validation Set: 10%
- Test Set: 20%
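
The 70/10/20 proportions above can be reproduced with two successive calls to scikit-learn's `train_test_split`. This is a minimal sketch, not necessarily how the published splits were produced: the stratification by class, the fixed seed, the helper name `split_70_10_20`, and the toy documents are all assumptions made for illustration.

```python
from sklearn.model_selection import train_test_split


def split_70_10_20(texts, labels, seed=42):
    """Return (train, val, test), each a (texts, labels) pair, stratified by label."""
    # Hold out 20% of the full data as the test set.
    x_rest, x_test, y_rest, y_test = train_test_split(
        texts, labels, test_size=0.20, stratify=labels, random_state=seed
    )
    # 10% of the full data equals 12.5% of the remaining 80%.
    x_train, x_val, y_train, y_val = train_test_split(
        x_rest, y_rest, test_size=0.125, stratify=y_rest, random_state=seed
    )
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)


# Toy data (placeholders, not real VNews8td articles): 10 documents per class.
classes = ["doisong", "giaoduc", "khoahoc", "kinhte",
           "suckhoe", "thegioi", "thethao", "thoisu"]
texts = [f"{c} document {i}" for c in classes for i in range(10)]
labels = [c for c in classes for _ in range(10)]

train, val, test = split_70_10_20(texts, labels)
print(len(train[0]), len(val[0]), len(test[0]))
```

Stratifying keeps the eight classes in the same proportion in every split, which matters when some categories have fewer articles than others.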

Dataset Link:

Download the dataset

Project Overview

This project implements a text classification pipeline for Vietnamese documents using the VNews8td dataset. The primary steps include:

- Preprocessing Vietnamese text using VnCoreNLP.
- Extracting features with TF-IDF.
- Training models such as Logistic Regression for classification.
- Evaluating model performance with metrics like accuracy and classification reports.
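
The feature extraction and training steps can be sketched as a single scikit-learn pipeline. This is an illustration under assumptions, not the notebook's actual code: the toy documents are invented placeholders that are already word-segmented (multi-syllable words joined by underscores, as VnCoreNLP produces), and the `token_pattern` and `max_iter` settings are choices made for this sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy, already word-segmented documents (placeholders, not VNews8td samples).
train_texts = [
    "đội_tuyển thắng trận bóng_đá",
    "cầu_thủ ghi bàn trận chung_kết",
    "học_sinh thi tốt_nghiệp",
    "trường đại_học tuyển_sinh",
]
train_labels = ["thethao", "thethao", "giaoduc", "giaoduc"]

model = make_pipeline(
    # Split on whitespace only, so underscore-joined words stay single tokens.
    TfidfVectorizer(token_pattern=r"\S+"),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)

predicted = model.predict(["cầu_thủ bóng_đá ghi bàn"])
print(predicted[0])
```

Wrapping the vectorizer and the classifier in one pipeline ensures that the TF-IDF vocabulary is fitted on the training texts only and then reused unchanged for validation and test data.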

Installation

Clone the repository:

```
git clone https://github.com/thanghd1112/Vietnamese-Text-Classification.git
cd Vietnamese-Text-Classification
```

Install required packages:
```
pip install -r requirements.txt
```

Download and set up VnCoreNLP:

```
wget https://github.com/vncorenlp/VnCoreNLP/archive/refs/heads/master.zip
unzip master.zip
rm -f master.zip
mv VnCoreNLP-master VnCoreNLP
```
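
Once VnCoreNLP is in place, word segmentation can be wrapped in a small helper. The sketch below is hedged: the helper name `segment` is hypothetical, and it assumes the `vncorenlp` Python wrapper, whose `tokenize` method returns a list of sentences, each a list of tokens. The notebook may use a different wrapper. A stand-in annotator is included only so the example runs without Java.

```python
def segment(text, annotator):
    """Word-segment `text` and return one space-separated string.

    `annotator` is any object with a `tokenize(text)` method that returns
    a list of sentences, each a list of tokens.
    """
    sentences = annotator.tokenize(text)
    return " ".join(token for sentence in sentences for token in sentence)


class WhitespaceAnnotator:
    """Stand-in for demonstration only; it performs no real word segmentation."""

    def tokenize(self, text):
        return [text.split()]


print(segment("Học sinh   thi tốt nghiệp", WhitespaceAnnotator()))

# With the real toolkit (requires Java and `pip install vncorenlp`), the call
# would look roughly like this; adjust the jar file name to the one actually
# present in the VnCoreNLP folder:
#
#   from vncorenlp import VnCoreNLP
#   annotator = VnCoreNLP("VnCoreNLP/<jar file>", annotators="wseg")
#   print(segment("Học sinh thi tốt nghiệp", annotator))
```

Passing the annotator in as a parameter keeps the helper easy to test and lets the Java process be started once and reused for every document.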

Usage

Preprocess and prepare the dataset:
- Load the data from the provided link.
- Split the data into training, validation, and test sets (70%, 10%, and 20%, as listed under Dataset Details).

Train the model:
- Use the provided Jupyter Notebook (VN_TextClassification.ipynb) to preprocess the data, then train and evaluate your model.

Evaluate:
- Check accuracy and other metrics on the test set.
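
The evaluation step can be sketched with scikit-learn's metrics module. The labels below are invented placeholders rather than real model output, and `zero_division=0` is a choice made for this sketch so that classes with no predictions do not raise warnings.

```python
from sklearn.metrics import accuracy_score, classification_report

# Placeholder gold labels and predictions for five test articles.
y_true = ["thethao", "giaoduc", "thethao", "kinhte", "giaoduc"]
y_pred = ["thethao", "giaoduc", "kinhte", "kinhte", "giaoduc"]

accuracy = accuracy_score(y_true, y_pred)
report = classification_report(y_true, y_pred, zero_division=0)

print(f"Accuracy: {accuracy:.2f}")
print(report)
```

The classification report lists precision, recall, and F1 per class, which shows whether a high overall accuracy hides weak performance on individual categories.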

Contributing

Contributions are welcome! Feel free to submit a pull request or raise an issue if you encounter problems.
