Text Classification (Berita Harian)

This project uses a dataset scraped from the Berita Harian site. About 32,000 articles were scraped and split into training and testing datasets.

Requirements

Sastrawi
scikit-learn (sklearn)
pickle (included in the Python standard library)
numpy
matplotlib
seaborn
pandas

Installation

Using pip:

pip install Sastrawi scikit-learn numpy matplotlib seaborn pandas

Using conda:

conda install Sastrawi scikit-learn numpy matplotlib seaborn pandas

pickle ships with the Python standard library, so it does not need to be installed separately.

Note: Additional modules may be required to display graphics in Jupyter Notebook, but I did not actually use them.

Scraping

Using the search feature on Berita Harian's website, I searched for every three-letter string made from the letters a-z (repeats allowed, so 26 × 26 × 26 = 17,576 queries), as follows:

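from itertools import product
from string import ascii_lowercase

# Query every three-letter string from 'aaa' to 'zzz'.
# crawl() is the project's scraping function.
for combo in product(ascii_lowercase, repeat=3):
    crawl(query=''.join(combo))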

I go through 21 pages for each query, because the site apparently stops at a certain page; requesting pages beyond it just returns the previous page again.
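As a rough sketch of that paging logic, the loop below stops early once a page repeats instead of always fetching all 21. The names crawl_pages and fetch_page are hypothetical, not taken from scrap.py; fetch_page stands in for whatever function downloads one results page and returns its article links, and fake_fetch is only a stub used to show the behaviour.

```python
from typing import Callable, Iterator, List


def crawl_pages(query: str,
                fetch_page: Callable[[str, int], List[str]],
                max_pages: int = 21) -> Iterator[str]:
    """Yield article links for `query`, stopping early when a page repeats."""
    previous = None
    for page in range(1, max_pages + 1):
        links = fetch_page(query, page)
        # Past the last real page the site serves the previous page again,
        # so an identical (or empty) result means there is nothing new.
        if not links or links == previous:
            break
        yield from links
        previous = links


def fake_fetch(query: str, page: int) -> List[str]:
    # Stub: pretend the site has 3 real pages, then repeats the last one.
    return [f"{query}-{min(page, 3)}-{n}" for n in range(2)]


links = list(crawl_pages("abc", fake_fetch))
print(len(links))
```

With the stub above, only the three distinct pages are collected and the fourth request ends the loop.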

Note: You could use multiple threads to speed up the scraping process.
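One possible way to do this, as a minimal sketch, is the standard library's ThreadPoolExecutor. The helper crawl_all and its max_workers default are assumptions for illustration; crawl_fn stands in for the project's crawl function and is assumed here to return the number of articles it saved.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from string import ascii_lowercase


def crawl_all(crawl_fn, max_workers: int = 8) -> int:
    """Run crawl_fn for every three-letter query and sum the returned counts."""
    queries = (''.join(p) for p in product(ascii_lowercase, repeat=3))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() runs crawl_fn on the worker threads and yields results in order.
        return sum(pool.map(crawl_fn, queries))


# Stub crawler that "saves" one article per query, just to exercise the pool.
total = crawl_all(lambda query: 1)
print(total)
```

Threads suit this job because the work is mostly waiting on network responses. Keeping max_workers small also keeps the load on the site modest.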

Model training

Models were created, pre-trained on 15% of a set of at least 2,000 articles, and stored in the "Models" folder. A model can be loaded using pickle and trained using gbc.fit(features_train, labels_train). See the 'train.py' Python script for how it's done.
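The load-and-train step might look roughly like the sketch below. It assumes gbc is a scikit-learn GradientBoostingClassifier (suggested by the name of the Step3 notebook, but not confirmed here), uses a hypothetical file name in a temporary folder rather than a real file from "Models", and substitutes random numbers for the real article features and labels.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-ins for the real feature matrix and labels.
rng = np.random.default_rng(0)
features_train = rng.normal(size=(100, 5))
labels_train = (features_train[:, 0] > 0).astype(int)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "gbc.pickle"  # hypothetical file name

    # Save a classifier so that there is something to load.
    with open(model_path, "wb") as f:
        pickle.dump(GradientBoostingClassifier(random_state=0), f)

    # Load the model back with pickle.
    with open(model_path, "rb") as f:
        gbc = pickle.load(f)

# Train the loaded model and check that it can predict.
gbc.fit(features_train, labels_train)
predictions = gbc.predict(features_train)
print(predictions[:10])
```

Note that for a scikit-learn estimator, calling fit again retrains from scratch unless the model was created with warm_start=True. Only unpickle model files you trust, since loading a pickle can execute arbitrary code.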

