YouTube Predictor API built on top of Rahul Kapoor's Clickbait Detector
- Python 2.7.12
- Keras 1.2.1
- Tensorflow 0.12.1
- Numpy 1.11.1
- NLTK 3.2.1
Install a virtualenv in the project directory
virtualenv venv
Activate the virtualenv
On Windows:
venv\Scripts\activate
On Linux:
source venv/bin/activate
Install the requirements
pip install -r requirements.txt
Try it out! Try running one of the examples.
Training Accuracy after 25 epochs = 93.8 % (loss = 0.1484)
Validation Accuracy after 25 epochs = 90.15 % (loss = 0.2670)
$ python src/ "Novak Djokovic stunned as Australian Open title defence ends against Denis Istomin"
Using TensorFlow backend.
headline is 0.33 % clickbaity
$ python src/ "Just 22 Cute Animal Pictures You Need Right Now"
Using TensorFlow backend.
headline is 85.38 % clickbaity
$ python src/ " 15 Beautifully Created Doors You Need To See Before You Die. The One In Soho Blew Me Away"
Using TensorFlow backend.
headline is 52.29 % clickbaity
$ python src/ "French presidential candidate Emmanuel Macrons anti-system angle is a sham | Philippe Marlire"
Using TensorFlow backend.
headline is 0.05 % clickbaity
Layer (type)                      Output Shape     Param #   Connected to
embedding_1 (Embedding)           (None, 20, 30)   195000    embedding_input_1[0][0]
convolution1d_1 (Convolution1D)   (None, 19, 32)   1952      embedding_1[0][0]
batchnormalization_1 (BatchNorma  (None, 19, 32)   128       convolution1d_1[0][0]
activation_1 (Activation)         (None, 19, 32)   0         batchnormalization_1[0][0]
convolution1d_2 (Convolution1D)   (None, 18, 32)   2080      activation_1[0][0]
batchnormalization_2 (BatchNorma  (None, 18, 32)   128       convolution1d_2[0][0]
activation_2 (Activation)         (None, 18, 32)   0         batchnormalization_2[0][0]
convolution1d_3 (Convolution1D)   (None, 17, 32)   2080      activation_2[0][0]
batchnormalization_3 (BatchNorma  (None, 17, 32)   128       convolution1d_3[0][0]
activation_3 (Activation)         (None, 17, 32)   0         batchnormalization_3[0][0]
maxpooling1d_1 (MaxPooling1D)     (None, 1, 32)    0         activation_3[0][0]
flatten_1 (Flatten)               (None, 32)       0         maxpooling1d_1[0][0]
dense_1 (Dense)                   (None, 1)        33        flatten_1[0][0]
batchnormalization_4 (BatchNorma  (None, 1)        4         dense_1[0][0]
activation_4 (Activation)         (None, 1)        0         batchnormalization_4[0][0]
Total params: 201,533
Trainable params: 201,339
Non-trainable params: 194
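The parameter counts in the summary can be reproduced by hand. The sketch below is only an arithmetic check, not the project's model code. The vocabulary size of 6500 and the convolution kernel size of 2 are not stated anywhere in this README; they are inferred from the table (195000 / 30 = 6500, and each convolution shortens the sequence by one step).

```python
# Arithmetic check of the layer summary. vocab_size=6500 and kernel_size=2
# are inferred from the table, not taken from the project's source.

def embedding_params(vocab_size, dim):
    return vocab_size * dim

def conv1d_params(kernel_size, in_channels, filters):
    # One weight per (kernel position, input channel, filter) plus one bias per filter.
    return kernel_size * in_channels * filters + filters

def batchnorm_params(features):
    # gamma and beta are trainable; moving mean and variance are not.
    return 4 * features

def dense_params(inputs, units):
    return inputs * units + units

layers = [
    ("embedding_1", embedding_params(6500, 30)),
    ("convolution1d_1", conv1d_params(2, 30, 32)),
    ("batchnormalization_1", batchnorm_params(32)),
    ("convolution1d_2", conv1d_params(2, 32, 32)),
    ("batchnormalization_2", batchnorm_params(32)),
    ("convolution1d_3", conv1d_params(2, 32, 32)),
    ("batchnormalization_3", batchnorm_params(32)),
    ("dense_1", dense_params(32, 1)),
    ("batchnormalization_4", batchnorm_params(1)),
]

total = sum(count for _, count in layers)
# Half of every batch-normalization layer's parameters are moving statistics.
non_trainable = sum(
    count // 2 for name, count in layers if name.startswith("batchnormalization")
)
trainable = total - non_trainable
```

The activation, max-pooling and flatten layers have no parameters, so they are left out of the list.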
The dataset consists of about 12,000 headlines, half of which are clickbait. The clickbait headlines were fetched from BuzzFeed, Newsweek, The Times of India and The Huffington Post. The genuine (non-clickbait) headlines were fetched from The Hindu, The Guardian, The Economist, TechCrunch, The Wall Street Journal, National Geographic and The Indian Express.
Some of the data came from peterldowns's clickbait-classifier repository.
I used Stanford's GloVe pretrained embeddings, reduced to 30 dimensions with PCA, which sped up training.
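As a rough illustration of that reduction step, here is a minimal PCA sketch in NumPy. It is not the code used to prepare the embeddings for this project: the function name `pca_reduce` is hypothetical, and the random 50-dimensional matrix is only a stand-in for a real GloVe matrix, whose original dimension is not stated here.

```python
import numpy as np

def pca_reduce(embeddings, n_components=30):
    """Project row vectors onto their top principal components."""
    # PCA works on mean-centered data.
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered.dot(vt[:n_components].T)

# Random stand-in for a (vocabulary_size, original_dim) embedding matrix.
rng = np.random.RandomState(0)
fake_glove = rng.randn(200, 50)

reduced = pca_reduce(fake_glove, n_components=30)
```

The rows of `reduced` keep the same order as the input, so row `i` is still the vector for word `i` and can be loaded into a 30-dimensional embedding layer.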
To improve accuracy:
- Increase the embedding layer dimension (currently 30)
- Use more data
- Increase the vocabulary size
- Increase the maximum sequence length
- Do better data cleaning
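To make the vocabulary size, sequence length and cleaning knobs concrete, here is a small preprocessing sketch. It is an assumption about how such a pipeline could look, not the project's actual preprocessing: the helpers `clean`, `build_vocabulary` and `encode` are hypothetical, as are the reserved ids 0 (padding) and 1 (unknown word) and the choice to pad at the end.

```python
import re
from collections import Counter

def clean(headline):
    # Lowercase, replace anything other than letters, digits and apostrophes
    # with a space, then split on whitespace.
    return re.sub(r"[^a-z0-9' ]+", " ", headline.lower()).split()

def build_vocabulary(headlines, vocabulary_size):
    counts = Counter(word for h in headlines for word in clean(h))
    # Ids 0 and 1 are reserved, so only vocabulary_size - 2 words get their own id.
    most_common = counts.most_common(vocabulary_size - 2)
    return {word: i + 2 for i, (word, _) in enumerate(most_common)}

def encode(headline, vocabulary, max_sequence_length):
    # Map words to ids, truncate long headlines, pad short ones with 0.
    ids = [vocabulary.get(word, 1) for word in clean(headline)]
    ids = ids[:max_sequence_length]
    return ids + [0] * (max_sequence_length - len(ids))

headlines = [
    "Just 22 Cute Animal Pictures You Need Right Now",
    "15 Doors You Need To See",
]
vocabulary = build_vocabulary(headlines, vocabulary_size=6500)
encoded = encode("You need cute doors today!", vocabulary, max_sequence_length=20)
```

Raising `vocabulary_size` leaves fewer words mapped to the unknown id, and raising `max_sequence_length` truncates fewer long headlines; both also enlarge the model.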