📰 Fighting Fake News with machine learning! 🤖 Using source-based classification to detect misinformation using TF-IDF + RandomForest vs Embeddings + CNN. 🔍

sergio11/fake_news_classifier

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

15 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

📰 Fighting Misinformation: Source-Based Fake News Classification 🕵️‍♂️

Fake news spreads like wildfire in today’s fast-paced digital world 🌐, often distorting public perception and influencing opinions. As social media and online platforms become the primary sources of information for millions of people, the spread of misinformation has become a pressing concern ⚠️. Fake news can have serious consequences, from influencing elections 🗳️ to causing panic during crises.

This project focuses on classifying news by type and label using a source-based approach 🧑‍💻, which involves analyzing key source information such as the author ✍️, publication date 📅, and the reputation of the source itself 🏅. By leveraging structured data and machine learning techniques 🤖, we aim to create an automated system that can efficiently identify and classify news articles based on their authenticity ✅.

Our goal is to provide a reliable method for distinguishing between real and fake news 🔍, contributing to the fight against misinformation. This system not only helps improve news credibility but also empowers users with the tools needed to critically assess the information they consume, promoting a more transparent and trustworthy online news environment 🌍.

Through this project, we hope to raise awareness of the importance of verifying sources ✅ and foster the development of applications that can assist in combating fake news, ultimately paving the way for more informed, educated, and responsible news consumption in the digital age 📚.

🙏 I would like to extend my heartfelt gratitude to Santiago Hernández, an expert in Cybersecurity and Artificial Intelligence. His incredible course on Deep Learning, available at Udemy, was instrumental in shaping the development of this project. The insights and techniques learned from his course were crucial in crafting the neural network architecture used in this classifier.

🌟 Explore My Other Deep Learning Projects! 🌟

If you found this project intriguing, I invite you to check out my other cutting-edge deep learning initiatives:

How does social media respond to crises in real time? This project focuses on classifying tweets to determine if they’re related to disasters or not. Using deep learning and enriched datasets, I uncover insights into how people discuss disasters on platforms like Twitter.

Fraudulent transactions can cause immense financial losses. This project leverages Deep Neural Networks to detect fraud in financial data, even in highly imbalanced datasets. Learn about my process, from Exploratory Data Analysis (EDA) to building a scalable and efficient solution for fraud detection.

The Internet of Things (IoT) is transforming the world, but it also introduces new security challenges. This project leverages Deep Learning Neural Networks to classify and detect malware in IoT network traffic. By analyzing patterns with AI, it provides proactive cybersecurity solutions to safeguard interconnected devices. Explore the intricate process of model design and training with the Keras framework, detailed in the accompanying Jupyter Notebook.

This project uses a Bi-directional LSTM model 📧🤖 to classify emails as spam or legitimate, utilizing NLP techniques like tokenization, padding, and stopword removal. It aims to create an effective email classifier 💻📊 while addressing overfitting with strategies like early stopping 🚫.

🧠🚀 AI-Powered Brain Tumor Classification: Leveraging Deep Learning with CNNs and Transfer Learning to classify brain tumors from MRI scans, enabling fast and accurate diagnostics. 🌐⚡

Take a dive into these projects to see how deep learning is solving real-world problems and shaping the future of AI applications. Let's innovate together! 🚀

📊 About the Dataset

🔎 Context

Social media platforms are a treasure trove of content, with news being one of the most consumed categories. However, not all news is authentic. Fake news, whether posted by politicians, news outlets, or civilians, can have far-reaching consequences.

Challenges:

  • Manual classification of news is time-consuming and prone to bias.
  • Verifying authenticity remains a critical task in the fight against misinformation.

🔒 Source

Published paper: Source-Based Fake News Classification

🔧 Features

  • Preprocessed data from the Getting Real about Fake News dataset.
  • Eliminated skew for improved reliability.
  • Comprehensive inclusion of source information, including author names, publication dates, and labels.
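
As a rough illustration of how source information could feed a text classifier, the sketch below joins source metadata with the article text into one string. The field names (`author`, `site_url`, `published`, `title`, `text`) and the `build_source_text` helper are assumptions made for this example; check them against the actual dataset columns before reusing it.

```python
from datetime import datetime


def build_source_text(record: dict) -> str:
    """Join source metadata and article text into one string (hypothetical helper)."""
    # Normalise the author so the same name always yields the same single token.
    author = (record.get("author") or "unknown").strip().lower().replace(" ", "_")
    site = (record.get("site_url") or "unknown").strip().lower()
    try:
        # Keep only year and month so the model sees coarse time information.
        month = datetime.fromisoformat(record.get("published") or "").strftime("%Y-%m")
    except (ValueError, TypeError):
        month = "unknown"
    title = (record.get("title") or "").strip()
    body = (record.get("text") or "").strip()
    return f"author:{author} site:{site} date:{month} {title} {body}".strip()


# Invented record, used only to show the output format.
example = {
    "author": " Jane Doe ",
    "site_url": "Example.com",
    "published": "2016-10-26T21:41:00",
    "title": "Headline",
    "text": "Body text",
}
print(build_source_text(example))
```

Prefixing each field (`author:`, `site:`, `date:`) is one possible design choice: it keeps source tokens distinguishable from ordinary words once the string is vectorized.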

🚀 Motivation

In an age where fake WhatsApp forwards and misleading Tweets influence public opinion, it’s crucial to develop tools to:

  • Mitigate the spread of misinformation.
  • Inform users about the nature of the news they consume.

This project’s inspiration lies in creating:

  1. Practical applications to analyze and classify news articles.
  2. Plugins and tools for easy access to fact-checking.
  3. Awareness campaigns about the consequences of consuming and spreading fake news.

🌟 Highlights

  • Source-Based Labeling: Ensures credibility by tracking the origin of news articles.
  • Automation: Reduces human bias in classification.
  • Informed Consumption: Helps users make smarter decisions about the news they trust.

⚖️ Comparison of Approaches

In this project, two machine learning approaches are evaluated for classifying fake news:

  1. RandomForestClassifier using TF-IDF
  2. Embeddings + CNN (Convolutional Neural Networks)

1. RandomForestClassifier using TF-IDF

  • TF-IDF (Term Frequency-Inverse Document Frequency) is a traditional text vectorization technique that maps each document to a high-dimensional sparse vector. A term's weight grows with how often it appears in the document and shrinks with the number of documents in the corpus that contain it, so common words count for little and distinctive words count for more.

  • The RandomForestClassifier then uses this vectorized representation for classification. Random forests are an ensemble method that builds multiple decision trees and combines their outputs, typically resulting in a strong and reliable classifier.

    Pros:

    • Efficient and works well for smaller datasets.
    • Simple to implement and interpret.

    Cons:

    • The sparse representation of text doesn’t capture the semantic meaning of words or their contextual relationships.
    • May struggle with large datasets or when the relationships between words are complex.
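
The first approach can be sketched with scikit-learn as a two-step pipeline. This is a minimal illustration, not the repository's actual training code: the toy sentences, the `Fake`/`Real` label names, and the hyperparameters are all assumptions chosen for the example.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy corpus invented for illustration; it is not taken from the dataset.
texts = [
    "shocking miracle cure doctors hate",
    "you will not believe this shocking secret",
    "city council approves annual budget",
    "central bank holds interest rates steady",
]
labels = ["Fake", "Fake", "Real", "Real"]

# Step 1 turns raw text into sparse TF-IDF vectors,
# step 2 trains an ensemble of decision trees on those vectors.
model = make_pipeline(
    TfidfVectorizer(stop_words="english", sublinear_tf=True),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
model.fit(texts, labels)

# Inspect the sparse matrix: one row per document, one column per vocabulary term.
tfidf_matrix = model.named_steps["tfidfvectorizer"].transform(texts)
print(tfidf_matrix.shape)

prediction = model.predict(["shocking secret cure"])
print(prediction)
```

Wrapping both steps in one pipeline means the vectorizer is fitted only on the training texts, which avoids leaking vocabulary statistics from a held-out split.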

2. Embeddings + CNN (Convolutional Neural Networks)

  • Embeddings are dense, lower-dimensional vector representations of words that capture their semantic meaning. By mapping words with similar meanings closer together in a vector space, embeddings provide more context and depth compared to traditional vectorization methods like TF-IDF.

  • The CNN architecture is well-suited for text classification tasks. In this case, convolutional layers capture local patterns in the text, and pooling layers help reduce dimensionality. CNNs can learn more abstract and hierarchical features from text, which is useful in identifying subtle patterns and relationships that might indicate whether news is fake or real.

    Pros:

    • Better at capturing semantic relationships and context of words.
    • Suitable for large and complex datasets with nuanced patterns.
    • Can provide higher performance in text classification tasks.

    Cons:

    • Requires larger datasets for training.
    • Needs more computational resources and may take longer to train.
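
To make the mechanics concrete, here is a NumPy sketch of a single forward pass through an embedding lookup, one 1D convolution, global max pooling, and a sigmoid output. It uses random, untrained weights and hypothetical sizes; it is not the network trained in this project (that one is built with a deep learning framework), only an illustration of what each layer computes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen for the example.
vocab_size, embed_dim = 50, 8
kernel_size, num_filters = 3, 4  # each filter looks at 3 consecutive tokens

# Randomly initialised (untrained) parameters.
embedding = rng.normal(scale=0.1, size=(vocab_size, embed_dim))
filters = rng.normal(scale=0.1, size=(num_filters, kernel_size, embed_dim))
bias = np.zeros(num_filters)
dense_w = rng.normal(scale=0.1, size=num_filters)


def conv_features(token_ids: np.ndarray) -> np.ndarray:
    """Embed tokens, apply a 1D convolution with ReLU, then global max pooling."""
    x = embedding[token_ids]  # (seq_len, embed_dim) dense word vectors
    # Assumes seq_len >= kernel_size; real pipelines pad shorter sequences.
    n_windows = len(token_ids) - kernel_size + 1
    windows = np.stack([x[i:i + kernel_size] for i in range(n_windows)])
    # Each filter scores every window of consecutive tokens (a local pattern).
    conv = np.einsum("wkd,fkd->wf", windows, filters) + bias
    conv = np.maximum(conv, 0.0)  # ReLU
    # Pooling keeps the strongest response per filter, whatever its position.
    return conv.max(axis=0)  # (num_filters,)


def predict_proba(token_ids: np.ndarray) -> float:
    """Map pooled features to a probability with a sigmoid output unit."""
    logit = conv_features(token_ids) @ dense_w
    return float(1.0 / (1.0 + np.exp(-logit)))


tokens = np.array([1, 5, 7, 2, 9, 3])  # a tokenised article, as integer word ids
print(conv_features(tokens).shape)
print(predict_proba(tokens))
```

Global max pooling reduces a variable-length sequence to a fixed vector of `num_filters` values, which is why the classifier head does not depend on article length.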

Model Evaluation

  • Both approaches were trained and evaluated on the Getting Real about Fake News dataset.
  • The RandomForestClassifier using TF-IDF showed decent performance for basic tasks but struggled to capture deeper semantic meaning and context.
  • The Embeddings + CNN approach outperformed the traditional method in both training and testing accuracy, as it was able to better capture the relationships between words and classify news more effectively.
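
A comparison of training and testing accuracy like the one described above could be computed with a small helper such as the following. The `accuracy_report` function and the toy data are hypothetical, and the `DummyClassifier` baseline only stands in for a real model; any fitted estimator with a `predict` method would work the same way.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score


def accuracy_report(model, X_train, y_train, X_test, y_test) -> dict:
    """Return train accuracy, test accuracy, and the gap between them."""
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    # A large positive gap suggests the model is overfitting the training data.
    return {
        "train_accuracy": train_acc,
        "test_accuracy": test_acc,
        "gap": train_acc - test_acc,
    }


# Toy data invented for illustration; a majority-class baseline stands in for a real model.
X_train, y_train = [[0], [1], [2], [3]], ["Fake", "Fake", "Fake", "Real"]
X_test, y_test = [[4], [5]], ["Fake", "Real"]

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
report = accuracy_report(baseline, X_train, y_train, X_test, y_test)
print(report)
```

Reporting a majority-class baseline next to both models is a useful sanity check, since accuracy alone can look high on a skewed label distribution.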

Conclusion

The results of this comparison highlight the advantages of using Embeddings + CNN for more complex text classification tasks, especially in dealing with large, high-dimensional datasets. However, RandomForestClassifier using TF-IDF remains a useful and simpler tool for tasks where computational resources or training data are limited. This project shows that using a source-based approach combined with machine learning techniques can effectively aid in the detection of fake news.

✨ Let’s Fight Fake News Together! 🕵️‍♂️

🙏 Acknowledgments

  • Dataset: Getting Real about Fake News
    • Selected for its detailed inclusion of source information, crucial for verifying authenticity.
  • Special thanks to the creators and contributors of this dataset for enabling research in combating misinformation.

A huge thank you to ruchi798 for providing the dataset that made this project possible! 🌟 The dataset can be found on Kaggle. Your contribution is greatly appreciated! 🙌

Please Share & Star the repository to keep me motivated.

License ⚖️

This project is licensed under the MIT License, an open-source software license that allows developers to freely use, copy, modify, and distribute the software. 🛠️ This includes use in both personal and commercial projects, with the only requirement being that the original copyright notice is retained. 📄

Please note the following limitations:

  • The software is provided "as is", without any warranties, express or implied. 🚫🛡️
  • If you distribute the software, whether in original or modified form, you must include the original copyright notice and license. 📑
  • The license allows for commercial use, but you cannot claim ownership over the software itself. 🏷️

The goal of this license is to maximize freedom for developers while maintaining recognition for the original creators.

MIT License

Copyright (c) 2024 Dream software - Sergio Sánchez

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.