Department of Computer Science
Indian Institute of Technology Bombay
(2023-2025)
PROJECT REPORT
ON
“FAKE NEWS AND HATE SPEECH DETECTION WITH MACHINE LEARNING AND NLP”
Under the guidance of
Sunita Sarwangi
Submitted by:-
Shariq Faraz 23M0779 Aratrik Chandra 23M0786 Goutam Layek 23M0776 Sayantan Biswas 23M0804 Ankush Mitra 23M0755 Kesaba Trilochan Panda 23M0772
ACKNOWLEDGEMENT
We would like to express our sincere gratitude to my supervisor ….. for providing their invaluable guidance, comments and suggestions throughout the flow of the project. We deeply express our sincere thanks to our Department of CSE for encouraging and allowing us to present the project on the topic “ FAKE NEWS AND HATE SPEECH DETECTION”
We take this opportunity to thank all our TAs who have directly or indirectly helped our project.
Last but not the least we express our thanks to our friends for their cooperation and support.
Introduction
In an age of proliferation in digital channels, the rise of fake news poses a formidable challenge to the integrity of the live news ecosystem and detecting and mitigating the impact of misleading news has become a task which is important.
This work examines the dynamic interplay between hate speech detection and fake news classification, and discovers the relationship between these processes. Inspiration comes from the realization that hate speech, which is often the face of misinformation, can influence the creation and dissemination of fake news By understanding the relationship between these two elements
The project uses two separate data sets, one designed for hate speech detection and the other for fake news classification. The method is characterized by both major steps: first, in the hostile data group, BERT (Bidirectional encoder representations from Transformers) Model Training, Second, Integration of this Modal production of the production of this model is the base of the three trains of the specimens.
RELATED WORK
Research has received considerable attention on identifying and mitigating the effects of fake news and hate speech in a dynamic environment of misinformation and deception Literature has published approaches ranging from lexical feature extraction to machine learning as it goes to the depths. Several studies have used methods based on natural language processing (NLP) and deep learning to address the multifaceted challenges posed by misleading information.
Pinkesh Badjatia and others. (2017), Aditya Gaidhani (2018), and Ziki Zhang (2018) examine methods of lexical feature extraction, leveraging bag-of-words representation, and distinguishing between linguistic features for better detection of deceptive statements Incorrect that, Fazal Masud Kundi (2014) emphasizes emotional analysis to identify hate speech, the complexity of the underlying issues It acknowledges the limitations of relying solely on emotions to catch them.
In 2019, the work of Ankesh Anand introduced the concept of clickbait detection as a method to identify fraudulent titles using both NLP and deep learning techniques The research uses deep learning such as recurrent neural networks (RNN), long and short term memory (LSTM), and gated repetition units. highlight the effectiveness of the algorithms
Marina Danchovsky Ibrishimova (2020) introduces a novel approach to fake news detection, focusing on the prevalence of proper nouns. The study suggests that the ratio of proper nouns to other nouns can serve as a credibility indicator, albeit acknowledging the necessity of rigorous fact-checking.
According to the literature, the field is characterized by a rich set of methods and techniques, reflecting the severity of the challenge posed by fraudulent claims This literature review lays the foundation for our research , using machine learning models such as BERT, Random Forest, and SVM . Through this synthesis, we aim to provide new perspectives on the relationship between hate speech and fake news, enhance our understanding, and propose strategies to combat digital content the intensity of the injustice.
OBJECTIVE
The overall objective of this project is to investigate and evaluate the interplay between hate speech detection and fake news classification, and to identify the relationship between these two manifestations of deceptive digital content. The specific objectives are:
Methods for identifying hate speech to integrate:
Train a state-of-the-art BERT (Bidirectional Encoder Representations from Transformers) model on a dedicated hostile speech dataset to extract nuanced insights into potentially harmful speech patterns. The richness of the fake news dataset:
Integrate the results of the hate speech detection model with the fake news data set, and create an overview that includes both the hate speech detection method and the key elements of fake news data Sentence mapping for context understanding:
Map specific phrases to hate speech detection results, increasing contextual understanding of how hate speech affects narrative and fake news content. Training and comparison example:
Train different machine learning algorithms—BERT, Random Forest, and Support Vector Machine (SVM)—that use rich data to understand the impact of hate speech detection on the effectiveness of fake news classification the bottom of the table. Performance Analysis Criteria:
Use rigorous evaluation criteria including precision, accuracy, recall, and F1 scores to quantitatively assess the performance of each model and identify any noticeable improvements in the distribution of false positives Analysis generation and interpretation:
Extract insights into the dynamics of hate speech and its impact on the generation and spread of fake news. Interpret the results to understand the subtle relationship between detection of hate speech and classification of fake news.
Offer recommendations for further research and improvements in methodologies, aiming to advance the collective understanding of strategies for combating deceptive digital content.
By pursuing these objectives, this project aspires to contribute valuable insights to the broader domain of misinformation detection and fortify the collective efforts aimed at fostering information integrity in digital spaces.
Dataset Overview
Textual Content: The data set consists primarily of textual content, including sentences, phrases, or short paragraphs. This textual content forms the basis for training and testing machine learning models to detect hate speech.
Hate speech and offensive language to label: Patterns in the dataset are listed to indicate whether they contain hate speech or offensive language. These two classifications provide the basis for a supervised learning task, where the model learns to distinguish between harmful and harmless substances.
Differences in Language and Presentation Given the nature of hate speech and offensive speech, the dataset can reveal a variety of speech patterns, vulgarities, and potentially harmful features This diversity is important for robust models that can handle misleading speech the solution of the.
Extensive data set: In each case, the size of the data set, including the number of instances and the length of the text, influences the training and evaluation process. Sufficiently large data sets are needed for training models that adequately address language models.
Sources of information: The method of data collection sources and collections, although available on Kaggle, can include information about how the samples were written, any guidelines followed during the identification process, and any biases if possible available in the data structure
Data Preprocessing
Outline the preprocessing steps for both datasets Cleaning, handling missing values, tokenization, etc.
Hate Speech Detection Model Explain the training process for the BERT model on the hate speech dataset Highlight any challenges or unique aspects
Fake News Classification
Describe how the trained BERT model is used to classify the fake news dataset Mention the creation of a new dataset with classifications
New Dataset Overview
Present key statistics and characteristics of the new dataset Show any distribution of classes, if relevant
Model Training Briefly explain the training process for BERT, Random Forest Classifier, and SVM Mention key parameters, features, and tuning
RANDOM FOREST - PART 1
CONFUSION MATRIX
Precision recall f1-score support 0 0.88 0.96 0.92 107 1 0.95 0.85 0.90 93 accuracy
0.91 200 macro avg 0.92 0.91 0.91 200 Weighted avg 0.91 0.91 0.91 200
Accuracy : 91.25% f1_score : 0.90
RANDOM FOREST - PART 2
CONFUSION MATRIX
Precision recall f1-score support 0 0.84 0.96 0.90 107 1 0.95 0.78 0.86 93 accuracy
0.88 200 macro avg 0.89 0.87 0.88 200 Weighted avg 0.89 0.88 0.88 200
Accuracy : 88.0% f1_score : 0.879
SVM PART - 1
CONFUSION MATRIX
Precision recall f1-score support 0 0.98 0.96 0.97 1589 1 0.95 0.97 0.96 1411 accuracy
0.96 3000 macro avg 0.96 0.97 0.96 3000 Weighted avg 0.96 0.97 0.96 3000
Accuracy : 96.47% f1_score : 0.9628 Precision:0.9534 Recall:0.9723
SVM PART - 2
CONFUSION MATRIX
Precision recall f1-score support 0 0.94 0.95 0.94 160 1 0.94 0.93 0.94 140 accuracy
0.94 300 macro avg 0.94 0.94 0.94 300 Weighted avg 0.94 0.94 0.94 300
Accuracy : 94% f1_score : 0.935 Precision:0.928 Recall:0.928
Model Comparison Compare the performance of BERT, Random Forest Classifier, and SVM Discuss strengths and weaknesses of each model Page 12:
Challenges and Solutions Discuss any challenges faced during the project Explain how you overcame them or mitigated their impact
Future Work Propose potential enhancements or extensions to the project Suggest areas for further research
Conclusion Summarize the key findings and takeaways Reiterate the significance of your work
Acknowledgments and References