The Urdu Sarcasm Detection App is a web application that uses a trained machine learning model to classify Urdu text as sarcastic or non-sarcastic. Sarcasm is often nuanced and context-dependent, which makes it a hard case for sentiment analysis of user feedback and online content. This project gives users a tool for identifying sarcastic comments in Urdu text.
Follow these steps to set up the project:
- Clone the repository:

      git clone https://github.com/saadsohail05/urdu-sarcasm-detection.git
      cd urdu-sarcasm-detection

- Create a virtual environment (optional but recommended):

      python -m venv venv
      source venv/bin/activate  # On Windows use: venv\Scripts\activate

- Install the required libraries:

      pip install -r requirements.txt

- Download additional resources: Ensure you have the following files in the project directory:
  - `urdu_sentiment_model.pkl` (the trained model)
  - `urdu_sentiment_tfidf.pkl` (the TF-IDF vectorizer)
  - `stopwords-ur.txt` (list of Urdu stopwords)
  - `Dictionary_final.csv` (Urdu lemma dictionary)
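Before starting the app, it can help to confirm that the resource files are in place. The following is a minimal sketch, not part of the repository: the helper name `missing_resources` is hypothetical, and it assumes the files sit directly in the project directory as listed above.

```python
from pathlib import Path

# File names come from the list above. The helper itself is a
# hypothetical convenience and is not shipped with the project.
REQUIRED_FILES = [
    "urdu_sentiment_model.pkl",
    "urdu_sentiment_tfidf.pkl",
    "stopwords-ur.txt",
    "Dictionary_final.csv",
]


def missing_resources(project_dir="."):
    """Return the required files that are not present in project_dir."""
    root = Path(project_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_resources()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All required files are present.")
```

Running this from the project root prints the names of any files that still need to be downloaded.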
To run the application:
- Start the Streamlit app:

      streamlit run app.py

- Open your web browser and navigate to http://localhost:8501.
- Enter an Urdu sentence in the input box and click the Submit button to analyze the text for sarcasm.
User Input: "تم تو بہت اچھے ہو"
Prediction: Non-Sarcastic
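For readers who want to see roughly what happens between input and prediction, here is a hedged sketch of the inference step. It is not the code in `app.py`. The function names are hypothetical, the label mapping (1 for sarcastic, 0 for non-sarcastic) is an assumption, and the real app presumably runs its preprocessing pipeline on the text first.

```python
import pickle


def load_pickle(path):
    """Load a pickled object. Only unpickle files from a trusted source."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def predict_label(text, vectorizer, model):
    """Vectorize one sentence and map the model output to a label.

    Assumption: the model returns 1 for sarcastic and 0 for non-sarcastic.
    """
    features = vectorizer.transform([text])
    # Some classifiers (Gaussian Naive Bayes among them) need a dense
    # array, so sparse output is converted when possible.
    if hasattr(features, "toarray"):
        features = features.toarray()
    prediction = model.predict(features)[0]
    return "Sarcastic" if int(prediction) == 1 else "Non-Sarcastic"


# Possible usage, assuming the pickled files from the setup steps exist:
#   vectorizer = load_pickle("urdu_sentiment_tfidf.pkl")
#   model = load_pickle("urdu_sentiment_model.pkl")
#   print(predict_label("تم تو بہت اچھے ہو", vectorizer, model))
```

Passing the vectorizer and model in as arguments keeps the function easy to test with stand-in objects.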
- Sarcasm Detection: Analyze Urdu text for sarcasm using a trained machine learning model.
- Preprocessing Pipeline: Includes text normalization, emoji removal, stopwords filtering, and lemmatization.
- User-Friendly Interface: A modern and engaging web interface powered by Streamlit.
- Visual Feedback: Displays predictions with color-coded feedback for enhanced user experience.
The project relies on the following language resources:
- Stopwords: A list of Urdu stopwords sourced from `stopwords-ur.txt` and `urdu_stopwords.csv`.
- Lemma Dictionary: `Dictionary_final.csv` contains Urdu words and their respective lemmas, used for text normalization.
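The sketch below shows one way these two resources could be loaded and applied. The file layouts are assumptions, not facts from the repository: one stopword per line in `stopwords-ur.txt`, and word in the first column with lemma in the second (no header row) in `Dictionary_final.csv`. The function names are hypothetical.

```python
import csv


def load_stopwords(path="stopwords-ur.txt"):
    """Read one stopword per line into a set (assumed file layout)."""
    with open(path, encoding="utf-8") as fh:
        return {line.strip() for line in fh if line.strip()}


def load_lemmas(path="Dictionary_final.csv"):
    """Map each word to its lemma.

    Assumption: column 1 holds the word, column 2 its lemma, and there
    is no header row. Adjust this to the file's actual layout.
    """
    lemmas = {}
    with open(path, encoding="utf-8", newline="") as fh:
        for row in csv.reader(fh):
            if len(row) >= 2 and row[0].strip():
                lemmas[row[0].strip()] = row[1].strip()
    return lemmas


def lemmatize(tokens, lemmas):
    """Replace each token with its lemma, leaving unknown tokens as-is."""
    return [lemmas.get(tok, tok) for tok in tokens]
```

A dictionary lookup keeps lemmatization simple: words missing from the dictionary pass through unchanged.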
The application employs several techniques:
- Text Preprocessing: Cleans and prepares text for analysis through functions that remove numbers, emojis, hashtags, mentions, URLs, and stopwords.
- Machine Learning Model: A Gaussian Naive Bayes classifier trained on Urdu text data to predict sarcasm.
- TF-IDF Vectorization: Converts processed text into a vectorized format for model input.
The model has shown promising results in detecting sarcasm in Urdu text. Its performance can be evaluated with metrics such as a confusion matrix and a classification report.
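As a rough illustration of how TF-IDF features, a Gaussian Naive Bayes classifier, and these evaluation metrics fit together in scikit-learn, consider the sketch below. The sentences and labels are toy placeholders invented for this example. They are not the project's dataset, and the output says nothing about the real model's accuracy.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.naive_bayes import GaussianNB

# Toy placeholder data for illustration only (0 = non-sarcastic,
# 1 = sarcastic). This label convention is an assumption.
train_texts = ["wow great job", "great job team", "oh sure great", "oh sure wow"]
train_labels = [0, 0, 1, 1]
test_texts = ["great job", "oh sure"]
test_labels = [0, 1]

# GaussianNB needs dense input, so the sparse TF-IDF matrix is converted.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts).toarray()
X_test = vectorizer.transform(test_texts).toarray()

model = GaussianNB()
model.fit(X_train, train_labels)
predictions = model.predict(X_test)

matrix = confusion_matrix(test_labels, predictions, labels=[0, 1])
report = classification_report(
    test_labels,
    predictions,
    labels=[0, 1],
    target_names=["Non-Sarcastic", "Sarcastic"],
    zero_division=0,
)
print(matrix)
print(report)
```

Rows of the confusion matrix are true labels and columns are predicted labels. With a real held-out test set, the same two calls give per-class precision, recall, and F1.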
The Urdu Sarcasm Detection App addresses the challenge of identifying sarcasm in Urdu text and gives users insight into the tone of their content. It can be applied in domains such as social media monitoring and customer feedback analysis.
Potential improvements and extensions for the project include:
- Expanding the dataset for improved model accuracy.
- Integrating additional languages for sarcasm detection.
- Enhancing the user interface with more interactive features.
- Implementing real-time sarcasm analysis on social media platforms.
Contributions are welcome! If you would like to contribute, please follow these steps:
- Fork the repository.
- Create a new branch for your feature or fix.
- Make your changes and commit them.
- Submit a pull request with a detailed description of your changes.
This project is licensed under the MIT License. See the LICENSE file for details.