This project implements various text retrieval and relevance ranking models for analyzing a large dataset of news articles. The assignment is split into two parts:
- Part I focuses on parsing text data, building vocabulary, and ranking documents based on different models such as Bit Vector and TF-IDF.
- Part II introduces more advanced techniques using Word2Vec for document relevance.
The project is implemented in two Python files:
- `textretrieval.py`: Tasks 1-3 (Text Parsing, Bit Vector Model, TF-IDF Model).
- `Word2Vec-TFDF.py`: Task 4 (Word2Vec) and Task 5 (extra credit).
The dataset used for this project is the AG's News Topic Classification Dataset, specifically its `test.csv` file. Each article has a title, a description, and a class label; this project uses only the "description" field.
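Loading the description field can be sketched as follows. The column layout (class, title, description, with no header row) matches the standard AG News distribution; the inline two-row sample below stands in for the real file:

```python
import io
import pandas as pd

# AG News test.csv ships without a header row; columns are assumed to be
# class index, title, description. A two-row sample stands in for the file.
sample = io.StringIO(
    '3,"Wall St. Bears","Short-sellers see green again."\n'
    '1,"Olympics open","Athens welcomes the games."\n'
)
df = pd.read_csv(sample, header=None,
                 names=["class", "title", "description"])
descriptions = df["description"].tolist()
```

For the real dataset, replace `sample` with the path to `test.csv`.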
The project is built using Python 3, and the following libraries are required:
- Pandas
- NumPy
- NLTK
- Gensim (for Word2Vec)
To install the required dependencies, run `pip install -r requirements.txt`.
- Task 1: Text Data Parsing and Vocabulary Selection
  - Cleans and preprocesses the dataset by removing stop-words, punctuation, numbers, HTML tags, and excess whitespace.
  - Builds a vocabulary of the 200 most frequent words in the dataset.
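A minimal sketch of the cleaning and vocabulary steps; the tiny inline stop-word list stands in for NLTK's full list, and the helper names are illustrative:

```python
import re
from collections import Counter

# Small stand-in stop-word list; the project uses NLTK's full list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "on", "is"}

def clean(text):
    """Lower-case, then strip HTML tags, punctuation, digits, extra spaces."""
    text = re.sub(r"<[^>]+>", " ", text.lower())   # HTML tags
    text = re.sub(r"[^a-z\s]", " ", text)          # punctuation and numbers
    tokens = text.split()                          # collapses whitespace
    return [t for t in tokens if t not in STOP_WORDS]

def build_vocab(docs, size=200):
    """The `size` most frequent words across all cleaned documents."""
    counts = Counter(t for d in docs for t in clean(d))
    return [w for w, _ in counts.most_common(size)]

docs = ["<b>Stocks</b> fell on Friday.", "Athens hosts the Olympic games."]
vocab = build_vocab(docs)
```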
- Task 2: Document Relevance with Bit Vector Model
  - Implements a basic Vector Space Model (VSM) using a bit-vector representation: each document is a 0/1 vector over the vocabulary.
  - Computes a relevance score for each document against a query.
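Under the bit-vector VSM, the score reduces to counting shared query terms, i.e. the dot product of the two bit vectors. A sketch with illustrative names:

```python
def bit_vector(tokens, vocab):
    """0/1 vector: does each vocabulary word occur in the document?"""
    words = set(tokens)
    return [1 if w in words else 0 for w in vocab]

def score(doc_vec, query_vec):
    """Dot product = number of query terms present in the document."""
    return sum(d * q for d, q in zip(doc_vec, query_vec))

vocab = ["olympic", "gold", "athens", "stocks", "friday"]
doc = ["athens", "hosts", "olympic", "games"]
query = ["olympic", "gold", "athens"]
s = score(bit_vector(doc, vocab), bit_vector(query, vocab))  # 2 shared terms
```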
- Task 3: Document Relevance with TF-IDF Model
  - Implements the TF-IDF model with Okapi BM25 term-frequency weighting (without document length normalization).
  - Ranks documents by relevance to the provided queries.
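A sketch of BM25-style scoring without length normalization, assuming one common formulation (TF weight c(k+1)/(c+k), IDF ln((N+1)/df)); the exact formula in `textretrieval.py` may differ:

```python
import math

def bm25_no_len_norm(query, doc, doc_freq, n_docs, k=1.2):
    """BM25-style TF-IDF score with no length normalization (assumed form)."""
    counts = {}
    for t in doc:
        counts[t] = counts.get(t, 0) + 1
    score = 0.0
    for term in query:
        c = counts.get(term, 0)       # term count in this document
        df = doc_freq.get(term, 0)    # number of documents containing term
        if c == 0 or df == 0:
            continue
        tf = c * (k + 1) / (c + k)               # saturating TF weight
        idf = math.log((n_docs + 1) / df)        # IDF weight
        score += tf * idf
    return score

docs = [["stocks", "fell", "friday"], ["athens", "olympic", "gold"]]
doc_freq = {}
for d in docs:
    for t in set(d):
        doc_freq[t] = doc_freq.get(t, 0) + 1
q = ["olympic", "gold", "athens"]
scores = [bm25_no_len_norm(q, d, doc_freq, len(docs)) for d in docs]
```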
- Task 4: Document Relevance with Word2Vec
  - Uses pre-trained Word2Vec embeddings to compute word relevance.
  - Scores documents using the average log-likelihood of Word2Vec embeddings.
- Task 5 (Extra Credit): TF-IDF with Document Length Normalization
  - Extends the TF-IDF model from Task 3 by adding document length normalization.
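A sketch of the length-normalized variant, assuming the standard BM25 normalizer (1 - b + b * |d| / avgdl) with the conventional defaults k = 1.2 and b = 0.75; these parameter values are assumptions:

```python
import math

def bm25_len_norm(query, doc, doc_freq, n_docs, avg_len, k=1.2, b=0.75):
    """BM25 scoring with the standard document length normalizer."""
    counts = {}
    for t in doc:
        counts[t] = counts.get(t, 0) + 1
    # Longer-than-average documents get their TF weights dampened.
    norm = 1 - b + b * len(doc) / avg_len
    score = 0.0
    for term in query:
        c = counts.get(term, 0)
        df = doc_freq.get(term, 0)
        if c == 0 or df == 0:
            continue
        tf = c * (k + 1) / (c + k * norm)
        score += tf * math.log((n_docs + 1) / df)
    return score

# A long document repeating a query term scores below a short one.
docs = [["gold"] * 2, ["gold"] * 2 + ["filler"] * 18]
doc_freq = {"gold": 2, "filler": 1}
avg_len = sum(len(d) for d in docs) / len(docs)
scores = [bm25_len_norm(["gold"], d, doc_freq, len(docs), avg_len)
          for d in docs]
```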
To run the text parsing and relevance models:
- Tasks 1-3: `python textretrieval.py`
- Tasks 4-5: `python Word2Vec-TFDF.py`
Each script will process the dataset and output the top 5 most relevant documents and the bottom 5 least relevant documents based on the provided queries.
The following queries are used to test the models:
- Query 1: "olympic gold athens"
- Query 2: "reuters stocks friday"
- Query 3: "investment market prices"
The results for each query are printed to the console.
For each model, the output includes:
- Top 5 most relevant documents.
- Bottom 5 least relevant documents.
- Relevance scores for each document.
For any questions regarding the implementation, feel free to contact Letian Jiang.