This project implements various text retrieval and relevance ranking models for analyzing a large dataset of news articles. The assignment is split into two parts:
- Part I focuses on parsing text data, building vocabulary, and ranking documents based on different models such as Bit Vector and TF-IDF.
- Part II introduces more advanced techniques using Word2Vec for document relevance.
The project is implemented in two Python files:
- `textretrieval.py`: Tasks 1-3 (Text Parsing, Bit Vector Model, TF-IDF Model).
- `Word2Vec-TFDF.py`: Task 4 (Word2Vec) and Task 5 (extra credit).
The dataset used for this project is the AG's News Topic Classification Dataset, specifically its `test.csv` file. Each article has a title, a description, and a class label; this project uses only the "description" field.
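Loading the description field can be sketched as follows. The column layout (class, title, description, with no header row) matches the standard AG News distribution; the inline two-row sample below stands in for the real file:

```python
import io
import pandas as pd

# AG News test.csv ships without a header row; columns are assumed to be
# class index, title, description. A two-row sample stands in for the file.
sample = io.StringIO(
    '3,"Wall St. Bears","Short-sellers see green again."\n'
    '1,"Olympics open","Athens welcomes the games."\n'
)
df = pd.read_csv(sample, header=None,
                 names=["class", "title", "description"])
descriptions = df["description"].tolist()
```

For the real dataset, replace `sample` with the path to `test.csv`.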
The project is built using Python 3, and the following libraries are required:
- Pandas
- NumPy
- NLTK
- Gensim (for Word2Vec)
To install the required dependencies, run `pip install -r requirements.txt`.
- Task 1: Text Data Parsing and Vocabulary Selection
  - Cleans and preprocesses the dataset by removing stop-words, punctuation, numbers, HTML tags, and excess whitespace.
  - Builds a vocabulary of the 200 most frequent words in the dataset.
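A minimal sketch of the cleaning and vocabulary steps; the tiny inline stop-word list stands in for NLTK's full list, and the helper names are illustrative:

```python
import re
from collections import Counter

# Small stand-in stop-word list; the project uses NLTK's full list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "on", "is"}

def clean(text):
    """Lower-case, then strip HTML tags, punctuation, digits, extra spaces."""
    text = re.sub(r"<[^>]+>", " ", text.lower())   # HTML tags
    text = re.sub(r"[^a-z\s]", " ", text)          # punctuation and numbers
    tokens = text.split()                          # collapses whitespace
    return [t for t in tokens if t not in STOP_WORDS]

def build_vocab(docs, size=200):
    """The `size` most frequent words across all cleaned documents."""
    counts = Counter(t for d in docs for t in clean(d))
    return [w for w, _ in counts.most_common(size)]

docs = ["<b>Stocks</b> fell on Friday.", "Athens hosts the Olympic games."]
vocab = build_vocab(docs)
```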
- Task 2: Document Relevance with Bit Vector Model
  - Implements a basic Vector Space Model (VSM) using a bit-vector representation: each document is a 0/1 vector over the vocabulary.
  - Computes a relevance score for each document against a query.
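Under the bit-vector VSM, the score reduces to counting shared query terms, i.e. the dot product of the two bit vectors. A sketch with illustrative names:

```python
def bit_vector(tokens, vocab):
    """0/1 vector: does each vocabulary word occur in the document?"""
    words = set(tokens)
    return [1 if w in words else 0 for w in vocab]

def score(doc_vec, query_vec):
    """Dot product = number of query terms present in the document."""
    return sum(d * q for d, q in zip(doc_vec, query_vec))

vocab = ["olympic", "gold", "athens", "stocks", "friday"]
doc = ["athens", "hosts", "olympic", "games"]
query = ["olympic", "gold", "athens"]
s = score(bit_vector(doc, vocab), bit_vector(query, vocab))  # 2 shared terms
```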
- Task 3: Document Relevance with TF-IDF Model
  - Implements the TF-IDF model with Okapi BM25 term-frequency weighting (without document length normalization).
  - Ranks documents by relevance to the provided queries.
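A sketch of BM25-style scoring without length normalization, assuming one common formulation (TF weight c(k+1)/(c+k), IDF ln((N+1)/df)); the exact formula in `textretrieval.py` may differ:

```python
import math

def bm25_no_len_norm(query, doc, doc_freq, n_docs, k=1.2):
    """BM25-style TF-IDF score with no length normalization (assumed form)."""
    counts = {}
    for t in doc:
        counts[t] = counts.get(t, 0) + 1
    score = 0.0
    for term in query:
        c = counts.get(term, 0)       # term count in this document
        df = doc_freq.get(term, 0)    # number of documents containing term
        if c == 0 or df == 0:
            continue
        tf = c * (k + 1) / (c + k)               # saturating TF weight
        idf = math.log((n_docs + 1) / df)        # IDF weight
        score += tf * idf
    return score

docs = [["stocks", "fell", "friday"], ["athens", "olympic", "gold"]]
doc_freq = {}
for d in docs:
    for t in set(d):
        doc_freq[t] = doc_freq.get(t, 0) + 1
q = ["olympic", "gold", "athens"]
scores = [bm25_no_len_norm(q, d, doc_freq, len(docs)) for d in docs]
```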
- Task 4: Document Relevance with Word2Vec
  - Uses pre-trained Word2Vec embeddings to compute word relevance.
  - Scores documents using the average log-likelihood of Word2Vec embeddings.
- Task 5 (Extra Credit): TF-IDF with Document Length Normalization
  - Extends the TF-IDF model from Task 3 by adding document length normalization.
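A sketch of the length-normalized variant, assuming the standard BM25 normalizer (1 - b + b * |d| / avgdl) with the conventional defaults k = 1.2 and b = 0.75; these parameter values are assumptions:

```python
import math

def bm25_len_norm(query, doc, doc_freq, n_docs, avg_len, k=1.2, b=0.75):
    """BM25 scoring with the standard document length normalizer."""
    counts = {}
    for t in doc:
        counts[t] = counts.get(t, 0) + 1
    # Longer-than-average documents get their TF weights dampened.
    norm = 1 - b + b * len(doc) / avg_len
    score = 0.0
    for term in query:
        c = counts.get(term, 0)
        df = doc_freq.get(term, 0)
        if c == 0 or df == 0:
            continue
        tf = c * (k + 1) / (c + k * norm)
        score += tf * math.log((n_docs + 1) / df)
    return score

# A long document repeating a query term scores below a short one.
docs = [["gold"] * 2, ["gold"] * 2 + ["filler"] * 18]
doc_freq = {"gold": 2, "filler": 1}
avg_len = sum(len(d) for d in docs) / len(docs)
scores = [bm25_len_norm(["gold"], d, doc_freq, len(docs), avg_len)
          for d in docs]
```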
To run the text parsing and relevance models:
- Tasks 1-3: `python textretrieval.py`
- Tasks 4-5: `python Word2Vec-TFDF.py`
Each script will process the dataset and output the top 5 most relevant documents and the bottom 5 least relevant documents based on the provided queries.
The following queries are used to test the models:
- Query 1: "olympic gold athens"
- Query 2: "reuters stocks friday"
- Query 3: "investment market prices"
The results for each query are printed to the console.
For each model, the output includes:
- Top 5 most relevant documents.
- Bottom 5 least relevant documents.
- Relevance scores for each document.
For any questions regarding the implementation, feel free to contact Letian Jiang.