Text Retrieval and Relevance Ranking

Overview

This project implements various text retrieval and relevance ranking models for analyzing a large dataset of news articles. The assignment is split into two parts:

  1. Part I focuses on parsing the text data, building a vocabulary, and ranking documents with models such as the Bit Vector and TF-IDF models.
  2. Part II introduces a more advanced approach that uses Word2Vec embeddings to score document relevance.

[Figure: Vector Space Model]

[Figure: Word2Vec Model]

The project is implemented in two Python files:

  • textretrieval.py: Contains implementations for Tasks 1-3 (Text Parsing, Bit Vector Model, TF-IDF Model).
  • Word2Vec-TFDF.py: Contains implementations for Task 4 (Word2Vec) and Task 5 (extra credit).

Dataset

The dataset used for this project is the AG's News Topic Classification Dataset, specifically its test.csv file. Each news article has a title, a description, and a class label; this project focuses on the "description" field.
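
For reference, loading just the description field with pandas might look like the sketch below. The header-less three-column layout (class index, title, description) and the column names are assumptions about the CSV variant being used, not part of the file itself.

    import pandas as pd

    # The AG News CSVs commonly ship without a header row; the column names
    # below are supplied here and are an assumption about the file layout.
    df = pd.read_csv("test.csv", header=None,
                     names=["class_index", "title", "description"])

    # Only the "description" field is used for retrieval in this project.
    documents = df["description"].astype(str).tolist()
    print(f"Loaded {len(documents)} descriptions")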


Requirements

The project is built using Python 3, and the following libraries are required:

  • Pandas
  • NumPy
  • NLTK
  • Gensim (for Word2Vec)

To install the required dependencies, run:

pip install -r requirements.txt

Instructions

Part I: Text Parsing and Vector Space Models (textretrieval.py)

  1. Task 1: Text Data Parsing and Vocabulary Selection

    • Cleans and preprocesses the dataset by removing stop-words, punctuation, numbers, HTML tags, and excess whitespace.
    • Builds a vocabulary of the 200 most frequent words in the dataset (a minimal preprocessing sketch follows after this list).
  2. Task 2: Document Relevance with Bit Vector Model

    • Implements a basic Vector Space Model (VSM) using a bit-vector representation.
    • Computes relevance scores for documents based on the query.
  3. Task 3: Document Relevance with TF-IDF Model

    • Implements the TF-IDF model using Okapi BM25 weighting (without document length normalization).
    • Ranks documents by their relevance to the provided queries (see the scoring sketch for Tasks 2 and 3 after this list).
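
A minimal sketch of the Task 1 preprocessing and vocabulary selection, assuming NLTK stop-words and simple regular expressions (not necessarily the exact rules used in textretrieval.py):

    import re
    from collections import Counter

    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords", quiet=True)
    STOP_WORDS = set(stopwords.words("english"))

    def preprocess(text):
        """Lower-case, then strip HTML tags, punctuation, numbers, and extra whitespace."""
        text = re.sub(r"<[^>]+>", " ", text.lower())  # HTML tags
        text = re.sub(r"[^a-z\s]", " ", text)         # punctuation and digits
        tokens = text.split()                         # split() also collapses whitespace
        return [t for t in tokens if t not in STOP_WORDS]

    def build_vocabulary(documents, size=200):
        """Return the `size` most frequent words across all preprocessed documents."""
        counts = Counter()
        for doc in documents:
            counts.update(preprocess(doc))
        return [word for word, _ in counts.most_common(size)]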
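
The Task 2 bit-vector score and the Task 3 TF-IDF score could be computed roughly as follows. The IDF variant and k = 1.2 are assumptions based on the standard Okapi BM25 formula with the length-normalization term dropped; doc_freq is assumed to map each term to the number of documents containing it.

    import math
    from collections import Counter

    def bit_vector_score(query_tokens, doc_tokens, vocabulary):
        """Task 2: number of vocabulary words shared by the query and the document."""
        vocab = set(vocabulary)
        return len(set(query_tokens) & set(doc_tokens) & vocab)

    def bm25_no_length_norm(query_tokens, doc_tokens, doc_freq, n_docs, k=1.2):
        """Task 3: Okapi BM25 term weighting without document length normalization."""
        tf = Counter(doc_tokens)
        score = 0.0
        for term in set(query_tokens):
            if doc_freq.get(term, 0) == 0:
                continue
            idf = math.log((n_docs + 1) / doc_freq[term])       # simple IDF variant (assumption)
            score += idf * (k + 1) * tf[term] / (tf[term] + k)  # saturating TF weight
        return score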

Part II: Word2Vec Model (Word2Vec-TFDF.py)

  1. Task 4: Document Relevance with Word2Vec

    • Uses pre-trained Word2Vec word embeddings to compute word-level relevance.
    • Scores each document by the average log-likelihood derived from the Word2Vec embeddings (a hedged sketch follows after this list).
  2. Task 5 (Extra Credit): TF-IDF with Document Length Normalization

    • Extends the TF-IDF model from Task 3 by adding document length normalization (see the sketch after this list).
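
One plausible reading of the Task 4 scoring is sketched below using gensim's pre-trained vectors: each query word's "likelihood" given a document is approximated from cosine similarities to the document's words, mapped into (0, 1] before taking logs. Both the model name and that mapping are assumptions, not a transcription of Word2Vec-TFDF.py.

    import numpy as np
    import gensim.downloader as api

    # Pre-trained Word2Vec vectors from the gensim downloader (~1.6 GB download);
    # the exact embeddings used by the project may differ.
    wv = api.load("word2vec-google-news-300")

    def avg_log_likelihood(query_tokens, doc_tokens):
        """Average log-'likelihood' of the query words given the document's words."""
        vecs = [wv[t] for t in doc_tokens if t in wv]
        if not vecs:
            return float("-inf")
        doc_vecs = np.stack(vecs)
        doc_vecs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)

        logs = []
        for t in query_tokens:
            if t not in wv:
                continue
            q = wv[t] / np.linalg.norm(wv[t])
            cos = doc_vecs @ q                                     # similarity to each doc word
            prob = np.clip((1.0 + cos.mean()) / 2.0, 1e-12, 1.0)   # map [-1, 1] -> (0, 1]
            logs.append(np.log(prob))
        return float(np.mean(logs)) if logs else float("-inf")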
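
For Task 5, the length-normalization factor of standard Okapi BM25 can be folded back into the Task 3 score. The value b = 0.75 and the formula below follow the textbook BM25 definition and are assumptions about the extra-credit implementation.

    import math
    from collections import Counter

    def bm25_with_length_norm(query_tokens, doc_tokens, doc_freq, n_docs,
                              avg_doc_len, k=1.2, b=0.75):
        """Task 5: Okapi BM25 including the document length normalization term."""
        tf = Counter(doc_tokens)
        norm = 1.0 - b + b * len(doc_tokens) / avg_doc_len  # length normalization
        score = 0.0
        for term in set(query_tokens):
            if doc_freq.get(term, 0) == 0:
                continue
            idf = math.log((n_docs + 1) / doc_freq[term])
            score += idf * (k + 1) * tf[term] / (tf[term] + k * norm)
        return score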

How to Run

To execute the text parsing and relevance models:

  1. Run textretrieval.py for Tasks 1-3:

    python textretrieval.py
  2. Run Word2Vec-TFDF.py for Tasks 4-5:

    python Word2Vec-TFDF.py

Each script processes the dataset and, for each of the provided queries, outputs the top 5 most relevant and the bottom 5 least relevant documents.


Queries Tested

The following queries are used to test the models:

  • Query 1: "olympic gold athens"
  • Query 2: "reuters stocks friday"
  • Query 3: "investment market prices"

The results for each query are printed to the console.


Output

For each model, the output includes:

  • Top 5 most relevant documents.
  • Bottom 5 least relevant documents.
  • Relevance scores for each document.

Contact

For any questions regarding the implementation, feel free to contact Letian Jiang.
