This repository contains the implementation of a batch-based text search and filtering pipeline developed for a Big Data course assessed exercise. The pipeline, built on Apache Spark, ranks documents by relevance to user-defined queries and filters out near-duplicate documents, returning the top 10 documents for each query. The application processes a large set of text documents, applies preprocessing such as stopword removal and stemming, and scores documents with the DPH ranking model so that the returned results are both relevant and non-redundant.
- Document and Query Processing: Transforms text documents and queries by removing stopwords and applying stemming to standardize terms.
- DPH Scoring: Uses the DPH model, a parameter-free weighting scheme from the Divergence From Randomness framework, to rank documents by their relevance to each query (a sketch of the weight computation follows this list).
- Text Preprocessing: Incorporates a static text pre-processor to tokenize input text, remove stopwords, and apply stemming.
- Redundancy Removal: Analyzes document rankings to remove near-duplicate documents based on title similarity, ensuring diversity in the top documents returned for each query.
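For reference, the per-term DPH weight can be computed roughly as below. This is a minimal sketch following the common DPH formulation (e.g., as used in the Terrier IR platform); the exact statistics and constants used in this project follow the course specification, and the variable names here are illustrative. The per-query score of a document is then typically the sum or average of these weights over the query terms.

```java
/**
 * Minimal sketch of a DPH term weight (parameter-free Divergence From
 * Randomness model). Names and edge-case handling are illustrative only.
 */
public final class DPHSketch {

    public static double dphWeight(int termFrequencyInDoc,       // occurrences of the term in this document
                                   long termFrequencyInCorpus,   // total occurrences of the term in the corpus
                                   long documentLength,          // length of this document in terms
                                   double averageDocumentLength, // average document length in the corpus
                                   long totalDocsInCorpus) {     // number of documents in the corpus
        // No occurrences, or a degenerate single-term document: contribute nothing.
        if (termFrequencyInDoc == 0 || termFrequencyInDoc == documentLength) {
            return 0.0;
        }
        double f = (double) termFrequencyInDoc / documentLength;          // relative term frequency
        double norm = (1.0 - f) * (1.0 - f) / (termFrequencyInDoc + 1.0); // DPH normalisation factor

        return norm * (termFrequencyInDoc
                * log2((termFrequencyInDoc * averageDocumentLength / documentLength)
                        * ((double) totalDocsInCorpus / termFrequencyInCorpus))
                + 0.5 * log2(2.0 * Math.PI * termFrequencyInDoc * (1.0 - f)));
    }

    private static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }
}
```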
- `NewsArticleQueryDPHScores`: Encapsulates DPH scores for a news article query, linking relevant metadata and maintaining a map of scores.
- `NewsArticleDPHScores`: Represents DPH scores associated with a news article document.
- `NewsArticleDocument`: Represents a news article document, including a map of term counts.
- `TextTokenizer`: Transforms items within an RDD of NewsArticle objects into corresponding NewsArticleDocument objects.
- `TermGroupsCounter`: Formats term groups by counting total occurrences of each term group.
- `SumReducer`: Custom implementation to sum values of Tuple2<String, Integer> objects.
- `NewsArticleTextDistanceFiltering`: Removes redundant documents based on text similarity (a simplified sketch of this step follows this list).
- And others focused on implementing the core functionality of the pipeline.
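The redundancy-removal step can be pictured as follows: walk the ranking in descending DPH score order and keep a document only if its title is sufficiently different from every title already kept, stopping once 10 documents have been selected. The snippet below is a self-contained sketch under assumed names and thresholds, not the project's actual API; the real implementation works on scored news articles and uses the distance measure referenced by `NewsArticleTextDistanceFiltering`, whereas a simple Jaccard-style token distance stands in here for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative sketch of title-similarity filtering over a ranked list. */
public final class RedundancyFilterSketch {

    static final double MIN_TITLE_DISTANCE = 0.5; // assumed similarity threshold
    static final int TOP_K = 10;                  // number of documents returned per query

    /** Keep at most TOP_K titles whose distance to every kept title exceeds the threshold. */
    public static List<String> filterRankedTitles(List<String> titlesByDescendingScore) {
        List<String> kept = new ArrayList<>();
        for (String title : titlesByDescendingScore) {
            boolean nearDuplicate = kept.stream()
                    .anyMatch(keptTitle -> titleDistance(keptTitle, title) < MIN_TITLE_DISTANCE);
            if (!nearDuplicate) {
                kept.add(title);
            }
            if (kept.size() == TOP_K) {
                break;
            }
        }
        return kept;
    }

    /** Placeholder textual distance: 1 minus the Jaccard overlap of lower-cased tokens. */
    static double titleDistance(String a, String b) {
        Set<String> tokensA = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
        Set<String> tokensB = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
        Set<String> union = new HashSet<>(tokensA);
        union.addAll(tokensB);
        Set<String> intersection = new HashSet<>(tokensA);
        intersection.retainAll(tokensB);
        return union.isEmpty() ? 1.0 : 1.0 - (double) intersection.size() / union.size();
    }
}
```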
- Apache Spark
- Java SDK
- Clone the repository to your local machine.
- Ensure Apache Spark and the Java SDK are installed and configured.
- Navigate to the project directory.
- Download the dataset from https://drive.google.com/drive/folders/1ZUKFlzAPpEX9A71oCgwlXVyP3pJCa3j4?usp=sharing and place it in the `/data/` folder.
- Run the Spark application using the following command:

  spark-submit --class MainApplication target/your-jar-file.jar

  Replace `your-jar-file.jar` with the path to the compiled JAR file of the project.
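For example, during local development the job can be launched in Spark's local mode (`--master` and `--driver-memory` are standard `spark-submit` options; the values shown here are illustrative and should be adapted to your environment):

  spark-submit --master local[*] --driver-memory 4g --class MainApplication target/your-jar-file.jar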
The application is designed to process a corpus of news articles from the Washington Post, along with a set of user-defined queries. Two versions of the dataset are used: a local sample for development and testing, and a full dataset for evaluation.
This project was developed for the Big Data (H/M) Assessed Exercise. The implementation follows the specifications provided in the course materials, with a focus on code functionality, quality, and efficiency.
- Deepanshu Jain
- Philip Joseph
Special thanks to the course instructors and TAs for their guidance and support throughout the project.
This project is licensed under the MIT License - see the LICENSE file for details.