This project aims to deploy a Docker-based HDFS/Spark cluster and demonstrate its ability to perform distributed data storage and computation tasks. We will set up a three-node cluster, consisting of a master node and two worker nodes, where HDFS serves as the storage layer and Spark as the computation engine. The project consists of several key tasks, including HDFS setup, Spark deployment, and implementing specific benchmarks and PageRank computation.
Before proceeding, ensure the following software is installed:
- Docker: Follow instructions at Docker’s official website.
- Git: Required for cloning the repository and managing version control.
- Bash Shell: Available on most Unix-like systems (Linux, macOS).
- Clone the Repository:

  ```bash
  git clone https://github.com/realavocado/Computing-PageRank-on-Spark-RDD.git
  cd Computing-PageRank-on-Spark-RDD
  ```
- Start the Cluster: In terminal A, start the Docker cluster:

  ```bash
  bash start-all.sh
  ```
- Verify Cluster Status: After the cluster starts, use the following command to check that all nodes are properly connected:

  ```bash
  docker-compose -f cs511p1-compose.yaml exec main hdfs dfsadmin -report
  ```
- Run Tests: Once the cluster is verified, run the provided test scripts:

  ```bash
  bash test-all.sh
  ```
In this part, we configure the Hadoop Distributed File System (HDFS) across the three nodes. After setup, we verify that HDFS reports 3 live DataNodes and can perform basic file read/write operations; a small smoke test is sketched below. The TeraSort benchmark is also implemented to demonstrate HDFS performance under heavy load.
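As a quick sanity check, one can round-trip a small file through HDFS from the main node. The following is a minimal sketch, not part of the repository's test scripts: the paths are illustrative, and it assumes the `hdfs` CLI is on the PATH inside the main container.

```python
# Hypothetical HDFS smoke test, meant to run inside the main node's container.
# Assumes the `hdfs` CLI is on the PATH; all paths below are illustrative only.
import subprocess

def hdfs_dfs(*args: str) -> str:
    """Run an `hdfs dfs` subcommand and return its stdout."""
    result = subprocess.run(["hdfs", "dfs", *args],
                            check=True, capture_output=True, text=True)
    return result.stdout

# Write a small local file, copy it into HDFS, then read it back.
with open("/tmp/smoke.txt", "w") as f:
    f.write("hello hdfs\n")

hdfs_dfs("-mkdir", "-p", "/smoke")
hdfs_dfs("-put", "-f", "/tmp/smoke.txt", "/smoke/smoke.txt")
print(hdfs_dfs("-cat", "/smoke/smoke.txt"))  # expected output: hello hdfs
```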
Spark is deployed on top of the HDFS cluster to enable distributed data processing. This part includes setting up Spark, verifying that the master and worker executors can communicate, and running computation benchmarks such as Pi estimation and TeraSort; a minimal Pi estimation sketch is given below. Spark also reads data from HDFS and processes it within the cluster.
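Pi estimation is the classic Monte Carlo example bundled with Spark. The PySpark sketch below shows the idea; the app name and sample count are arbitrary choices of ours, not values from the project.

```python
# Monte Carlo Pi estimation on Spark: sample random points in the unit square
# and count the fraction that fall inside the quarter circle.
import random
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pi-estimation").getOrCreate()
sc = spark.sparkContext

NUM_SAMPLES = 1_000_000  # arbitrary; larger values give a tighter estimate

def inside(_):
    x, y = random.random(), random.random()
    return x * x + y * y < 1

count = sc.parallelize(range(NUM_SAMPLES)).filter(inside).count()
print(f"Pi is roughly {4.0 * count / NUM_SAMPLES}")
spark.stop()
```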
In this part, we process a large dataset of cap serial numbers, filtering out invalid entries and sorting the remainder. This exercises both HDFS storage and Spark's computational efficiency; a sketch of the filter-and-sort pipeline appears below.
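The validity rule for serial numbers is not spelled out here, so the sketch below uses a placeholder predicate (`is_valid`) and hypothetical HDFS paths; both are assumptions to be replaced with the project's actual rules and locations.

```python
# Hypothetical filter-and-sort pipeline over a serial-number dataset in HDFS.
# The paths and the validity predicate are placeholders, not project values.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cap-serials").getOrCreate()
sc = spark.sparkContext

def is_valid(serial: str) -> bool:
    # Placeholder rule: keep non-empty alphanumeric strings. Replace with the
    # project's actual validity criterion.
    return serial.isalnum()

lines = sc.textFile("hdfs://main:9000/data/cap_serials.txt")  # hypothetical path
valid_sorted = (lines.map(str.strip)
                     .filter(is_valid)
                     .sortBy(lambda s: s))
valid_sorted.saveAsTextFile("hdfs://main:9000/out/cap_serials_sorted")
spark.stop()
```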
This task involves implementing the PageRank algorithm to compute the importance of nodes in a network based on their connections. The network dataset is represented as a CSV file where each line denotes a link between two nodes. See PageRank Algorithm.
- The PageRank algorithm is a powerful tool for ranking nodes in a graph, widely used in search engines to rank websites based on their link structure.
- We optimized memory usage and performance by tuning Spark configurations and using efficient data structures in the PageRank computation.
- Input Dataset: The dataset is imported into HDFS; each record represents an edge, i.e., a connection between two nodes in the graph.
- Algorithm: We implement the PageRank algorithm in Spark using the following key parameters (a minimal sketch follows this list):
  - Damping factor (d): Set to 0.85; this models the probability of following an outgoing link rather than jumping to a random node, and prevents sinks in the graph from monopolizing the ranking.
  - Tolerance (ε): Set to 0.0001; iteration stops once the total change in rank between consecutive iterations falls below ε.
- Output: After computation, the PageRank value for each node is written to a CSV file. Results are sorted in descending order based on PageRank values, with ties broken by ascending node ID.
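For reference, each iteration applies the standard PageRank update

$$PR(v) = \frac{1 - d}{N} + d \sum_{u \to v} \frac{PR(u)}{\mathrm{outdeg}(u)}$$

until the total rank change falls below ε. The PySpark sketch below follows this recipe under our own assumptions: the input/output paths are hypothetical, the CSV is assumed to hold one `src,dst` edge per line, and details such as partitioning, checkpointing, and dangling-node handling are simplified relative to the project's tuned implementation.

```python
# Hypothetical PageRank sketch on Spark RDDs. Paths, CSV layout, and the
# handling of dangling nodes are assumptions; the tuned version may differ.
from pyspark.sql import SparkSession

D = 0.85        # damping factor, as in the project
EPSILON = 1e-4  # convergence tolerance, as in the project

spark = SparkSession.builder.appName("pagerank").getOrCreate()
sc = spark.sparkContext

# Each CSV line is assumed to encode one directed edge: "src,dst".
edges = (sc.textFile("hdfs://main:9000/data/edges.csv")   # hypothetical path
           .map(lambda line: tuple(line.strip().split(","))))

# Adjacency lists, cached because they are reused every iteration.
links = edges.groupByKey().mapValues(list).cache()
nodes = edges.flatMap(lambda e: e).distinct().cache()
n = nodes.count()

ranks = nodes.map(lambda v: (v, 1.0 / n))

while True:
    # Each node sends rank / outdegree to its out-neighbors.
    contribs = links.join(ranks).flatMap(
        lambda kv: [(dst, kv[1][1] / len(kv[1][0])) for dst in kv[1][0]])
    new_ranks = (contribs.reduceByKey(lambda a, b: a + b)
                         .mapValues(lambda s: (1 - D) / n + D * s))
    # Nodes with no incoming contributions keep the teleport term only.
    new_ranks = (nodes.map(lambda v: (v, 0.0))
                      .leftOuterJoin(new_ranks)
                      .mapValues(lambda p: p[1] if p[1] is not None
                                 else (1 - D) / n))
    # Stop once the total change drops below the tolerance.
    delta = (ranks.join(new_ranks)
                  .map(lambda kv: abs(kv[1][0] - kv[1][1]))
                  .sum())
    ranks = new_ranks
    if delta < EPSILON:
        break

# Sort by descending rank, ties broken by ascending node ID, and save as CSV.
result = ranks.sortBy(lambda kv: (-kv[1], kv[0]))
result.map(lambda kv: f"{kv[0]},{kv[1]}").saveAsTextFile(
    "hdfs://main:9000/out/pagerank")  # hypothetical path
spark.stop()
```

Caching the adjacency lists is the main memory/performance lever here: they are static across iterations, so recomputing them from the raw text file every pass would dominate the runtime.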