This project aims to deploy a Docker-based HDFS/Spark cluster and demonstrate its ability to perform distributed data storage and computation tasks. We will set up a three-node cluster, consisting of a master node and two worker nodes, where HDFS serves as the storage layer and Spark as the computation engine. The project consists of several key tasks, including HDFS setup, Spark deployment, and implementing specific benchmarks and PageRank computation.
Before proceeding, ensure the following software is installed:
- Docker: Follow instructions at Docker’s official website.
- Git: Required for cloning the repository and managing version control.
- Bash Shell: Available on most Unix-like systems (Linux, macOS).
- Clone the Repository:

  ```bash
  git clone https://github.com/realavocado/Computing-PageRank-on-Spark-RDD.git
  cd Computing-PageRank-on-Spark-RDD
  ```
- Start the Cluster: In terminal A, start the Docker cluster:

  ```bash
  bash start-all.sh
  ```
- Verify Cluster Status: After the cluster starts, use the following command to check that all nodes are properly connected:

  ```bash
  docker-compose -f cs511p1-compose.yaml exec main hdfs dfsadmin -report
  ```
- Run Tests: Once the cluster is verified, run the provided test scripts:

  ```bash
  bash test-all.sh
  ```
In this part, we configure the Hadoop Distributed File System (HDFS) across the three nodes. After setup, we verify that HDFS reports 3 live DataNodes and can perform basic file read/write operations; a small smoke test is sketched below. The TeraSort benchmark is also implemented to demonstrate HDFS performance under heavy load.
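As a quick sanity check, one can round-trip a small file through HDFS from the main node. The following is a minimal sketch, not part of the repository's test scripts: the paths are illustrative, and it assumes the `hdfs` CLI is on the PATH inside the main container.

```python
# Hypothetical HDFS smoke test, meant to run inside the main node's container.
# Assumes the `hdfs` CLI is on the PATH; all paths below are illustrative only.
import subprocess

def hdfs_dfs(*args: str) -> str:
    """Run an `hdfs dfs` subcommand and return its stdout."""
    result = subprocess.run(["hdfs", "dfs", *args],
                            check=True, capture_output=True, text=True)
    return result.stdout

# Write a small local file, copy it into HDFS, then read it back.
with open("/tmp/smoke.txt", "w") as f:
    f.write("hello hdfs\n")

hdfs_dfs("-mkdir", "-p", "/smoke")
hdfs_dfs("-put", "-f", "/tmp/smoke.txt", "/smoke/smoke.txt")
print(hdfs_dfs("-cat", "/smoke/smoke.txt"))  # expected output: hello hdfs
```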
Spark is deployed on top of the HDFS cluster to enable distributed data processing. This part includes setting up Spark, verifying that the master and worker executors can communicate, and running computation benchmarks such as Pi estimation and TeraSort; a minimal Pi estimation sketch is given below. Spark also reads data from HDFS and processes it within the cluster.
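Pi estimation is the classic Monte Carlo example bundled with Spark. The PySpark sketch below shows the idea; the app name and sample count are arbitrary choices of ours, not values from the project.

```python
# Monte Carlo Pi estimation on Spark: sample random points in the unit square
# and count the fraction that fall inside the quarter circle.
import random
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pi-estimation").getOrCreate()
sc = spark.sparkContext

NUM_SAMPLES = 1_000_000  # arbitrary; larger values give a tighter estimate

def inside(_):
    x, y = random.random(), random.random()
    return x * x + y * y < 1

count = sc.parallelize(range(NUM_SAMPLES)).filter(inside).count()
print(f"Pi is roughly {4.0 * count / NUM_SAMPLES}")
spark.stop()
```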
In this part, we process a large dataset of cap serial numbers, filtering out invalid entries and sorting the remainder. This exercises both HDFS storage and Spark's computational efficiency; a sketch of the filter-and-sort pipeline appears below.
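The validity rule for serial numbers is not spelled out here, so the sketch below uses a placeholder predicate (`is_valid`) and hypothetical HDFS paths; both are assumptions to be replaced with the project's actual rules and locations.

```python
# Hypothetical filter-and-sort pipeline over a serial-number dataset in HDFS.
# The paths and the validity predicate are placeholders, not project values.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cap-serials").getOrCreate()
sc = spark.sparkContext

def is_valid(serial: str) -> bool:
    # Placeholder rule: keep non-empty alphanumeric strings. Replace with the
    # project's actual validity criterion.
    return serial.isalnum()

lines = sc.textFile("hdfs://main:9000/data/cap_serials.txt")  # hypothetical path
valid_sorted = (lines.map(str.strip)
                     .filter(is_valid)
                     .sortBy(lambda s: s))
valid_sorted.saveAsTextFile("hdfs://main:9000/out/cap_serials_sorted")
spark.stop()
```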
This task involves implementing the PageRank algorithm to compute the importance of nodes in a network based on their connections. The network dataset is represented as a CSV file where each line denotes a link between two nodes. See PageRank Algorithm.
- The PageRank algorithm is a powerful tool for ranking nodes in a graph, widely used in search engines to rank websites based on their link structure.
- We optimized memory usage and performance by tuning Spark configurations and using efficient data structures in the PageRank computation.
- Input Dataset: The dataset is imported into HDFS; each record represents an edge, i.e., a connection between two nodes in the graph.
- Algorithm: We implement the PageRank algorithm in Spark using the following key parameters (a minimal sketch follows this list):
  - Damping factor (d): Set to 0.85; this models the probability of following an outgoing link rather than jumping to a random node, and prevents sinks in the graph from monopolizing the ranking.
  - Tolerance (ε): Set to 0.0001; iteration stops once the total change in rank between consecutive iterations falls below ε.
- Output: After computation, the PageRank value for each node is written to a CSV file. Results are sorted in descending order based on PageRank values, with ties broken by ascending node ID.
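For reference, each iteration applies the standard PageRank update

$$PR(v) = \frac{1 - d}{N} + d \sum_{u \to v} \frac{PR(u)}{\mathrm{outdeg}(u)}$$

until the total rank change falls below ε. The PySpark sketch below follows this recipe under our own assumptions: the input/output paths are hypothetical, the CSV is assumed to hold one `src,dst` edge per line, and details such as partitioning, checkpointing, and dangling-node handling are simplified relative to the project's tuned implementation.

```python
# Hypothetical PageRank sketch on Spark RDDs. Paths, CSV layout, and the
# handling of dangling nodes are assumptions; the tuned version may differ.
from pyspark.sql import SparkSession

D = 0.85        # damping factor, as in the project
EPSILON = 1e-4  # convergence tolerance, as in the project

spark = SparkSession.builder.appName("pagerank").getOrCreate()
sc = spark.sparkContext

# Each CSV line is assumed to encode one directed edge: "src,dst".
edges = (sc.textFile("hdfs://main:9000/data/edges.csv")   # hypothetical path
           .map(lambda line: tuple(line.strip().split(","))))

# Adjacency lists, cached because they are reused every iteration.
links = edges.groupByKey().mapValues(list).cache()
nodes = edges.flatMap(lambda e: e).distinct().cache()
n = nodes.count()

ranks = nodes.map(lambda v: (v, 1.0 / n))

while True:
    # Each node sends rank / outdegree to its out-neighbors.
    contribs = links.join(ranks).flatMap(
        lambda kv: [(dst, kv[1][1] / len(kv[1][0])) for dst in kv[1][0]])
    new_ranks = (contribs.reduceByKey(lambda a, b: a + b)
                         .mapValues(lambda s: (1 - D) / n + D * s))
    # Nodes with no incoming contributions keep the teleport term only.
    new_ranks = (nodes.map(lambda v: (v, 0.0))
                      .leftOuterJoin(new_ranks)
                      .mapValues(lambda p: p[1] if p[1] is not None
                                 else (1 - D) / n))
    # Stop once the total change drops below the tolerance.
    delta = (ranks.join(new_ranks)
                  .map(lambda kv: abs(kv[1][0] - kv[1][1]))
                  .sum())
    ranks = new_ranks
    if delta < EPSILON:
        break

# Sort by descending rank, ties broken by ascending node ID, and save as CSV.
result = ranks.sortBy(lambda kv: (-kv[1], kv[0]))
result.map(lambda kv: f"{kv[0]},{kv[1]}").saveAsTextFile(
    "hdfs://main:9000/out/pagerank")  # hypothetical path
spark.stop()
```

Caching the adjacency lists is the main memory/performance lever here: they are static across iterations, so recomputing them from the raw text file every pass would dominate the runtime.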