Project 3 for CS585/DS503: Big Data Management
Simple Python code generates the two datasets. PySpark in a Jupyter notebook was used for the queries.
The logic of the Python and Scala implementations is identical. The pseudocode for Problems 2.B, 2.C, and 2.D is outlined below, and the differences in execution times are listed.
The data generation process is carried out in Python with a generator, assuming a uniform distribution of points. To obtain a dataset of roughly 100 MB, 8,000,000 coordinates were generated in the (x,y) format.
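A minimal sketch of such a generator is shown below. The coordinate range (1 to 10,000) and the `x,y` line format are assumptions, as the original bounds are not stated.

```python
import random

def generate_points(n, max_coord=10000, seed=None):
    """Yield n uniformly distributed integer (x, y) coordinate pairs.

    The max_coord bound of 10,000 is an assumption, not taken from the project.
    """
    rng = random.Random(seed)
    for _ in range(n):
        yield rng.randint(1, max_coord), rng.randint(1, max_coord)

def write_dataset(path, n, max_coord=10000, seed=None):
    """Write n points to a file, one 'x,y' line per point."""
    with open(path, "w") as f:
        for x, y in generate_points(n, max_coord, seed):
            f.write(f"{x},{y}\n")
```

Using a generator keeps memory usage flat: points are produced and written one at a time rather than materialized as a list of 8,000,000 tuples.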
The SparkContext reads the data from the file, and a mapper function converts each line into a tuple. Regions are identified by a pair of coordinates instead of a single integer to simplify the neighborhood computations: cell 1 corresponds to (0,0), cell 2 to (1,0), and so on. The pipeline then proceeds as follows:

1. The region id of each point, in coordinate form, is computed with a map function.
2. The count for each region is computed with a reduceByKey, where the key is the region id.
3. For each region, a flatMap over the counts emits 9 replicas: one for the region itself and one for each of its 8 neighbors.
4. A combineByKey converts the data into a form suitable for the relative density computation: the first entry of each tuple is a region id, and the second is the list of counts within that region's neighborhood. A sample record at this point: `((154,349), ListBuffer(50.0, 38.0, 21.0, 23.0, 30.0, 25.0, 32.0, 32.0, 30.0))`.
5. The relative density of each region is computed as the mean of the neighborhood counts from the previous step.
6. The relative densities are sorted to retrieve the top k grid cells.
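The steps above can be sketched in plain Python (the project itself runs them as PySpark transformations). The cell size of 20 and the 1-based point coordinates are assumptions:

```python
from collections import defaultdict

CELL_SIZE = 20  # assumed side length of a grid cell (not stated in the project)

def region_of(x, y):
    """Map a point to its region id expressed as (col, row) coordinates."""
    return ((x - 1) // CELL_SIZE, (y - 1) // CELL_SIZE)

def relative_densities(points, top_k):
    """Return the top_k (region, relative density) pairs, densest first."""
    # Count points per region (the map + reduceByKey stages).
    counts = defaultdict(int)
    for x, y in points:
        counts[region_of(x, y)] += 1
    # Each region replicates its count to itself and its 8 neighbors
    # (the flatMap stage); the per-region lists mirror the combineByKey output.
    neighborhood = defaultdict(list)
    for (cx, cy), c in counts.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                neighborhood[(cx + dx, cy + dy)].append(c)
    # Relative density = mean of the collected neighborhood counts,
    # then sort descending and keep the top k cells.
    densities = {r: sum(cs) / len(cs) for r, cs in neighborhood.items()}
    return sorted(densities.items(), key=lambda kv: -kv[1])[:top_k]
```

The nested `dx`/`dy` loop plays the role of the 9-replica flatMap: each region's count is pushed to every cell in its 3x3 neighborhood, so a cell's list ends up holding the counts of its own neighborhood.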
The same operations as in Problem 2.C are used. A new RDD is created in which the relative density data is sorted by region id. For each region in the top k, the neighbors are computed, and lookup on the sorted data retrieves their relative densities.
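This lookup step can be sketched as follows, assuming a dict from region id to relative density stands in for the RDD sorted by region id:

```python
def neighbors_of(region):
    """Return the 8 neighboring region ids of a (col, row) region."""
    cx, cy = region
    return [(cx + dx, cy + dy)
            for dx in (-1, 0, 1) for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0)]

def neighbor_densities(top_regions, density_by_region):
    """For each top-k region, collect the relative densities of its neighbors.

    density_by_region is assumed to map region id -> relative density.
    """
    result = {}
    for region in top_regions:
        result[region] = [(n, density_by_region[n])
                          for n in neighbors_of(region)
                          if n in density_by_region]
    return result
```

The membership check skips neighbors that fall outside the grid or have no recorded density, mirroring a lookup that returns no match.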
The data is converted to a format where the first entry is the count and the second is the region id in coordinate form. The region ids are then grouped by count, so that each record holds a count and the set of region ids sharing it. POP-NEIGHBORS is computed by iterating over each such set and checking whether the neighbors of the region at hand are also in the set; every neighbor found in the set is saved in a list.
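A plain-Python sketch of this step, assuming `counts_by_region` maps a region id `(cx, cy)` to its point count:

```python
from collections import defaultdict

def pop_neighbors(counts_by_region):
    """For each region, list its neighboring regions with the same count."""
    # Group region ids into a set per count value (the group-by-count stage).
    regions_by_count = defaultdict(set)
    for region, count in counts_by_region.items():
        regions_by_count[count].add(region)
    result = {}
    for count, regions in regions_by_count.items():
        for (cx, cy) in regions:
            # Keep every neighbor that appears in the same count set.
            result[(cx, cy)] = [(cx + dx, cy + dy)
                                for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                                if (dx, dy) != (0, 0)
                                and (cx + dx, cy + dy) in regions]
    return result
```

Storing the region ids in a set makes each neighbor check an O(1) membership test instead of a scan over all regions with that count.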
| Problem | Scala | Python |
|---|---|---|
| 2.B | 20 | 78 |
| 2.C | 37 | 185 |
| 2.D | 15 | 84 |
As the results above show, the Scala code outperformed the Python implementation on every query. The Scala code runs natively on the JVM, where Spark itself executes, while PySpark must serialize data back and forth between the JVM and Python worker processes, so the Python code runs much slower.