
BDM-P3

Project 3 for CS585/DS503: Big Data Management

Serdarcan Dilbaz and Hamidullah Sakhi

Problem 1: SparkSQL for Processing Purchase Transactions (Hamidullah Sakhi)

Data Generation

Simple Python code generates the two datasets.
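
For illustration, a generator along these lines might look as follows; the field names, value ranges, and CSV layout here are assumptions for the sketch, not the project's actual schema:

```python
import csv
import random

# Hypothetical sketch of a two-dataset generator: all field names and
# value ranges below are illustrative assumptions.
def generate_customers(path, n=100, seed=0):
    rng = random.Random(seed)
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        for cid in range(1, n + 1):
            w.writerow([cid,
                        f"Customer{cid}",
                        rng.randint(18, 100),                      # age
                        round(rng.uniform(100.0, 10_000.0), 2)])   # salary

def generate_purchases(path, n=1000, n_customers=100, seed=1):
    rng = random.Random(seed)
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        for tid in range(1, n + 1):
            w.writerow([tid,
                        rng.randint(1, n_customers),               # customer id
                        round(rng.uniform(10.0, 2000.0), 2),       # transaction total
                        rng.randint(1, 15)])                       # number of items
```

Writing plain CSV keeps the files directly loadable into Spark DataFrames for the queries.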

SparkSQL

We used PySpark in a Jupyter notebook for the queries.

Problem 2 Spark-RDDs: Scala for Computing Relative-Density Scores at Scale (Serdarcan Dilbaz)

The logic of the Python and Scala implementations is identical. The pseudocode for Problems 2.B, 2.C, and 2.D is outlined below, along with the differences in execution time.

Problem 2.A: Create the Datasets

The data generation is carried out in Python with a generator, and a uniform distribution of the points is assumed. To produce a dataset of about 100 MB, 8,000,000 coordinates were generated in (x, y) format.
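
A hedged sketch of such a generator; the 10,000 × 10,000 coordinate bounds and integer coordinates are assumptions, not details from the actual script:

```python
import random

def generate_points(n, width=10_000, height=10_000, seed=None):
    """Lazily yield n uniformly distributed (x, y) points, using a
    Python generator as described above.  The bounds are assumed."""
    rng = random.Random(seed)
    for _ in range(n):
        yield rng.randint(1, width), rng.randint(1, height)
```

Using a generator keeps memory flat: points can be streamed straight to a file instead of materializing all 8,000,000 in a list.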

Problem 2.B: Report the TOP 10 grid cells with the highest Relative-Density Scores

The SparkContext is used to read the data from file, and a map function converts each line into an (x, y) tuple. Regions are identified by a pair of coordinates instead of a single integer to simplify the neighborhood computations: cell 1 corresponds to (0,0), cell 2 to (1,0), and so on. The coordinate-form region ids are computed with a map function, and the count for each region is obtained with reduceByKey, where the key is the region id. A flatMap over the counts then generates nine replicas of each region's count, one for the region itself and one for each of its neighbors, and combineByKey converts the data into a form suitable for the relative-density computation: the first entry of each tuple is a region id, and the second is the list of counts within that region's neighborhood. A sample record at this point: ((154,349), ListBuffer(50.0, 38.0, 21.0, 23.0, 30.0, 25.0, 32.0, 32.0, 30.0)). The relative density is then computed from the mean of these neighborhood counts, and the results are sorted to retrieve the top 10 grid cells.
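
A minimal pure-Python analogue of these steps, with the RDD stages mapped to comments; the 500 × 500 grid, 20-unit cell size, and the exact score formula (a cell's own count divided by the mean count of its neighbors) are assumptions for the sketch, not details taken from the actual code:

```python
from collections import defaultdict

def neighbor_cells(cx, cy, grid=500):
    """The up-to-8 grid cells adjacent to (cx, cy), staying inside the grid."""
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            if dx == dy == 0:
                continue
            nx, ny = cx + dx, cy + dy
            if 0 <= nx < grid and 0 <= ny < grid:
                yield nx, ny

def top_relative_density(points, cell_size=20, grid=500, k=10):
    # map + reduceByKey: count points per region, keyed by a coordinate pair
    counts = defaultdict(int)
    for x, y in points:
        cell = (min(int(x // cell_size), grid - 1),
                min(int(y // cell_size), grid - 1))
        counts[cell] += 1
    # flatMap: replicate each region's count to its neighbors;
    # combineByKey: collect the counts seen in each region's neighborhood
    gathered = defaultdict(list)
    for cell, c in counts.items():
        for ncell in neighbor_cells(*cell, grid=grid):
            gathered[ncell].append(c)
    # score each region: own count over the mean neighbor count (assumed formula)
    scores = {cell: counts.get(cell, 0) / (sum(cs) / len(cs))
              for cell, cs in gathered.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]
```

The coordinate-pair keys make the neighbor step a pair of nested ±1 loops, which is the simplification the write-up describes.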

Problem 2.C: Report the TOP k grid cells w.r.t. their Relative-Density Scores

The same operations as in Problem 2.B are used. A new RDD is created in which the relative-density data is sorted by region id. For each of the top-k regions, the neighbors are computed, and lookup is used on the sorted data to retrieve their relative densities.
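
A hedged plain-Python rendering of this lookup step, where a dict keyed by region coordinates stands in for the sorted RDD and `dict.get` stands in for Spark's `lookup()`; the grid bound is an illustrative assumption:

```python
def neighbor_densities(density, top_cells, grid=500):
    """For each top-k region, gather the relative densities of its
    neighbors.  `density` plays the role of the RDD sorted by region
    id; missing neighbors default to 0.0."""
    report = {}
    for cx, cy in top_cells:
        nbrs = [(cx + dx, cy + dy)
                for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                if (dx, dy) != (0, 0)
                and 0 <= cx + dx < grid and 0 <= cy + dy < grid]
        report[(cx, cy)] = {n: density.get(n, 0.0) for n in nbrs}
    return report
```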

Problem 2.D: Report groups of similarly populated cells and their connectedness

The data is first converted to a format where the first entry is the count and the second is the region id in coordinate form. The region ids are then grouped by count, yielding records whose first entry is a count and whose second entry is the set of region ids sharing that count. POP-NEIGHBORS is computed by iterating over each such set and checking whether the neighbors of the region at hand are also in the set; those that are get collected into a list.
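
A small pure-Python sketch of this grouping step; here `counts` maps region coordinates to point counts, and the grid bound is an assumption for the example:

```python
from collections import defaultdict

def pop_neighbors(counts, grid=500):
    """Group regions by their point count, then keep, for each region,
    only the neighbors that fall in the same count group (POP-NEIGHBORS)."""
    by_count = defaultdict(set)
    for cell, c in counts.items():
        by_count[c].add(cell)            # set of region ids per count
    result = {}
    for cells in by_count.values():
        for cx, cy in cells:
            result[(cx, cy)] = [
                (cx + dx, cy + dy)
                for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                if (dx, dy) != (0, 0)
                and 0 <= cx + dx < grid and 0 <= cy + dy < grid
                and (cx + dx, cy + dy) in cells]   # same-count neighbor check
    return result
```

Storing each count group as a set makes the neighbor-membership check O(1) per neighbor, which is the point of the grouping.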

Python and Scala Time Comparison (in seconds)

| Problem | Scala | Python |
|---------|------:|-------:|
| 2.B     |    20 |     78 |
| 2.C     |    37 |    185 |
| 2.D     |    15 |     84 |

As the results above show, the Scala code clearly outperformed the Python implementation. The Scala code runs natively on the JVM, whereas the PySpark version must serialize records back and forth between the JVM and Python worker processes, so the Python code runs much slower.
