Project 3 for CS585/DS503: Big Data Management
Simple Python code generates the two datasets. PySpark in a Jupyter notebook was used for the queries.
The logic of the Python and Scala implementations is identical. The pseudocode for Problems 2.B, 2.C, and 2.D is outlined below, and the differences in execution times are listed.
The data generation process is carried out in Python with a generator, assuming a uniform distribution of points. To obtain a dataset of roughly 100 MB, 8,000,000 coordinates were generated in the (x,y) format.
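A minimal sketch of such a generator is shown below. The coordinate range (1 to 10,000) and the `x,y` line format are assumptions, as the original bounds are not stated.

```python
import random

def generate_points(n, max_coord=10000, seed=None):
    """Yield n uniformly distributed integer (x, y) coordinate pairs.

    The max_coord bound of 10,000 is an assumption, not taken from the project.
    """
    rng = random.Random(seed)
    for _ in range(n):
        yield rng.randint(1, max_coord), rng.randint(1, max_coord)

def write_dataset(path, n, max_coord=10000, seed=None):
    """Write n points to a file, one 'x,y' line per point."""
    with open(path, "w") as f:
        for x, y in generate_points(n, max_coord, seed):
            f.write(f"{x},{y}\n")
```

Using a generator keeps memory usage flat: points are produced and written one at a time rather than materialized as a list of 8,000,000 tuples.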
The SparkContext reads the data from the file, and a mapper function converts each line into a tuple. Regions are identified by a pair of coordinates instead of a single integer to simplify the neighborhood computations: cell 1 corresponds to (0,0), cell 2 to (1,0), and so on. The pipeline then proceeds as follows:

1. The region id of each point, in coordinate form, is computed with a map function.
2. The count for each region is computed with a reduceByKey, where the key is the region id.
3. For each region, a flatMap over the counts emits 9 replicas: one for the region itself and one for each of its 8 neighbors.
4. A combineByKey converts the data into a form suitable for the relative density computation: the first entry of each tuple is a region id, and the second is the list of counts within that region's neighborhood. A sample record at this point: `((154,349), ListBuffer(50.0, 38.0, 21.0, 23.0, 30.0, 25.0, 32.0, 32.0, 30.0))`.
5. The relative density of each region is computed as the mean of the neighborhood counts from the previous step.
6. The relative densities are sorted to retrieve the top k grid cells.
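The steps above can be sketched in plain Python (the project itself runs them as PySpark transformations). The cell size of 20 and the 1-based point coordinates are assumptions:

```python
from collections import defaultdict

CELL_SIZE = 20  # assumed side length of a grid cell (not stated in the project)

def region_of(x, y):
    """Map a point to its region id expressed as (col, row) coordinates."""
    return ((x - 1) // CELL_SIZE, (y - 1) // CELL_SIZE)

def relative_densities(points, top_k):
    """Return the top_k (region, relative density) pairs, densest first."""
    # Count points per region (the map + reduceByKey stages).
    counts = defaultdict(int)
    for x, y in points:
        counts[region_of(x, y)] += 1
    # Each region replicates its count to itself and its 8 neighbors
    # (the flatMap stage); the per-region lists mirror the combineByKey output.
    neighborhood = defaultdict(list)
    for (cx, cy), c in counts.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                neighborhood[(cx + dx, cy + dy)].append(c)
    # Relative density = mean of the collected neighborhood counts,
    # then sort descending and keep the top k cells.
    densities = {r: sum(cs) / len(cs) for r, cs in neighborhood.items()}
    return sorted(densities.items(), key=lambda kv: -kv[1])[:top_k]
```

The nested `dx`/`dy` loop plays the role of the 9-replica flatMap: each region's count is pushed to every cell in its 3x3 neighborhood, so a cell's list ends up holding the counts of its own neighborhood.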
The same operations as in Problem 2.C are used. A new RDD is created in which the relative density data is sorted by region id. For each region in the top k, the neighbors are computed, and lookup on the sorted data retrieves their relative densities.
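This lookup step can be sketched as follows, assuming a dict from region id to relative density stands in for the RDD sorted by region id:

```python
def neighbors_of(region):
    """Return the 8 neighboring region ids of a (col, row) region."""
    cx, cy = region
    return [(cx + dx, cy + dy)
            for dx in (-1, 0, 1) for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0)]

def neighbor_densities(top_regions, density_by_region):
    """For each top-k region, collect the relative densities of its neighbors.

    density_by_region is assumed to map region id -> relative density.
    """
    result = {}
    for region in top_regions:
        result[region] = [(n, density_by_region[n])
                          for n in neighbors_of(region)
                          if n in density_by_region]
    return result
```

The membership check skips neighbors that fall outside the grid or have no recorded density, mirroring a lookup that returns no match.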
The data is converted to a format where the first entry is the count and the second is the region id in coordinate form. The region ids are then grouped by count, so that each record holds a count and the set of region ids sharing it. POP-NEIGHBORS is computed by iterating over each such set and checking whether the neighbors of the region at hand are also in the set; every neighbor found in the set is saved in a list.
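A plain-Python sketch of this step, assuming `counts_by_region` maps a region id `(cx, cy)` to its point count:

```python
from collections import defaultdict

def pop_neighbors(counts_by_region):
    """For each region, list its neighboring regions with the same count."""
    # Group region ids into a set per count value (the group-by-count stage).
    regions_by_count = defaultdict(set)
    for region, count in counts_by_region.items():
        regions_by_count[count].add(region)
    result = {}
    for count, regions in regions_by_count.items():
        for (cx, cy) in regions:
            # Keep every neighbor that appears in the same count set.
            result[(cx, cy)] = [(cx + dx, cy + dy)
                                for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                                if (dx, dy) != (0, 0)
                                and (cx + dx, cy + dy) in regions]
    return result
```

Storing the region ids in a set makes each neighbor check an O(1) membership test instead of a scan over all regions with that count.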
| Problem | Scala | Python |
|---|---|---|
| 2.B | 20 | 78 |
| 2.C | 37 | 185 |
| 2.D | 15 | 84 |
As the results above show, the Scala code outperformed the Python implementation on every query. The Scala code runs natively on the JVM, where Spark itself executes, while PySpark must serialize data back and forth between the JVM and Python worker processes, so the Python code runs much slower.