- Spark
- Spark SQL
- Google Cloud Platform: Dataproc
Data can be downloaded from the NYC TLC website, where it is published as one file per month (January, February, March, ...).
- Trip record data: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
- Taxi zone lookup: https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv
- 12 files (1 per month)
- 9.3GB of data
- 112,494,978 rows in dataset
- Google Cloud Platform: Dataproc
- Scala
- Spark SQL
- Filtered null values
- Joined dataset with Taxi Zone from NYC TLC
- Creation of new columns
- Only pickup and dropoff from 2017
- Calculating time for queries
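A minimal sketch of the preparation steps above, assuming the trip data and taxi zone lookup have already been loaded into DataFrames named `trips` and `taxiZone`, and that the columns follow the standard NYC TLC yellow-taxi schema (`tpep_pickup_datetime`, `PULocationID`, etc.); adjust names to the actual files:

```scala
import org.apache.spark.sql.functions._

// Column names follow the NYC TLC yellow-taxi schema and are assumptions.
val cleaned = trips
  .na.drop()                                                       // filter null values
  .join(taxiZone, trips("PULocationID") === taxiZone("LocationID")) // join with Taxi Zone
  .withColumn("pickup_hour",  hour(col("tpep_pickup_datetime")))    // new columns
  .withColumn("pickup_month", month(col("tpep_pickup_datetime")))
  .filter(year(col("tpep_pickup_datetime"))  === 2017 &&            // 2017 only
          year(col("tpep_dropoff_datetime")) === 2017)
```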
Before running the code, specify the location of the 12 monthly CSV files and of the TaxiZone file. The path for the 2017 data is set in the variable "allmonths", and the path for the TaxiZone file in the "taxizone" variable.
The code must be run in the Spark interpreter with Scala.
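A sketch of how these two variables are used to load the data; the bucket paths are placeholders, and the reader options are assumptions (the files have headers, so schema inference is a reasonable default):

```scala
// Placeholder paths; point these at where the files actually live.
val allmonths = "gs://<bucket>/yellow_tripdata_2017-*.csv" // the 12 monthly files
val taxizone  = "gs://<bucket>/taxi+_zone_lookup.csv"      // taxi zone lookup

val trips = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(allmonths)

val taxiZone = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(taxizone)
```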
- Average distance travelled
- Average time travelled
- Efficiency based on distance over time
- Average passenger per ride
- Average tip per ride
- Average total paid per trip
- Total trips per hour
- Total trips per month
- Common pickup locations
- Average trips per day of week
- Average distance travelled by time of day
- Average distance by time of day per borough
- Average tips by time and day of the week
- Pickup location by season
- Percentage pickup and drop off per borough
- General drop off locations vs. Nightlife drop off locations
- Most common payment type per borough
- Most common location per payment type per borough
- Most common payment type according to hour
- Average gross revenue per hour
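As an illustration, one of the queries above (total trips per hour) can be sketched in Spark SQL as follows; the view name `trips` and the column name `tpep_pickup_datetime` are assumptions based on the standard yellow-taxi schema:

```scala
// Register the DataFrame as a temporary view so it can be queried with SQL.
trips.createOrReplaceTempView("trips")

val tripsPerHour = spark.sql("""
  SELECT hour(tpep_pickup_datetime) AS pickup_hour,
         COUNT(*)                   AS total_trips
  FROM trips
  GROUP BY hour(tpep_pickup_datetime)
  ORDER BY pickup_hour
""")
tripsPerHour.show(24)
```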
We evaluated the running time of every script by calling System.nanoTime before and after it. The differences were saved to a file, and the averages were calculated in Excel and added to the paper.
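The timing approach can be sketched as follows; the query inside the timed region is only a stand-in for any of the scripts above:

```scala
// Wrap a query in System.nanoTime calls and convert the difference to seconds.
val start = System.nanoTime()
val result = spark.sql("SELECT COUNT(*) FROM trips") // stand-in for a real query
result.show()                                        // force the query to execute
val elapsedSeconds = (System.nanoTime() - start) / 1e9
println(f"Query took $elapsedSeconds%.3f s")
```

Note that Spark is lazy, so an action such as `show()` or `count()` must run inside the timed region, otherwise only the query planning is measured.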