- Spark
- Spark SQL
- Google Cloud Platform: Dataproc
Data can be downloaded from the NYC TLC website, where it is published as one file per month (January, February, March, ...).
- Trip record data: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
- Taxi zone lookup: https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv
- 12 files (1 per month)
- 9.3GB of data
- 112,494,978 rows in dataset
- Google Cloud Platform: Dataproc
- Scala
- Spark SQL
- Filtered null values
- Joined dataset with Taxi Zone from NYC TLC
- Creation of new columns
- Only pickup and dropoff from 2017
- Calculating time for queries
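A minimal sketch of the preparation steps above, assuming the trip data and taxi zone lookup have already been loaded into DataFrames named `trips` and `taxiZone`, and that the columns follow the standard NYC TLC yellow-taxi schema (`tpep_pickup_datetime`, `PULocationID`, etc.); adjust names to the actual files:

```scala
import org.apache.spark.sql.functions._

// Column names follow the NYC TLC yellow-taxi schema and are assumptions.
val cleaned = trips
  .na.drop()                                                       // filter null values
  .join(taxiZone, trips("PULocationID") === taxiZone("LocationID")) // join with Taxi Zone
  .withColumn("pickup_hour",  hour(col("tpep_pickup_datetime")))    // new columns
  .withColumn("pickup_month", month(col("tpep_pickup_datetime")))
  .filter(year(col("tpep_pickup_datetime"))  === 2017 &&            // 2017 only
          year(col("tpep_dropoff_datetime")) === 2017)
```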
Before running the code, specify the location of the 12 monthly CSV files and of the TaxiZone file. The path for the 2017 data is set in the variable "allmonths", and the path for the TaxiZone file in the "taxizone" variable.
The code must be run in the Spark interpreter with Scala.
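A sketch of how these two variables are used to load the data; the bucket paths are placeholders, and the reader options are assumptions (the files have headers, so schema inference is a reasonable default):

```scala
// Placeholder paths; point these at where the files actually live.
val allmonths = "gs://<bucket>/yellow_tripdata_2017-*.csv" // the 12 monthly files
val taxizone  = "gs://<bucket>/taxi+_zone_lookup.csv"      // taxi zone lookup

val trips = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(allmonths)

val taxiZone = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(taxizone)
```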
- Average distance travelled
- Average time travelled
- Efficiency based on distance over time
- Average passenger per ride
- Average tip per ride
- Average total paid per trip
- Total trips per hour
- Total trips per month
- Common pickup locations
- Average trips per day of week
- Average distance travelled by time of day
- Average distance by time of day per borough
- Average tips by time and day of the week
- Pickup location by season
- Percentage pickup and drop off per borough
- General drop off locations vs. Nightlife drop off locations
- Most common payment type per borough
- Most common location per payment type per borough
- Most common payment type according to hour
- Average gross revenue per hour
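As an illustration, one of the queries above (total trips per hour) can be sketched in Spark SQL as follows; the view name `trips` and the column name `tpep_pickup_datetime` are assumptions based on the standard yellow-taxi schema:

```scala
// Register the DataFrame as a temporary view so it can be queried with SQL.
trips.createOrReplaceTempView("trips")

val tripsPerHour = spark.sql("""
  SELECT hour(tpep_pickup_datetime) AS pickup_hour,
         COUNT(*)                   AS total_trips
  FROM trips
  GROUP BY hour(tpep_pickup_datetime)
  ORDER BY pickup_hour
""")
tripsPerHour.show(24)
```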
We evaluated the running time of every script by calling System.nanoTime before and after it. The differences were saved to a file, and the averages were calculated in Excel and added to the paper.
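The timing approach can be sketched as follows; the query inside the timed region is only a stand-in for any of the scripts above:

```scala
// Wrap a query in System.nanoTime calls and convert the difference to seconds.
val start = System.nanoTime()
val result = spark.sql("SELECT COUNT(*) FROM trips") // stand-in for a real query
result.show()                                        // force the query to execute
val elapsedSeconds = (System.nanoTime() - start) / 1e9
println(f"Query took $elapsedSeconds%.3f s")
```

Note that Spark is lazy, so an action such as `show()` or `count()` must run inside the timed region, otherwise only the query planning is measured.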