# NYC Taxi Transportation Data Analysis

Big Data Analysis (CS-GY-9223) Course Project

## Members

| Name | NetID |
| --- | --- |
| Dingming Zhou | dz1108 |
| Xinyu Wu | xw1386 |
| Mufan Sang | ms9903 |

## Data Collection

Data source: NYC Taxi & Limousine Commission Trip Record Data

## Part I: Data Cleaning

**Important:** if you are using the dumbo cluster at NYU HPC, put the whole project directory under `/scratch/your-netid/`, since the data files exceed the storage quota under `/home/your-netid/`. Use `cd /scratch/your-netid/` to get there.

To download the raw data:

```
./download_raw_data.sh
```

Run the data cleaning process:

```
./data_clean.sh
```

This step creates the directories `./Data` and `./Datacleaned`, which hold the raw and cleaned data files respectively. These files are stored locally, not in HDFS.

For more data cleaning details, see DataInfo.md.
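The actual filtering rules live in `data_clean.sh` and DataInfo.md; purely as an illustration, a cleaning pass of this kind can be sketched with `awk`. The column positions and bounding box below are hypothetical, not the project's real schema:

```shell
#!/bin/sh
# Illustrative sketch only -- the real rules are in data_clean.sh.
# Assume (hypothetically) that fields 5 and 6 of each CSV row are the
# pickup longitude and latitude; keep rows inside a rough NYC bounding box.
printf '%s\n' \
  'id1,2015-01-01 00:11,2015-01-01 00:25,1,-73.98,40.75' \
  'id2,2015-01-02 09:00,2015-01-02 09:10,1,0.0,0.0' |
awk -F',' '$5 > -74.3 && $5 < -73.7 && $6 > 40.5 && $6 < 41.0'
```

The second row, with (0.0, 0.0) coordinates, falls outside the box and is dropped; such zeroed GPS fields are a common artifact in the TLC trip records.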

## Part II: Data Analysis

Push the cleaned data files into HDFS (this can take a moment):

```
hfs -put Datacleaned/.
```

After the upload, `hfs -ls Datacleaned` lists the loaded data files.
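For context: on dumbo, `hfs` and `hjs` are, to our understanding, shell shortcuts roughly equivalent to the following (verify with `alias` or `type` on the cluster; `$HADOOP_STREAMING_JAR` is a placeholder for the cluster's streaming jar path, not a variable we have confirmed):

```shell
# Assumed expansions of the dumbo shortcuts -- check on the cluster itself.
alias hfs='hadoop fs'                           # HDFS file operations
alias hjs='hadoop jar "$HADOOP_STREAMING_JAR"'  # Hadoop streaming jobs
```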
We run a sequence of Hadoop jobs rather than a single MapReduce pass: the first stage uses multiple reducers, so its combined output can contain duplicate keys, while pushing the whole job through one reducer would take too long. A second, single-reducer stage then merges and sorts the first stage's output.
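The duplicate-key issue can be seen in a toy simulation (plain shell, not the project's actual mapper/reducer): with several first-stage reducers, the same key may land in different partition files, and the single-reducer second pass is what merges and sorts them:

```shell
#!/bin/sh
# Toy simulation of the two-stage merge (not the project's real jobs).
# Stage 1 with two reducers can emit the same key in two partition files:
printf 'manhattan\t3\nqueens\t1\n'   > part-00000
printf 'manhattan\t2\nbrooklyn\t4\n' > part-00001
# Stage 2 (single reducer): one sorted stream with exactly one line per key.
cat part-00000 part-00001 |
awk -F'\t' '{sum[$1] += $2} END {for (k in sum) print k "\t" sum[k]}' |
sort -k1,1
rm -f part-00000 part-00001
```

Here `manhattan` appears in both partitions; the second pass collapses it to a single line with the summed count.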

To process the 2014 data:

```
hjs -files /scratch/YOUR-NETID/BigDataProject/Code -mapper Code/map.sh -reducer Code/reduce.sh -input Datacleaned/yellow_2014.csv -output temp2014.out
hfs -getmerge temp2014.out 2014.out
hfs -put 2014.out
```

Then run the second stage to make sure the results are sorted and the keys are distinct:

```
hjs -D mapreduce.job.reduces=1 -files /scratch/YOUR-NETID/BigDataProject/Code -mapper Code/map2.sh -reducer Code/reduce2.sh -input 2014.out -output final2014.out
```

Similarly, for 2015:

```
hjs -files /scratch/YOUR-NETID/BigDataProject/Code -mapper Code/map.sh -reducer Code/reduce.sh -input Datacleaned/yellow_2015.csv -output temp2015.out
hfs -getmerge temp2015.out 2015.out
hfs -put 2015.out
hjs -D mapreduce.job.reduces=1 -files /scratch/YOUR-NETID/BigDataProject/Code -mapper Code/map2.sh -reducer Code/reduce2.sh -input 2015.out -output final2015.out
```

and for 2016:

```
hjs -files /scratch/YOUR-NETID/BigDataProject/Code -mapper Code/map.sh -reducer Code/reduce.sh -input Datacleaned/yellow_2016.csv -output temp2016.out
hfs -getmerge temp2016.out 2016.out
hfs -put 2016.out
hjs -D mapreduce.job.reduces=1 -files /scratch/YOUR-NETID/BigDataProject/Code -mapper Code/map2.sh -reducer Code/reduce2.sh -input 2016.out -output final2016.out
```
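Since the three years repeat the same four commands, they can be wrapped in one loop. This is a sketch assuming the dumbo `hjs`/`hfs` shortcuts used throughout this README are available as commands; with the default `RUN=echo` it only prints the commands so you can review them, and setting `RUN=` executes them for real:

```shell
#!/bin/sh
# Sketch: run the two-stage pipeline for each year. Assumes the dumbo
# hjs/hfs shortcuts; RUN=echo (the default here) only prints the commands.
RUN="${RUN:-echo}"
CODE=/scratch/YOUR-NETID/BigDataProject/Code
for YEAR in 2014 2015 2016; do
  $RUN hjs -files "$CODE" -mapper Code/map.sh -reducer Code/reduce.sh \
       -input "Datacleaned/yellow_${YEAR}.csv" -output "temp${YEAR}.out"
  $RUN hfs -getmerge "temp${YEAR}.out" "${YEAR}.out"
  $RUN hfs -put "${YEAR}.out"
  $RUN hjs -D mapreduce.job.reduces=1 -files "$CODE" -mapper Code/map2.sh \
       -reducer Code/reduce2.sh -input "${YEAR}.out" -output "final${YEAR}.out"
done
```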

To generate the pickup location map, we use Hadoop to collect the data and MATLAB to draw it:

```
hjs -files /scratch/YOUR-NETID/BigDataProject/Code -mapper Code/map3.sh -reducer Code/reduce3.sh -input Datacleaned/yellow_2015.csv -output drawdata.out
hfs -getmerge drawdata.out drawdata1.out
hfs -put drawdata1.out
hjs -D mapreduce.job.reduces=1 -files /scratch/YOUR-NETID/BigDataProject/Code -mapper Code/map2.sh -reducer Code/reduce2.sh -input drawdata1.out -output draw.out
hfs -getmerge draw.out draw.out
```

Then use the code in the MatlabCode directory. We provide the same data in that directory, and the resulting map image in the PickupLocationMap directory.

## Progress Record

See ProgressRecord.md in the root directory.