# NYC Taxi Transportation Data Analysis

Big Data Analysis (CS-GY-9223) Course Project

## Members

| Name | NetID |
| --- | --- |
| Dingming Zhou | dz1108 |
| Xinyu Wu | xw1386 |
| Mufan Sang | ms9903 |

## Data Collection

Data source: NYC Taxi & Limousine Commission Trip Record Data

## Part I: Data Cleaning

**Important:** if you are using the dumbo cluster at NYU HPC, put the whole project directory under `/scratch/your-netid/`, since the data files exceed the storage quota under `/home/your-netid/`. Use `cd /scratch/your-netid/` to get there.

To download the raw data:

```
./download_raw_data.sh
```

Run the data cleaning process:

```
./data_clean.sh
```

This step creates the directories `./Data` and `./Datacleaned`, which hold the raw and cleaned data files respectively. These files are stored locally, not in HDFS.

For more data cleaning details, see DataInfo.md.
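The actual filtering rules live in `data_clean.sh` and DataInfo.md; purely as an illustration, a cleaning pass of this kind can be sketched with `awk`. The column positions and bounding box below are hypothetical, not the project's real schema:

```shell
#!/bin/sh
# Illustrative sketch only -- the real rules are in data_clean.sh.
# Assume (hypothetically) that fields 5 and 6 of each CSV row are the
# pickup longitude and latitude; keep rows inside a rough NYC bounding box.
printf '%s\n' \
  'id1,2015-01-01 00:11,2015-01-01 00:25,1,-73.98,40.75' \
  'id2,2015-01-02 09:00,2015-01-02 09:10,1,0.0,0.0' |
awk -F',' '$5 > -74.3 && $5 < -73.7 && $6 > 40.5 && $6 < 41.0'
```

The second row, with (0.0, 0.0) coordinates, falls outside the box and is dropped; such zeroed GPS fields are a common artifact in the TLC trip records.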

## Part II: Data Analysis

Push the cleaned data files into HDFS (this can take a moment):

```
hfs -put Datacleaned/.
```

After the upload, `hfs -ls Datacleaned` lists the loaded data files.
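For context: on dumbo, `hfs` and `hjs` are, to our understanding, shell shortcuts roughly equivalent to the following (verify with `alias` or `type` on the cluster; `$HADOOP_STREAMING_JAR` is a placeholder for the cluster's streaming jar path, not a variable we have confirmed):

```shell
# Assumed expansions of the dumbo shortcuts -- check on the cluster itself.
alias hfs='hadoop fs'                           # HDFS file operations
alias hjs='hadoop jar "$HADOOP_STREAMING_JAR"'  # Hadoop streaming jobs
```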
We run a sequence of Hadoop jobs rather than a single MapReduce pass: the first stage uses multiple reducers, so its combined output can contain duplicate keys, while pushing the whole job through one reducer would take too long. A second, single-reducer stage then merges and sorts the first stage's output.
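The duplicate-key issue can be seen in a toy simulation (plain shell, not the project's actual mapper/reducer): with several first-stage reducers, the same key may land in different partition files, and the single-reducer second pass is what merges and sorts them:

```shell
#!/bin/sh
# Toy simulation of the two-stage merge (not the project's real jobs).
# Stage 1 with two reducers can emit the same key in two partition files:
printf 'manhattan\t3\nqueens\t1\n'   > part-00000
printf 'manhattan\t2\nbrooklyn\t4\n' > part-00001
# Stage 2 (single reducer): one sorted stream with exactly one line per key.
cat part-00000 part-00001 |
awk -F'\t' '{sum[$1] += $2} END {for (k in sum) print k "\t" sum[k]}' |
sort -k1,1
rm -f part-00000 part-00001
```

Here `manhattan` appears in both partitions; the second pass collapses it to a single line with the summed count.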

To process the 2014 data:

```
hjs -files /scratch/YOUR-NETID/BigDataProject/Code -mapper Code/map.sh -reducer Code/reduce.sh -input Datacleaned/yellow_2014.csv -output temp2014.out
hfs -getmerge temp2014.out 2014.out
hfs -put 2014.out
```

Then run the second stage to make sure the results are sorted and the keys are distinct:

```
hjs -D mapreduce.job.reduces=1 -files /scratch/YOUR-NETID/BigDataProject/Code -mapper Code/map2.sh -reducer Code/reduce2.sh -input 2014.out -output final2014.out
```

Similarly, for 2015:

```
hjs -files /scratch/YOUR-NETID/BigDataProject/Code -mapper Code/map.sh -reducer Code/reduce.sh -input Datacleaned/yellow_2015.csv -output temp2015.out
hfs -getmerge temp2015.out 2015.out
hfs -put 2015.out
hjs -D mapreduce.job.reduces=1 -files /scratch/YOUR-NETID/BigDataProject/Code -mapper Code/map2.sh -reducer Code/reduce2.sh -input 2015.out -output final2015.out
```

and for 2016:

```
hjs -files /scratch/YOUR-NETID/BigDataProject/Code -mapper Code/map.sh -reducer Code/reduce.sh -input Datacleaned/yellow_2016.csv -output temp2016.out
hfs -getmerge temp2016.out 2016.out
hfs -put 2016.out
hjs -D mapreduce.job.reduces=1 -files /scratch/YOUR-NETID/BigDataProject/Code -mapper Code/map2.sh -reducer Code/reduce2.sh -input 2016.out -output final2016.out
```
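Since the three years repeat the same four commands, they can be wrapped in one loop. This is a sketch assuming the dumbo `hjs`/`hfs` shortcuts used throughout this README are available as commands; with the default `RUN=echo` it only prints the commands so you can review them, and setting `RUN=` executes them for real:

```shell
#!/bin/sh
# Sketch: run the two-stage pipeline for each year. Assumes the dumbo
# hjs/hfs shortcuts; RUN=echo (the default here) only prints the commands.
RUN="${RUN:-echo}"
CODE=/scratch/YOUR-NETID/BigDataProject/Code
for YEAR in 2014 2015 2016; do
  $RUN hjs -files "$CODE" -mapper Code/map.sh -reducer Code/reduce.sh \
       -input "Datacleaned/yellow_${YEAR}.csv" -output "temp${YEAR}.out"
  $RUN hfs -getmerge "temp${YEAR}.out" "${YEAR}.out"
  $RUN hfs -put "${YEAR}.out"
  $RUN hjs -D mapreduce.job.reduces=1 -files "$CODE" -mapper Code/map2.sh \
       -reducer Code/reduce2.sh -input "${YEAR}.out" -output "final${YEAR}.out"
done
```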

To generate the pickup location map, we use Hadoop to collect the data and MATLAB to draw it:

```
hjs -files /scratch/YOUR-NETID/BigDataProject/Code -mapper Code/map3.sh -reducer Code/reduce3.sh -input Datacleaned/yellow_2015.csv -output drawdata.out
hfs -getmerge drawdata.out drawdata1.out
hfs -put drawdata1.out
hjs -D mapreduce.job.reduces=1 -files /scratch/YOUR-NETID/BigDataProject/Code -mapper Code/map2.sh -reducer Code/reduce2.sh -input drawdata1.out -output draw.out
hfs -getmerge draw.out draw.out
```

Then use the code in the MatlabCode directory. We provide the same data in that directory, and the resulting map image in the PickupLocationMap directory.

## Progress Record

See ProgressRecord.md in the root directory.