Big Data Analysis (CS-GY-9223) Course Project
| Dingming Zhou | Xinyu Wu | Mufan Sang |
|---|---|---|
| dz1108 | xw1386 | ms9903 |
Data Source: NYC Taxi & Limousine Commission Trip Record Data
IMPORTANT: If you are using the dumbo cluster at NYU HPC, put the whole project directory under /scratch/your-netid/, since the data files exceed the storage quota under /home/your-netid/. You can switch there with

```
cd /scratch/your-netid/
```
```
./download_raw_data.sh
./data_clean.sh
```
These scripts create the directories ./Data and ./Datacleaned, which hold the raw and cleaned data files respectively. Note that these files are stored locally, not in HDFS. For more details on the data cleaning, see DataInfo.md.
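The real cleaning rules live in data_clean.sh and are documented in DataInfo.md; purely as an illustration of this kind of row filtering, a minimal awk filter might drop records with a missing or non-positive fare (the two-column layout here is made up for the example, not the real trip-record schema):

```shell
#!/bin/sh
# Toy input standing in for a trip-record CSV (NOT the real schema).
printf 'id,fare\n1,12.5\n2,-3\n3,\n' > sample.csv

# Keep the header plus rows whose fare field is a positive number;
# rows with a negative or empty fare are dropped.
cleaned=$(awk -F',' 'NR == 1 || ($2 ~ /^[0-9.]+$/ && $2 > 0)' sample.csv)
printf '%s\n' "$cleaned"
```

Only the row with fare 12.5 survives; the negative and empty fares are filtered out.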
Load the cleaned data into HDFS with

```
hfs -put Datacleaned/.
```

This can take a while. Once the upload finishes, you can list the loaded data files with

```
hfs -ls Datacleaned
```
We run a sequence of Hadoop jobs for the MapReduce step: a single-stage job either leaves duplicate keys in the output (when using several reducers) or takes far too long (when using a single reducer).
First run

```
hjs -files /scratch/YOUR-NETID/BigDataProject/Code -mapper Code/map.sh -reducer Code/reduce.sh -input Datacleaned/yellow_2014.csv -output temp2014.out
```

Then merge the partial results and push them back into HDFS:

```
hfs -getmerge temp2014.out 2014.out
hfs -put 2014.out
```

Finally, run the second stage with a single reducer to make sure the results are sorted and the keys are distinct:

```
hjs -D mapreduce.job.reduces=1 -files /scratch/YOUR-NETID/BigDataProject/Code -mapper Code/map2.sh -reducer Code/reduce2.sh -input 2014.out -output final2014.out
```
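The duplicate-key problem that the second pass fixes can be seen in miniature below (toy data; the real key/value format is produced by map.sh and reduce.sh):

```shell
#!/bin/sh
# Two hypothetical reducer part files from the first stage. Because keys
# are partitioned across reducers by hash, the same key "A" can appear
# in more than one part file, so the merged output has duplicates.
printf 'A\t3\nB\t1\n' > part-00000
printf 'A\t2\nC\t4\n' > part-00001
cat part-00000 part-00001 > merged.out

# One sorted pass through a single reducer (the role of the second hjs
# job) collapses duplicates into one summed, sorted record per key.
final=$(sort merged.out | awk -F'\t' '
    $1 == prev { sum += $2; next }
    prev != "" { print prev "\t" sum }
    { prev = $1; sum = $2 }
    END { if (prev != "") print prev "\t" sum }')
printf '%s\n' "$final"
```

After the pass, "A" appears once with the combined count, and the keys come out in sorted order.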
Similarly, for 2015:

```
hjs -files /scratch/YOUR-NETID/BigDataProject/Code -mapper Code/map.sh -reducer Code/reduce.sh -input Datacleaned/yellow_2015.csv -output temp2015.out
hfs -getmerge temp2015.out 2015.out
hfs -put 2015.out
hjs -D mapreduce.job.reduces=1 -files /scratch/YOUR-NETID/BigDataProject/Code -mapper Code/map2.sh -reducer Code/reduce2.sh -input 2015.out -output final2015.out
```
and for 2016:

```
hjs -files /scratch/YOUR-NETID/BigDataProject/Code -mapper Code/map.sh -reducer Code/reduce.sh -input Datacleaned/yellow_2016.csv -output temp2016.out
hfs -getmerge temp2016.out 2016.out
hfs -put 2016.out
hjs -D mapreduce.job.reduces=1 -files /scratch/YOUR-NETID/BigDataProject/Code -mapper Code/map2.sh -reducer Code/reduce2.sh -input 2016.out -output final2016.out
```
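The three per-year pipelines above differ only in the year, so if you prefer, they can be driven by a single loop (same commands, just parameterized; an untested sketch that assumes the hfs/hjs aliases available on dumbo):

```
for YEAR in 2014 2015 2016; do
    hjs -files /scratch/YOUR-NETID/BigDataProject/Code \
        -mapper Code/map.sh -reducer Code/reduce.sh \
        -input Datacleaned/yellow_${YEAR}.csv -output temp${YEAR}.out
    hfs -getmerge temp${YEAR}.out ${YEAR}.out
    hfs -put ${YEAR}.out
    hjs -D mapreduce.job.reduces=1 \
        -files /scratch/YOUR-NETID/BigDataProject/Code \
        -mapper Code/map2.sh -reducer Code/reduce2.sh \
        -input ${YEAR}.out -output final${YEAR}.out
done
```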
To generate the plotting data, run

```
hjs -files /scratch/YOUR-NETID/BigDataProject/Code -mapper Code/map3.sh -reducer Code/reduce3.sh -input Datacleaned/yellow_2015.csv -output drawdata.out
```

and then

```
hfs -getmerge drawdata.out drawdata1.out
hfs -put drawdata1.out
hjs -D mapreduce.job.reduces=1 -files /scratch/YOUR-NETID/BigDataProject/Code -mapper Code/map2.sh -reducer Code/reduce2.sh -input drawdata1.out -output draw.out
hfs -getmerge draw.out draw.out
```
Then use the code in the MatlabCode directory to draw the maps. We provide the same data in that directory, and the resulting map pictures are in the PickupLocationMap directory.
For the project progress record, see ProgressRecord.md in the root directory.