How to run the sample code:

First, obtain the data from the Kaggle website: https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/data

Put train.csv and test.csv in the same folder, e.g. /home/user/data/nyTaxyData/
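Before building, it can help to confirm that both files are where the job will look for them. The following is a minimal sketch, not part of the project; missing_data_files is a hypothetical helper name, and the only assumption is that the two files are named train.csv and test.csv.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): prints the name of each
# expected CSV file that is missing from the given folder.
missing_data_files() {
  dir="$1"
  for name in train.csv test.csv; do
    if [ ! -f "$dir/$name" ]; then
      echo "$name"
    fi
  done
}

# Example call; prints nothing when both files are present:
# missing_data_files /home/user/data/nyTaxyData/
```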

Compile the project and build the jar to be handed over to Spark:

sbt compile
sbt assembly

Note the location where the assembly jar is written, e.g. /home/user/scala-2.11/NyTaxiFare-assembly-1.0.jar
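If you are unsure where the jar ended up, a search along these lines may help. This is a sketch only: find_assembly_jar is a hypothetical helper name, and it assumes the jar follows the name-assembly-version.jar pattern seen in the example path above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): prints the path of the
# first jar under the given folder whose name looks like an assembly jar.
# Defaults to the current folder when no argument is given.
find_assembly_jar() {
  find "${1:-.}" -type f -name '*-assembly-*.jar' | head -n 1
}

# Example call from the project root:
# find_assembly_jar .
```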

To run in local mode, use the following command. Note that you have to pass some parameters as runtime arguments.

Parameters to pass, in this order:

Data location: e.g. /home/user/data/nyTaxyData/
Sample of the data to use: e.g. 0.5
Number of partitions before the LightGBM process starts: e.g. 16
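The three arguments can be sanity-checked before submitting the job. The sketch below is not part of the project; validate_args is a hypothetical helper name, and it assumes (this is not confirmed by the code) that the sample value is a fraction greater than 0 and at most 1 and that the partition count is a positive integer.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): returns 0 when the three
# runtime arguments look usable, and a non-zero status otherwise.
validate_args() {
  data_dir="$1"; sample="$2"; partitions="$3"

  # The data location must be an existing folder.
  [ -d "$data_dir" ] || { echo "data folder not found: $data_dir" >&2; return 1; }

  # Assumption: the sample is a plain decimal fraction in the range (0, 1].
  case "$sample" in
    ''|*[!0-9.]*|*.*.*) echo "sample is not a number: $sample" >&2; return 1 ;;
  esac
  awk -v s="$sample" 'BEGIN { exit !(s > 0 && s <= 1) }' \
    || { echo "sample out of range: $sample" >&2; return 1; }

  # Assumption: the partition count is a positive integer.
  case "$partitions" in
    ''|*[!0-9]*) echo "partitions is not an integer: $partitions" >&2; return 1 ;;
  esac
  [ "$partitions" -gt 0 ] || { echo "partitions must be positive" >&2; return 1; }
}

# Example call:
# validate_args /home/user/data/nyTaxyData/ 0.5 16 && echo "arguments look fine"
```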

I run into an issue with LightGBM when the sample size is set to 0.5 or larger.

$SPARK_HOME/bin/spark-submit \
  --class tnc.spark.ml.nytaxi.data.processMain \
  --master local[6] \
  --driver-memory 42g \
  /home/user/scala-2.11/NyTaxiFare-assembly-1.0.jar \
  /home/user/data/nyTaxyData/ \
  0.5 \
  16
