[ML-55] [CPU] Add Naive Bayes #68

Merged 29 commits on Jun 9, 2021
29 commits
c67e79e
Add NaiveBayes skeleton code
xwu99 May 11, 2021
c05893a
define ccl_root and ccl:gather
xwu99 May 11, 2021
e4984aa
Add NaiveBayesDALImpl scala,java & jni
xwu99 May 11, 2021
f4f2baa
Add doublesToNumericTables
xwu99 May 11, 2021
47f9b95
Add trainModel and trainingResult
xwu99 May 11, 2021
e2c2a87
Add getOneCCLIPPort to Utils
xwu99 May 11, 2021
c2fa328
Add return model, to be filled
xwu99 May 11, 2021
2a0624f
Fix format-cpp
xwu99 May 12, 2021
26e6d5b
Add numericTableNx1ToVector & numericTableToMatrix
xwu99 May 12, 2021
ef101c0
format code
xwu99 May 12, 2021
8fc9a4f
CSR support
xwu99 May 12, 2021
4cffed4
Add labeledPointsToMergedNumericTables, to be tested
xwu99 May 12, 2021
ea02438
format code
xwu99 May 13, 2021
8098c4f
Fix result return bug
xwu99 May 13, 2021
bfa7135
Add oldLabels to be compatible with mllib bayes
xwu99 May 13, 2021
057af90
Refactor and support convert to CSR table
xwu99 May 17, 2021
de55829
Add Profiler
xwu99 May 18, 2021
69a9a23
Fix profiler duration
xwu99 May 18, 2021
34d2159
Improve instrumentation
xwu99 May 19, 2021
a943d30
Add NaiveBayesExample
xwu99 May 19, 2021
4f34df1
todo: can't use merged table for csr, need to optimize csr data conve…
xwu99 May 20, 2021
93c1593
Optimize data conversion with dataset
xwu99 May 27, 2021
51e72e8
Add time measurement
xwu99 May 31, 2021
eb5bb67
optimize numericTableToMatrix & numericTableNx1ToVector
xwu99 May 31, 2021
b36e070
use fixed IP PORT
xwu99 May 31, 2021
af417c2
Add spark.oap.mllib.classification.classes & fix empty partition
xwu99 Jun 3, 2021
8973f58
Add rddLabeledPointToSparseTables & isDenseDataset
xwu99 Jun 8, 2021
8056989
Add rddLabeledPointToSparseTables_shuffle
xwu99 Jun 9, 2021
304a7ee
code cleanup
xwu99 Jun 9, 2021
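
Several of the commits above add CSR (compressed sparse row) support for sparse input. As a rough illustration of that layout only, the Java sketch below converts dense rows into the three CSR arrays: non-zero values, their column indices, and per-row offsets. `CsrSketch`, `fromDense` and the `base` parameter are hypothetical names and are not code from this PR. The one-based option is an assumption about the index base the native CSR tables expect.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of the CSR layout; not code from this PR. */
final class CsrSketch {
    final double[] values;   // non-zero entries, row by row
    final long[] colIndices; // column index of each non-zero entry
    final long[] rowOffsets; // position in values where each row starts, plus one end marker

    private CsrSketch(double[] values, long[] colIndices, long[] rowOffsets) {
        this.values = values;
        this.colIndices = colIndices;
        this.rowOffsets = rowOffsets;
    }

    /** Converts dense rows to CSR; base is 0 or 1 (one-based is assumed for the native tables). */
    static CsrSketch fromDense(double[][] rows, int base) {
        List<Double> vals = new ArrayList<>();
        List<Long> cols = new ArrayList<>();
        long[] offsets = new long[rows.length + 1];
        offsets[0] = base;
        for (int i = 0; i < rows.length; i++) {
            for (int j = 0; j < rows[i].length; j++) {
                if (rows[i][j] != 0.0) {
                    vals.add(rows[i][j]);
                    cols.add((long) (j + base));
                }
            }
            // An empty row simply repeats the previous offset.
            offsets[i + 1] = vals.size() + base;
        }
        double[] v = vals.stream().mapToDouble(Double::doubleValue).toArray();
        long[] c = cols.stream().mapToLong(Long::longValue).toArray();
        return new CsrSketch(v, c, offsets);
    }

    public static void main(String[] args) {
        CsrSketch csr = fromDense(new double[][] {{1, 0, 2}, {0, 0, 0}, {0, 3, 0}}, 1);
        System.out.println(Arrays.toString(csr.values));
        System.out.println(Arrays.toString(csr.colIndices));
        System.out.println(Arrays.toString(csr.rowOffsets));
    }
}
```

Only the non-zero entries are stored, which is why a sparse LIBSVM-style dataset may be much cheaper to hand to native code in this form than as a dense table.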
2 changes: 1 addition & 1 deletion dev/codestyle/format-cpp.sh
@@ -21,7 +21,7 @@ if [ -z $CLANG_FORMAT ]; then
     exit 1
 fi

-if [ -f .clang-format ]; then
+if [ ! -f .clang-format ]; then
     echo .clang-format is not found in current directory, please generate it.
     exit 1
 fi
3 changes: 3 additions & 0 deletions examples/naive-bayes/build.sh
@@ -0,0 +1,3 @@
#!/usr/bin/env bash

mvn clean package
94 changes: 94 additions & 0 deletions examples/naive-bayes/pom.xml
@@ -0,0 +1,94 @@
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.intel.oap</groupId>
    <artifactId>oap-mllib-examples</artifactId>
    <version>${oap.version}-with-spark-${spark.version}</version>
    <packaging>jar</packaging>

    <name>NaiveBayesExample</name>
    <url>https://github.com/oap-project/oap-mllib.git</url>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <oap.version>1.1.0</oap.version>
        <scala.version>2.12.10</scala.version>
        <scala.binary.version>2.12</scala.binary.version>
        <spark.version>3.0.0</spark.version>
    </properties>

    <dependencies>

        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.12.10</version>
        </dependency>

        <dependency>
            <groupId>com.github.scopt</groupId>
            <artifactId>scopt_2.12</artifactId>
            <version>3.7.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>${spark.version}</version>
            <scope>provided</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.12</artifactId>
            <version>${spark.version}</version>
            <scope>provided</scope>
        </dependency>

    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.15.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <scalaVersion>${scala.version}</scalaVersion>
                    <args>
                        <arg>-target:jvm-1.8</arg>
                    </args>
                </configuration>
            </plugin>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.0.0</version>
                <configuration>
                    <appendAssemblyId>false</appendAssemblyId>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

</project>
26 changes: 26 additions & 0 deletions examples/naive-bayes/run.sh
@@ -0,0 +1,26 @@
#!/usr/bin/env bash

source ../../conf/env.sh

APP_JAR=target/oap-mllib-examples-$OAP_MLLIB_VERSION-with-spark-3.0.0.jar
APP_CLASS=org.apache.spark.examples.ml.NaiveBayesExample
DATA_FILE=data/sample_libsvm_data.txt

time $SPARK_HOME/bin/spark-submit --master $SPARK_MASTER -v \
  --num-executors $SPARK_NUM_EXECUTORS \
  --driver-memory $SPARK_DRIVER_MEMORY \
  --executor-cores $SPARK_EXECUTOR_CORES \
  --executor-memory $SPARK_EXECUTOR_MEMORY \
  --conf "spark.oap.mllib.enabled=true" \
  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  --conf "spark.default.parallelism=$SPARK_DEFAULT_PARALLELISM" \
  --conf "spark.sql.shuffle.partitions=$SPARK_DEFAULT_PARALLELISM" \
  --conf "spark.driver.extraClassPath=$SPARK_DRIVER_CLASSPATH" \
  --conf "spark.executor.extraClassPath=$SPARK_EXECUTOR_CLASSPATH" \
  --conf "spark.shuffle.reduceLocality.enabled=false" \
  --conf "spark.network.timeout=1200s" \
  --conf "spark.task.maxFailures=1" \
  --jars $OAP_MLLIB_JAR \
  --class $APP_CLASS \
  $APP_JAR $DATA_FILE \
  2>&1 | tee NaiveBayes-$(date +%m%d_%H_%M_%S).log
@@ -0,0 +1,66 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

// scalastyle:off println
package org.apache.spark.examples.ml

// $example on$
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
// $example off$
import org.apache.spark.sql.SparkSession

object NaiveBayesExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("NaiveBayesExample")
      .getOrCreate()

    if (args.length != 1) {
      println("Require data file path as input parameter")
      sys.exit(1)
    }

    // $example on$
    // Load the data stored in LIBSVM format as a DataFrame.
    val data = spark.read.format("libsvm").load(args(0))

    // Split the data into training and test sets (30% held out for testing)
    val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3), seed = 1234L)

    // Train a NaiveBayes model.
    val model = new NaiveBayes()
      .fit(trainingData)

    // Select example rows to display.
    val predictions = model.transform(testData)
    predictions.show()

    // Select (prediction, true label) and compute test accuracy
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")
    val accuracy = evaluator.evaluate(predictions)
    println(s"Test set accuracy = $accuracy")
    // $example off$

    spark.stop()
  }
}
// scalastyle:on println
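
The example above reports the `accuracy` metric, which is the fraction of test rows whose predicted label equals the true label. The plain-Java sketch below shows that arithmetic on two small arrays. `AccuracySketch` is a hypothetical helper for illustration, not part of Spark or of this PR, and returning 0.0 for empty input is a choice made for this sketch only.

```java
/** Hypothetical illustration of the accuracy metric; not part of Spark or this PR. */
final class AccuracySketch {
    /** Returns the fraction of positions where prediction and label agree. */
    static double accuracy(double[] predictions, double[] labels) {
        if (predictions.length != labels.length) {
            throw new IllegalArgumentException("predictions and labels must have the same length");
        }
        if (predictions.length == 0) {
            return 0.0; // convention chosen for this sketch only
        }
        int correct = 0;
        for (int i = 0; i < predictions.length; i++) {
            // Class labels are stored as whole-number doubles, so exact comparison is adequate here.
            if (predictions[i] == labels[i]) {
                correct++;
            }
        }
        return (double) correct / predictions.length;
    }

    public static void main(String[] args) {
        double[] predictions = {0.0, 1.0, 1.0, 0.0};
        double[] labels = {0.0, 1.0, 0.0, 0.0};
        System.out.println("accuracy = " + accuracy(predictions, labels));
    }
}
```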
@@ -0,0 +1,6 @@
package org.apache.spark.ml.classification;

public class NaiveBayesResult {
    public long piNumericTable;
    public long thetaNumericTable;
}
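
The two fields above are named after the `pi` and `theta` parameters of Spark's `NaiveBayesModel`: the log class priors and the log conditional feature probabilities. The sketch below computes both for the multinomial model with additive smoothing on a tiny in-memory dataset. It assumes, without confirmation from this diff, that the `long` fields are native handles to tables holding these same quantities. `NaiveBayesParamsSketch` and its methods are hypothetical names, and the code does not call oneDAL or Spark.

```java
import java.util.Arrays;

/** Hypothetical illustration of multinomial naive Bayes parameters; not code from this PR. */
final class NaiveBayesParamsSketch {
    /** pi[c] = log((n_c + lambda) / (n + numClasses * lambda)), where n_c counts rows of class c. */
    static double[] logPriors(int[] labels, int numClasses, double lambda) {
        double[] counts = new double[numClasses];
        for (int label : labels) {
            counts[label] += 1.0;
        }
        double logDenom = Math.log(labels.length + numClasses * lambda);
        double[] pi = new double[numClasses];
        for (int c = 0; c < numClasses; c++) {
            pi[c] = Math.log(counts[c] + lambda) - logDenom;
        }
        return pi;
    }

    /** theta[c][j] = log((sum_cj + lambda) / (sum_c + numFeatures * lambda)). */
    static double[][] logLikelihoods(double[][] features, int[] labels, int numClasses, double lambda) {
        int numFeatures = features[0].length;
        // Per-class sum of each feature over the rows that carry that class label.
        double[][] sums = new double[numClasses][numFeatures];
        for (int i = 0; i < features.length; i++) {
            for (int j = 0; j < numFeatures; j++) {
                sums[labels[i]][j] += features[i][j];
            }
        }
        double[][] theta = new double[numClasses][numFeatures];
        for (int c = 0; c < numClasses; c++) {
            double total = Arrays.stream(sums[c]).sum();
            double logDenom = Math.log(total + numFeatures * lambda);
            for (int j = 0; j < numFeatures; j++) {
                theta[c][j] = Math.log(sums[c][j] + lambda) - logDenom;
            }
        }
        return theta;
    }

    public static void main(String[] args) {
        double[][] features = {{1, 0}, {2, 1}, {0, 3}};
        int[] labels = {0, 0, 1};
        System.out.println(Arrays.toString(logPriors(labels, 2, 1.0)));
        System.out.println(Arrays.deepToString(logLikelihoods(features, labels, 2, 1.0)));
    }
}
```

In this form `pi` has one entry per class and `theta` has one row per class and one column per feature, which matches the vector and matrix shapes that the commit list's `numericTableNx1ToVector` and `numericTableToMatrix` helpers suggest.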
2 changes: 0 additions & 2 deletions mllib-dal/src/main/native/ALSDALImpl.cpp
@@ -30,8 +30,6 @@ using namespace daal;
 using namespace daal::algorithms;
 using namespace daal::algorithms::implicit_als;

-const int ccl_root = 0;
-
 typedef float algorithmFPType; /* Algorithm floating-point type */

 NumericTablePtr userOffset;
2 changes: 0 additions & 2 deletions mllib-dal/src/main/native/KMeansDALImpl.cpp
@@ -27,8 +27,6 @@ using namespace std;
 using namespace daal;
 using namespace daal::algorithms;

-const int ccl_root = 0;
-
 typedef double algorithmFPType; /* Algorithm floating-point type */

 static NumericTablePtr kmeans_compute(int rankId, ccl::communicator &comm,
6 changes: 4 additions & 2 deletions mllib-dal/src/main/native/Makefile
@@ -37,13 +37,15 @@ CPP_SRCS += \
 ./OneCCL.cpp ./OneDAL.cpp ./service.cpp ./error_handling.cpp \
 ./KMeansDALImpl.cpp \
 ./PCADALImpl.cpp \
-./ALSDALImpl.cpp ./ALSShuffle.cpp
+./ALSDALImpl.cpp ./ALSShuffle.cpp \
+./NaiveBayesDALImpl.cpp

 OBJS += \
 ./OneCCL.o ./OneDAL.o ./service.o ./error_handling.o \
 ./KMeansDALImpl.o \
 ./PCADALImpl.o \
-./ALSDALImpl.o ./ALSShuffle.o
+./ALSDALImpl.o ./ALSShuffle.o \
+./NaiveBayesDALImpl.o

 # Output Binary
 OUTPUT = ../../../target/libMLlibDAL.so