Merge pull request #4 from HongW2019/doc-1.1
[OAP-MLlib-3]Update docs and add mkdocs.yml
zhixingheyi-tian authored Jan 12, 2021
2 parents 0841a6d + be95448 commit 3e0c31d
Showing 5 changed files with 266 additions and 87 deletions.
52 changes: 28 additions & 24 deletions README.md
@@ -1,31 +1,35 @@
# Intel MLlib
# OAP MLlib

## Overview

Intel MLlib is an optimized package to accelerate machine learning algorithms in [Apache Spark MLlib](https://spark.apache.org/mllib). It is compatible with Spark MLlib and leverages open source [Intel® oneAPI Data Analytics Library (oneDAL)](https://github.com/oneapi-src/oneDAL) to provide highly optimized algorithms and get most out of CPU and GPU capabilities. It also take advantage of open source [Intel® oneAPI Collective Communications Library (oneCCL)](https://github.com/oneapi-src/oneCCL) to provide efficient communication patterns in multi-node multi-GPU clusters.
OAP MLlib is an optimized package to accelerate machine learning algorithms in [Apache Spark MLlib](https://spark.apache.org/mllib). It is compatible with Spark MLlib and leverages the open source [Intel® oneAPI Data Analytics Library (oneDAL)](https://github.com/oneapi-src/oneDAL) to provide highly optimized algorithms and get the most out of CPU and GPU capabilities. It also takes advantage of the open source [Intel® oneAPI Collective Communications Library (oneCCL)](https://github.com/oneapi-src/oneCCL) to provide efficient communication patterns in multi-node multi-GPU clusters.

## Compatibility

Intel MLlib tried to maintain the same API interfaces and produce same results that are identical with Spark MLlib. However due to the nature of float point operations, there may be some small deviation from the original result, we will try our best to make sure the error is within acceptable range.
For those algorithms that are not accelerated by Intel MLlib, the original Spark MLlib one will be used.
OAP MLlib maintains the same API interfaces as Spark MLlib and aims to produce identical results. However, due to the nature of floating-point operations, there may be small deviations from the original results; we do our best to keep the error within an acceptable range.
For algorithms that are not accelerated by OAP MLlib, the original Spark MLlib implementation is used.

## Online Documentation

You can find all the OAP MLlib documentation on the [project web page](https://oap-project.github.io/oap-mllib/).

## Getting Started

### Java/Scala Users Preferred

Use a pre-built Intel MLlib JAR to get started. You can firstly download OAP package from [OAP-JARs-Tarball](https://github.com/Intel-bigdata/OAP/releases/download/v0.9.0-spark-3.0.0/oap-0.9.0-bin-spark-3.0.0.tar.gz) and extract this Tarball to get `oap-mllib-x.x.x-with-spark-x.x.x.jar` under `oap-0.9.0-bin-spark-3.0.0/jars`.
Use a pre-built OAP MLlib JAR to get started. First download the OAP package from [OAP-JARs-Tarball](https://github.com/Intel-bigdata/OAP/releases/download/v1.0.0-spark-3.0.0/oap-1.0.0-bin-spark-3.0.0.tar.gz) and extract the tarball to get `oap-mllib-x.x.x-with-spark-x.x.x.jar` under `oap-1.0.0-bin-spark-3.0.0/jars`.

Then you can refer to the following [Running](#Running) section to try out.
Then you can refer to the following [Running](#running) section to try it out.

### Python/PySpark Users Preferred

Use a pre-built JAR to get started. If you have finished [OAP-Installation-Guide](../docs/OAP-Installation-Guide.md), you can find compiled Intel MLlib JAR `oap-mllib-x.x.x-with-spark-x.x.x.jar` in `$HOME/miniconda2/envs/oapenv/oap_jars/`.
Use a pre-built JAR to get started. If you have finished the [OAP-Installation-Guide](./docs/OAP-Installation-Guide.md), you can find the compiled OAP MLlib JAR `oap-mllib-x.x.x-with-spark-x.x.x.jar` in `$HOME/miniconda2/envs/oapenv/oap_jars/`.

Then you can refer to the following [Running](#Running) section to try out.
Then you can refer to the following [Running](#running) section to try it out.

### Building From Scratch

You can also build the package from source code, please refer to [Building](#Building) section.
You can also build the package from source code; please refer to the [Building](#building) section.

## Running

@@ -37,7 +41,7 @@ You can also build the package from source code, please refer to [Building](#Bui

Generally, our common system requirements are the same as those of the Intel® oneAPI Toolkit; please refer to [here](https://software.intel.com/content/www/us/en/develop/articles/intel-oneapi-base-toolkit-system-requirements.html) for details.

Intel® oneAPI Toolkits (Beta) components used by the project are already included into JAR package mentioned above. There are no extra installations for cluster nodes.
The Intel® oneAPI Toolkits components used by the project are already included in the JAR package mentioned above; no extra installation is needed on cluster nodes.

### Spark Configuration

@@ -48,50 +52,50 @@ Users usually run Spark application on __YARN__ with __client__ mode. In that ca
spark.files /path/to/oap-mllib-x.x.x-with-spark-x.x.x.jar
# absolute path of the jar for driver class path
spark.driver.extraClassPath /path/to/oap-mllib-x.x.x-with-spark-x.x.x.jar
# relative path of the jar for executor class path
# relative path to spark.files, just specify jar name in current dir
spark.executor.extraClassPath ./oap-mllib-x.x.x-with-spark-x.x.x.jar
```
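If you prefer not to edit `spark-defaults.conf`, the same settings can be passed per job via `--conf` on the command line. This is a sketch only; the application class and jar names below are placeholders, not part of OAP MLlib:

```
$ spark-submit --master yarn --deploy-mode client \
    --conf "spark.files=/path/to/oap-mllib-x.x.x-with-spark-x.x.x.jar" \
    --conf "spark.driver.extraClassPath=/path/to/oap-mllib-x.x.x-with-spark-x.x.x.jar" \
    --conf "spark.executor.extraClassPath=./oap-mllib-x.x.x-with-spark-x.x.x.jar" \
    --class com.example.YourApp your-application.jar
```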

### Sanity Check

To use the K-means example as a sanity check, upload a data file to your HDFS and change the related variables in the example's `run.sh`. Then run the following commands:
```
$ cd OAP/oap-mllib/examples/kmeans
$ cd oap-mllib/examples/kmeans
$ ./build.sh
$ ./run.sh
```

### Benchmark with HiBench
Use [HiBench](https://github.com/Intel-bigdata/HiBench) to generate datasets with various profiles, and change the related variables in the `run-XXX.sh` script where applicable. Then run the following commands:
```
$ cd OAP/oap-mllib/examples/kmeans-hibench
$ cd oap-mllib/examples/kmeans-hibench
$ ./build.sh
$ ./run-hibench-oap-mllib.sh
```

### PySpark Support

As PySpark-based applications call their Scala couterparts, they shall be supported out-of-box. An example can be found in the [Examples](#Examples) section.
As PySpark-based applications call their Scala counterparts, they are supported out of the box. An example can be found in the [Examples](#examples) section.
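Because the heavy lifting happens on the JVM side, submitting a PySpark job only requires the same jar settings described in the Spark Configuration section; the script name below is hypothetical:

```
$ spark-submit --master yarn --deploy-mode client \
    --conf "spark.files=/path/to/oap-mllib-x.x.x-with-spark-x.x.x.jar" \
    --conf "spark.driver.extraClassPath=/path/to/oap-mllib-x.x.x-with-spark-x.x.x.jar" \
    --conf "spark.executor.extraClassPath=./oap-mllib-x.x.x-with-spark-x.x.x.jar" \
    your-kmeans-script.py
```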

## Building

### Prerequisites

We use [Apache Maven](https://maven.apache.org/) to manage and build source code. The following tools and libraries are also needed to build Intel MLlib:
We use [Apache Maven](https://maven.apache.org/) to manage and build source code. The following tools and libraries are also needed to build OAP MLlib:

* JDK 8.0+
* Apache Maven 3.6.2+
* GNU GCC 4.8.5+
* Intel® oneAPI Toolkits (Beta) 2021.1-beta07 Components:
* Intel® oneAPI Toolkits 2021.1.1 Components:
- Data Analytics Library (oneDAL)
- Threading Building Blocks (oneTBB)
* [Open Source Intel® oneAPI Collective Communications Library (oneCCL)](https://github.com/oneapi-src/oneCCL)

Intel® oneAPI Toolkits (Beta) and its components can be downloaded and install from [here](https://software.intel.com/content/www/us/en/develop/tools/oneapi.html). Installation process for oneAPI using Package Managers (YUM (DNF), APT, and ZYPPER) is also available. Generally you only need to install oneAPI Base Toolkit for Linux with all or selected components mentioned above. Instead of using oneCCL included in Intel® oneAPI Toolkits (Beta), we prefer to build from open source oneCCL to resolve some bugs.
Intel® oneAPI Toolkits and their components can be downloaded and installed from [here](https://software.intel.com/content/www/us/en/develop/tools/oneapi.html). Installation of oneAPI via package managers (YUM (DNF), APT, and ZYPPER) is also available. Generally you only need to install the oneAPI Base Toolkit for Linux with all or selected components mentioned above. Instead of using the oneCCL included in Intel® oneAPI Toolkits, we prefer to build from open source oneCCL to resolve some bugs.

More details abount oneAPI can be found [here](https://software.intel.com/content/www/us/en/develop/tools/oneapi.html).
More details about oneAPI can be found [here](https://software.intel.com/content/www/us/en/develop/tools/oneapi.html).

__Note: We have verified the building process based on oneAPI 2021.1-beta07. Due to default installation path change in 2021.1-beta08+, it will not work for 2021.1-beta08+. We will fix it soon. You can also refer to [this script and comments in it](https://github.com/Intel-bigdata/OAP/blob/master/oap-mllib/dev/install-build-deps-centos.sh) to install correct oneAPI version and manually setup the environments.__
You can also refer to [this script and comments in it](https://github.com/Intel-bigdata/OAP/blob/branch-1.0-spark-3.x/oap-mllib/dev/install-build-deps-centos.sh) to install correct oneAPI version and manually setup the environments.

Scala and Java dependency descriptions are already included in the Maven POM file.

@@ -103,23 +107,23 @@ To clone and build from open source oneCCL, run the following commands:
```
$ git clone https://github.com/oneapi-src/oneCCL
$ cd oneCCL
$ git checkout -b 2021.1-beta07-1 origin/2021.1-beta07-1
$ git checkout beta08
$ mkdir build && cd build
$ cmake ..
$ make -j install
```

The generated files will be placed in `/your/oneCCL_source_code/build/_install`
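To make the freshly built oneCCL visible to subsequent build steps, you can source the generated environment script (same `_install` path as above):

```
$ source /your/oneCCL_source_code/build/_install/env/setvars.sh
```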

#### Building Intel MLlib
#### Building OAP MLlib

To clone and checkout source code, run the following commands:
```
$ git clone https://github.com/Intel-bigdata/OAP
$ git clone https://github.com/oap-project/oap-mllib.git
```
__Optional__ to checkout specific release branch:
```
$ git checkout -b branch-0.9-spark-3.x origin/branch-0.9-spark-3.x
$ cd oap-mllib && git checkout ${version}
```

We rely on environment variables to find required toolchains and libraries. Please make sure the following environment variables are set for building:
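As a non-authoritative sketch, a typical setup looks like the following; the oneAPI version directory and install locations are examples from a default installation and may differ on your machine:

```
$ export ONEAPI_ROOT=/opt/intel/inteloneapi
$ source /opt/intel/inteloneapi/daal/2021.1-beta07/env/vars.sh
$ source /opt/intel/inteloneapi/tbb/2021.1-beta07/env/vars.sh
$ source /your/oneCCL_source_code/build/_install/env/setvars.sh
```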
@@ -144,7 +148,7 @@ If you prefer to build your own open source [oneDAL](https://github.com/oneapi-sr

To build, run the following commands:
```
$ cd OAP/oap-mllib/mllib-dal
$ cd oap-mllib/mllib-dal
$ ./build.sh
```

92 changes: 42 additions & 50 deletions docs/Developer-Guide.md → docs/OAP-Developer-Guide.md
@@ -1,15 +1,15 @@
# OAP Developer Guide

This document contains the instructions & scripts on installing necessary dependencies and building OAP.
You can get more detailed information from OAP each module blew.
You can get more detailed information from each OAP module below.

* [SQL Index and Data Source Cache](../oap-cache/oap/docs/Developer-Guide.md)
* [RDD Cache PMem Extension](../oap-spark/README.md#compiling)
* [Shuffle Remote PMem Extension](../oap-shuffle/RPMem-shuffle/README.md#5-install-dependencies-for-shuffle-remote-pmem-extension)
* [Remote Shuffle](../oap-shuffle/remote-shuffle/README.md#build-and-deploy)
* [Intel MLlib](../oap-mllib/README.md)
* [Unified Arrow Data Source](../oap-data-source/arrow/README.md)
* [Native SQL Engine](../oap-native-sql/README.md)
* [SQL Index and Data Source Cache](https://github.com/oap-project/sql-ds-cache/blob/master/docs/Developer-Guide.md)
* [PMem Common](https://github.com/oap-project/pmem-common)
* [PMem Shuffle](https://github.com/oap-project/pmem-shuffle#5-install-dependencies-for-shuffle-remote-pmem-extension)
* [Remote Shuffle](https://github.com/oap-project/remote-shuffle)
* [OAP MLlib](https://github.com/oap-project/oap-mllib)
* [Arrow Data Source](https://github.com/oap-project/arrow-data-source)
* [Native SQL Engine](https://github.com/oap-project/native-sql-engine)

## Building OAP

@@ -27,60 +27,64 @@ OAP is built with [Apache Maven](http://maven.apache.org/) and Oracle Java 8, an
- [Arrow](https://github.com/Intel-bigdata/arrow)

- **Requirements for Shuffle Remote PMem Extension**
If enable Shuffle Remote PMem extension with RDMA, you can refer to [Shuffle Remote PMem Extension Guide](../oap-shuffle/RPMem-shuffle/README.md) to configure and validate RDMA in advance.
If you enable the Shuffle Remote PMem extension with RDMA, you can refer to [PMem Shuffle](https://github.com/oap-project/pmem-shuffle) to configure and validate RDMA in advance.

We provide the scripts below to help automatically install the dependencies above **except RDMA**. You need to change to the **root** account and run:

```shell script
```
# git clone -b <tag-version> https://github.com/Intel-bigdata/OAP.git
# cd OAP
# sh dev/install-compile-time-dependencies.sh
# sh $OAP_HOME/dev/install-compile-time-dependencies.sh
```

Run the following command to learn more.

```shell script
```
# sh $OAP_HOME/dev/scripts/prepare_oap_env.sh --help
```

Run the following command to automatically install specific dependency such as Maven.

```shell script
```
# sh $OAP_HOME/dev/scripts/prepare_oap_env.sh --prepare_maven
```

***NOTE:*** If you use `install-compile-time-dependencies.sh` or `prepare_oap_env.sh` to install GCC, or your GCC is not installed in the default path, please ensure you have exported `CC` (and `CXX`) before calling Maven.
```shell script
# export CXX=$OAP_HOME/dev/thirdparty/gcc7/bin/g++
# export CC=$OAP_HOME/dev/thirdparty/gcc7/bin/gcc
```

### Building

To build OAP package, use
```shell script
To build the OAP package, run the command below; you can then find a tarball named `oap-$VERSION-bin-spark-$VERSION.tar.gz` under the directory `$OAP_HOME/dev/release-package`.
```
$ sh $OAP_HOME/dev/compile-oap.sh
#or
$ mvn clean -DskipTests package
```

### Building Specified Module
```shell script
To build a specified OAP module, such as `oap-cache`, run:
```
$ sh $OAP_HOME/dev/compile-oap.sh --oap-cache
#or
$ mvn clean -pl com.intel.oap:oap-cache -am package
```

### Running Test

To run all the tests, use
```shell script
### Running OAP Unit Tests

Set up the build environment manually for OAP MLlib. If your default GCC version is older than 7.0, you also need to export `CC` and `CXX` before using `mvn`; run:

```
$ export CXX=$OAP_HOME/dev/thirdparty/gcc7/bin/g++
$ export CC=$OAP_HOME/dev/thirdparty/gcc7/bin/gcc
$ export ONEAPI_ROOT=/opt/intel/inteloneapi
$ source /opt/intel/inteloneapi/daal/2021.1-beta07/env/vars.sh
$ source /opt/intel/inteloneapi/tbb/2021.1-beta07/env/vars.sh
$ source /tmp/oneCCL/build/_install/env/setvars.sh
```

Run all the tests:

```
$ mvn clean test
```

### Running Specified Module Test
To run the unit tests of a specified OAP module, such as `oap-cache`, run:

```shell script
```
$ mvn clean -pl com.intel.oap:oap-cache -am test
```
@@ -89,31 +93,19 @@ $ mvn clean -pl com.intel.oap:oap-cache -am test

#### Prerequisites for building with PMem support

When use SQL Index and Data Source Cache with PMem, finish steps of [Prerequisites for building](#Prerequisites-for-building) to ensure needed dependencies have been installed.
When using SQL Index and Data Source Cache with PMem, complete the steps in [Prerequisites for building](#prerequisites-for-building) to ensure the needed dependencies have been installed.

#### Building package

Add `-Ppersistent-memory` to build OAP with PMem support.

```shell script
$ mvn clean -q -Ppersistent-memory -DskipTests package
```
For `vmemcache` strategy, build OAP with command :
```shell script
$ mvn clean -q -Pvmemcache -DskipTests package
You can build OAP with PMem support with the command below:

```
You can build OAP with command below to use all of them:
```shell script
$ mvn clean -q -Ppersistent-memory -Pvmemcache -DskipTests package
$ sh $OAP_HOME/dev/compile-oap.sh
```
Or run:


### OAP Packaging

If you want to generate a release package after you mvn package all modules, use the following command, then you can find a tarball named `oap-$VERSION-bin-spark-3.0.0.tar.gz` under directory `OAP/dev/release-package `.

```shell script
$ sh $OAP_HOME/dev/compile-oap.sh
```
$ mvn clean -q -Ppersistent-memory -Pvmemcache -DskipTests package
```

## Contributing
23 changes: 10 additions & 13 deletions docs/OAP-Installation-Guide.md
@@ -25,35 +25,32 @@ For changes to take effect, close and re-open your current shell. To test your i

The dependencies below are required by OAP, and all of them are included in the OAP Conda package; they will be installed automatically in your cluster when you Conda-install OAP. Ensure you have activated the environment you created in the previous steps.

- [Arrow](https://github.com/Intel-bigdata/arrow)
- [Plasma](http://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/)
- [Memkind](https://anaconda.org/intel/memkind)
- [Vmemcache](https://anaconda.org/intel/vmemcache)
- [HPNL](https://anaconda.org/intel/hpnl)
- [PMDK](https://github.com/pmem/pmdk)
- [OneAPI](https://software.intel.com/content/www/us/en/develop/tools/oneapi.html)


Create a conda environment and install OAP Conda package.
```bash
$ conda create -n oapenv -y python=3.7
$ conda activate oapenv
$ conda install -c conda-forge -c intel -y oap=0.9.0
$ conda install -c conda-forge -c intel -y oap=1.0.0
```

Once you have finished the steps above, the OAP dependencies are installed and OAP is built; you will find the built OAP JARs under `$HOME/miniconda2/envs/oapenv/oap_jars`.

#### Extra Steps for Shuffle Remote PMem Extension

If you use one of OAP features -- [Shuffle Remote PMem Extension](../oap-shuffle/RPMem-shuffle/README.md), there are 2 points to note.

1. Shuffle Remote PMem Extension needs to install library [PMDK](https://github.com/pmem/pmdk) which we haven't provided in OAP Conda package, so you can run commands below to enable PMDK (Certain libraries need to be compiled and installed on your system using ***root*** account, so you need change to `root` account to run the following commands).

```
# git clone -b <tag-version> https://github.com/Intel-bigdata/OAP.git
# cd OAP/
# sh dev/install-runtime-dependencies.sh
```
2. If you also want to use Shuffle Remote PMem Extension with **RDMA**, you need to configure and validate RDMA, please refer to [Shuffle Remote PMem Extension Guide](../oap-shuffle/RPMem-shuffle/README.md#4-configure-and-validate-rdma) for the details.
If you use one of OAP's features -- [PMem Shuffle](https://github.com/oap-project/pmem-shuffle) -- with **RDMA**, you need to configure and validate RDMA; please refer to [PMem Shuffle](https://github.com/oap-project/pmem-shuffle#4-configure-and-validate-rdma) for details.


## Configuration
Once finished steps above, make sure libraries installed by Conda can be linked by Spark, please add the following configuration settings to `$SPARK_HOME/conf/spark-defaults` on the working node.

Once you have finished the steps above, make sure the libraries installed by Conda can be linked by Spark: add the following configuration settings to `$SPARK_HOME/conf/spark-defaults.conf`.

```
spark.executorEnv.LD_LIBRARY_PATH $HOME/miniconda2/envs/oapenv/lib
@@ -65,7 +62,7 @@ spark.driver.extraClassPath $HOME/miniconda2/envs/oapenv/oap_jars/$OAP_F

And then you can follow the corresponding feature documents for more details to use them.

* [OAP User Guide](../README.md#user-guide)




