
Commit

[GJ-13] refine Arrow building process and update docs (facebookincubator#17)

* add Arrow build

* refine docs
rui-mo authored Dec 21, 2021
1 parent fa2a4ec commit f6fe2fd
Showing 7 changed files with 217 additions and 63 deletions.
8 changes: 5 additions & 3 deletions README.md
@@ -56,19 +56,21 @@ Ideally if all native library can return arrow record batch, we can share much f

# How to use OAP: Gazelle-Jni

There are two ways to use OAP: Gazelle-Jni
There are two ways to build OAP: Gazelle-Jni
1. Building by Conda Environment
2. Building by Yourself

### Building by Conda
### Building by Conda (Recommended)

If you already have a working Hadoop Spark Cluster, we provide a Conda package which will automatically install the dependencies needed by OAP. You can refer to [OAP-Installation-Guide](./docs/OAP-Installation-Guide.md) for more information.

### Building by yourself

If you prefer to build from the source code by yourself, please follow the steps in [Installation Guide](./docs/GazelleJni.md) to set up your environment.

After that, if you would like to build Gazelle-Jni with **Velox** calculation, please check out the branch [velox_dev](https://github.com/oap-project/gazelle-jni/tree/velox_dev) and follow the steps in [Build with Velox](./docs/Velox.md) to install the needed libraries, compile Velox and try out the TPC-H Q6 test.
### Notes for Building Gazelle-Jni with Velox

After Gazelle-Jni has been successfully deployed in your environment, if you would like to build Gazelle-Jni with **Velox** computing, please check out the branch [velox_dev](https://github.com/oap-project/gazelle-jni/tree/velox_dev) and follow the steps in [Build with Velox](./docs/Velox.md) to install the needed libraries, compile Velox, and try out the TPC-H Q6 test.

# Contact

4 changes: 2 additions & 2 deletions docs/GazelleJni.md
@@ -176,7 +176,7 @@ Please note: If you choose to use libhdfs3.so, there are some other dependency

### Intel Optimized Apache Arrow Installation

During the mvn compile command, it will launch a script(build_arrow.sh) to help install and compile a Intel custom Arrow library.
During the mvn compile command, a script [build_arrow.sh](../tools/build_arrow.sh) will be launched to help install and compile an Intel custom Arrow library.
If you wish to build Apache Arrow by yourself, please follow the guide [ArrowInstallation](./ApacheArrowInstallation.md) to build and install Apache Arrow.

### Gazelle Jni
@@ -200,7 +200,7 @@ Based on the different environment, there are some parameters that can be set via -D
| velox_home | When building Gazelle-Jni with Velox, the location of Velox should be set. | /root/velox |

When build_arrow is set to True, build_arrow.sh will be launched to compile a custom Arrow library from [OAP Arrow](https://github.com/oap-project/arrow/tree/arrow-4.0.0-oap).
If you wish to change any parameters from Arrow, you can change it from the `build_arrow.sh` script under `native-sql-engine/arrow-data-source/script/`.
If you wish to change any Arrow parameters, you can do so in the [build_arrow.sh](../tools/build_arrow.sh) script.
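As a sketch, the parameters from the table above can be passed to Maven as `-D` flags. Only the flag names and defaults come from the table; the `clean package` goals are an assumption for illustration, not taken from this commit:

```shell
# Hypothetical mvn invocation; only the -D flag names and their default
# values come from the parameter table above.
MVN_FLAGS="-Dcpp_tests=OFF -Dbuild_arrow=ON -Dstatic_arrow=OFF -Dbuild_protobuf=ON -Darrow_root=/usr/local"
echo "mvn clean package ${MVN_FLAGS}"
```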

##### Configure the compiled jar to Spark

26 changes: 13 additions & 13 deletions docs/OAP-Installation-Guide.md
@@ -31,8 +31,17 @@ Create a Conda environment and install the OAP Conda package.
$ conda create -n oapenv -c conda-forge -c intel -y oap=1.2.0
```

Once finished steps above, you have completed OAP dependencies installation and OAP building.
The dependencies below are required by OAP. All of them are included in the OAP Conda package and will be installed automatically in your cluster when you Conda-install OAP. Ensure you have activated the environment you created in the previous steps.

- [Arrow](https://github.com/oap-project/arrow/tree/v4.0.0-oap-1.2.0)
- [Plasma](http://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/)
- [Memkind](https://github.com/memkind/memkind/tree/v1.10.1)
- [Vmemcache](https://github.com/pmem/vmemcache.git)
- [HPNL](https://anaconda.org/intel/hpnl)
- [PMDK](https://github.com/pmem/pmdk)
- [OneAPI](https://software.intel.com/content/www/us/en/develop/tools/oneapi.html)

Once the above steps finish, you have completed the installation of OAP dependencies and the OAP build.
You can start to build Gazelle-Jni in this environment.

##### Compile Gazelle Jni jar
@@ -53,7 +62,9 @@ Based on the different environment, there are some parameters that can be set via -D
| build_protobuf | Build Protobuf from Source. If set to False, default library path will be used to find protobuf library. | True |
| velox_home | When building Gazelle-Jni with Velox, the location of Velox should be set. | /root/velox |

When build_arrow set to True, the build_arrow.sh will be launched and compile a custom arrow library from [OAP Arrow](https://github.com/oap-project/arrow/tree/arrow-4.0.0-oap)
When build_arrow is set to True, build_arrow.sh will be launched to compile a custom Arrow library from [OAP Arrow](https://github.com/oap-project/arrow/tree/arrow-4.0.0-oap).

The velox_home option is useful only on the branch [velox_dev](https://github.com/oap-project/gazelle-jni/tree/velox_dev). If you would like to build Gazelle-Jni with Velox computing, please refer to [Build with Velox](Velox.md) for more information.

##### Configure the compiled jar to Spark

Expand All @@ -62,17 +73,6 @@ spark.driver.extraClassPath ${GAZELLE_JNI_HOME}/jvm/target/gazelle-jni-jvm-<vers
spark.executor.extraClassPath ${GAZELLE_JNI_HOME}/jvm/target/gazelle-jni-jvm-<version>-snapshot-jar-with-dependencies.jar
```

Dependencies below are required by OAP and all of them are included in OAP Conda package, they will be automatically installed in your cluster when you Conda install OAP. Ensure you have activated environment which you created in the previous steps.

- [Arrow](https://github.com/oap-project/arrow/tree/v4.0.0-oap-1.2.0)
- [Plasma](http://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/)
- [Memkind](https://github.com/memkind/memkind/tree/v1.10.1)
- [Vmemcache](https://github.com/pmem/vmemcache.git)
- [HPNL](https://anaconda.org/intel/hpnl)
- [PMDK](https://github.com/pmem/pmdk)
- [OneAPI](https://software.intel.com/content/www/us/en/develop/tools/oneapi.html)


#### Extra Steps for Shuffle Remote PMem Extension

If you use one of the OAP features -- [PMem Shuffle](https://github.com/oap-project/pmem-shuffle) with **RDMA** -- you need to configure and validate RDMA. Please refer to [PMem Shuffle](https://github.com/oap-project/pmem-shuffle#4-configure-and-validate-rdma) for the details.
61 changes: 45 additions & 16 deletions docs/Velox.md
@@ -1,26 +1,43 @@
## Velox

Please refer to [Velox Installation](https://github.com/facebookincubator/velox/blob/main/scripts/setup-ubuntu.sh) to install all the dependencies.
In general, please refer to [Velox Installation](https://github.com/facebookincubator/velox/blob/main/scripts/setup-ubuntu.sh) to install all the dependencies and compile Velox.
In addition, there are several points that deserve attention when compiling Gazelle-Jni with Velox.

Please Note that all the dependent libraries should be compiled as position independent code. That means, for static libraries, "-fPIC" option should be added during compilation.
Firstly, please note that all the libraries required by Gazelle-Jni should be compiled as **position independent code**.
That means, for static libraries, the "-fPIC" option should be added when they are compiled.

For Gazelle-Jni compiling, the required static libraries include:
Currently, Gazelle-Jni with Velox depends on the libraries below:

Required static libraries are:

- fmt
- folly
- iberty

The required shared libraries include:
Required shared libraries are:

- glog
- double-conversion
- gtest
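As a quick sanity check, one can ask the dynamic loader whether these shared libraries are visible. This is only a sketch and assumes a Linux system with `ldconfig` available:

```shell
# Report, for each required shared library from the list above,
# whether the runtime loader's cache can see it.
for lib in glog double-conversion gtest; do
  if ldconfig -p 2>/dev/null | grep -q "lib${lib}"; then
    echo "found lib${lib}"
  else
    echo "missing lib${lib}"
  fi
done
```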

When compiling Velox, please note that Velox generated static libraries should also be compiled as position independent code. Also, some OBJECT settings should be removed. For these two changes, please refer to this commit [Velox Compiling](https://github.com/rui-mo/velox/commit/b436af6b942b18e7f9dbd15c1e8eea49397e164a).
Gazelle-Jni will try to find the above libraries in the system lib paths.
If they are not installed there, please copy them to the system lib paths,
or change the search paths specified in [CMakeLists.txt](https://github.com/oap-project/gazelle-jni/blob/velox_dev/cpp/src/CMakeLists.txt).

```cmake
set(SYSTEM_LIB_PATH "/usr/lib" CACHE PATH "System Lib dir")
set(SYSTEM_LIB64_PATH "/usr/lib64" CACHE PATH "System Lib64 dir")
set(SYSTEM_LOCAL_LIB_PATH "/usr/local/lib" CACHE PATH "System Local Lib dir")
set(SYSTEM_LOCAL_LIB64_PATH "/usr/local/lib64" CACHE PATH "System Local Lib64 dir")
```
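Since these are CMake cache variables, they can also be overridden from the command line instead of editing the file. A sketch, with the `/opt` paths as placeholders (only the variable names come from the snippet above):

```shell
# Hypothetical out-of-source build; -D sets a CMake cache entry, which
# takes precedence over the set(... CACHE ...) defaults shown above.
CMAKE_ARGS="-DSYSTEM_LIB_PATH=/opt/lib -DSYSTEM_LOCAL_LIB_PATH=/opt/local/lib"
echo "cmake ${CMAKE_ARGS} .."
```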

Secondly, when compiling Velox, please note that the static libraries generated by Velox should also be compiled as position independent code.
Also, some OBJECT settings in the CMakeLists are removed in order to obtain the static libraries.
For these two changes, please refer to this commit: [Velox Compiling](https://github.com/rui-mo/velox/commit/b436af6b942b18e7f9dbd15c1e8eea49397e164a).

### An example of Velox computing in Spark based on Gazelle-Jni

TPC-H Q6 is supported in Gazelle-Jni base on Velox computing. Current support has several limitations:
TPC-H Q6 is supported in Gazelle-Jni based on Velox computing. The current support still has several limitations:

- Only Double type is supported.
- Only single-thread is supported.
@@ -29,6 +46,8 @@ TPC-H Q6 is supported in Gazelle-Jni based on Velox computing. Current support ha

#### Build Gazelle Jni with Velox

Please specify velox_home when compiling Gazelle-Jni with Velox.

``` shell
git clone -b velox_dev https://github.com/oap-project/gazelle-jni.git
cd gazelle-jni
@@ -45,21 +64,22 @@ Based on the different environment, there are some parameters that can be set via -D
| build_protobuf | Build Protobuf from Source. If set to False, default library path will be used to find protobuf library. | True |
| velox_home | When building Gazelle-Jni with Velox, the location of Velox should be set. | /root/velox |

When build_arrow set to True, the build_arrow.sh will be launched and compile a custom arrow library from [OAP Arrow](https://github.com/oap-project/arrow/tree/arrow-4.0.0-oap)
If you wish to change any parameters from Arrow, you can change it from the `build_arrow.sh` script under `native-sql-engine/arrow-data-source/script/`.
When build_arrow is set to True, build_arrow.sh will be launched to compile a custom Arrow library from [OAP Arrow](https://github.com/oap-project/arrow/tree/arrow-4.0.0-oap).
If you wish to change any Arrow parameters, you can do so in the [build_arrow.sh](../tools/build_arrow.sh) script.

#### test TPC-H Q6 on Gazelle-Jni with Velox computing
#### Test TPC-H Q6 on Gazelle-Jni with Velox computing

##### Data preparation

The ORC table can be generated by conversion from the Parquet format.
- Considering only Hive LRE V1 is supported in Velox, below Spark option was adopted when generating ORC data.
Considering that only Hive RLE V1 is supported in Velox, the Spark option below was adopted when generating the ORC data.

```shell script
--conf spark.hive.exec.orc.write.format=0.11
```

- Considering Velox's support for Decimal, Date, Long types are not fully ready, the related columns of TPC-H Q6 were all transformed into Double type.
Considering that Velox's support for the Decimal, Date, and Long types is not fully ready, the related columns of TPC-H Q6 were all transformed into Double type.
The script below shows how to convert Parquet into ORC format and transform the TPC-H Q6 related columns into Double type.
To align with this data type change, the TPC-H Q6 query was changed accordingly.

```shell script
for (filePath <- fileLists) {
@@ -79,17 +99,26 @@ spark.executor.extraClassPath ${GAZELLE_JNI_HOME}/jvm/target/gazelle-jni-jvm-<ve

##### Submit the Spark SQL job

The modified TPC-H Q6 query is:

```sql
select sum(l_extendedprice * l_discount) as revenue from lineitem where l_shipdate_new >= 8766 and l_shipdate_new < 9131 and l_discount between .06 - 0.01 and .06 + 0.01 and l_quantity < 24
```
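The integer bounds in the predicate come from representing `l_shipdate` as days since the Unix epoch: the original Q6 filter is `l_shipdate >= date '1994-01-01' and l_shipdate < date '1995-01-01'`. A quick check of the constants, assuming GNU `date` is available:

```shell
# Days since 1970-01-01 (UTC) for a given calendar date; GNU date assumed.
days_since_epoch() { echo $(( $(date -u -d "$1" +%s) / 86400 )); }
days_since_epoch 1994-01-01   # 8766
days_since_epoch 1995-01-01   # 9131
```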
The script below, saved as `tpch_q6.scala`, shows how to read the ORC data and submit the modified TPC-H Q6 query.
```scala
val lineitem = spark.read.format("orc").load("file:///mnt/lineitem_orcs")
lineitem.createOrReplaceTempView("lineitem")
// The modified TPC-H Q6 query
time{spark.sql("select sum(l_extendedprice*l_discount) as revenue from lineitem where l_shipdate_new >= 8766 and l_shipdate_new < 9131 and l_discount between .06 - 0.01 and .06 + 0.01 and l_quantity < 24").show}
time{spark.sql("select sum(l_extendedprice * l_discount) as revenue from lineitem where l_shipdate_new >= 8766 and l_shipdate_new < 9131 and l_discount between .06 - 0.01 and .06 + 0.01 and l_quantity < 24").show}
```
Submit test script from spark-shell
Submit the test script from spark-shell. Please note that only single-threaded execution is currently supported.
```shell script
cat tpch_q6.scala | spark-shell --name tpch_col_q6 --num-executors 24 --driver-memory 20g --executor-memory 25g --executor-cores 6 --master local --deploy-mode client --conf spark.executor.memoryOverhead=5g --conf spark.plugins=com.intel.oap.GazellePlugin --conf spark.driver.extraClassPath=${dep_jar} --conf spark.executor.extraClassPath=${dep_jar} --conf spark.sql.join.preferSortMergeJoin=false --conf spark.sql.inMemoryColumnarStorage.batchSize=${batchsize} --conf spark.sql.execution.arrow.maxRecordsPerBatch=${batchsize} --conf spark.sql.parquet.columnarReaderBatchSize=${batchsize} --conf spark.sql.autoBroadcastJoinThreshold=30M --conf spark.sql.broadcastTimeout=300 --conf spark.sql.crossJoin.enabled=true --conf spark.driver.maxResultSize=32g --conf spark.sql.shuffle.partitions=200 --conf spark.memory.offHeap.enabled=false --conf spark.memory.offHeap.size=20g --conf spark.sql.adaptive.enabled=true --conf spark.kryoserializer.buffer.max=512m --conf spark.kryoserializer.buffer=128m --conf spark.oap.sql.columnar.preferColumnar=true --conf spark.sql.columnar.sort.broadcast.cache.timeout=300 --conf spark.oap.sql.columnar.shuffle.customizedCompression.codec=lz4 --conf spark.executorEnv.LIBARROW_DIR=/usr/local/lib64 --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager --conf spark.oap.sql.columnar.wholestagecodegen=true --conf spark.sql.files.maxPartitionBytes=1280000000
cat tpch_q6.scala | spark-shell --name tpch_velox_q6 --master local --conf spark.plugins=com.intel.oap.GazellePlugin --conf spark.driver.extraClassPath=${gazelle_jvm_jar} --conf spark.executor.extraClassPath=${gazelle_jvm_jar} --conf spark.oap.sql.columnar.preferColumnar=true --conf spark.oap.sql.columnar.wholestagecodegen=true --conf spark.memory.offHeap.size=20g
```
##### Result
@@ -98,7 +127,7 @@ cat tpch_q6.scala | spark-shell --name tpch_col_q6 --num-executors 24 --driver-m
##### Performance
Below show the TPC-H Q6 Performance comparison in this single-thread test.
The table below shows the TPC-H Q6 performance in this single-thread test for Velox and vanilla Spark.
| TPC-H Q6 Performance | Velox | Vanilla Spark on Parquet Data | Vanilla Spark on ORC Data |
| ---------- | ----------- | ------------- | ------------- |
71 changes: 44 additions & 27 deletions jvm/pom.xml
@@ -36,13 +36,13 @@
<hive.parquet.group>com.twitter</hive.parquet.group>
<parquet.deps.scope>provided</parquet.deps.scope>
<jars.target.dir>${project.build.directory}/scala-${scala.binary.version}/jars</jars.target.dir>
<nativesql.cpp_tests>${cpp_tests}</nativesql.cpp_tests>
<nativesql.build_arrow>OFF</nativesql.build_arrow>
<nativesql.static_arrow>${static_arrow}</nativesql.static_arrow>
<nativesql.arrow.bfs.install.dir>${project.basedir}/../../arrow-data-source/script/build/arrow_install</nativesql.arrow.bfs.install.dir>
<nativesql.arrow_root>${arrow_root}</nativesql.arrow_root>
<nativesql.build_protobuf>${build_protobuf}</nativesql.build_protobuf>
<nativesql.build_jemalloc>${build_jemalloc}</nativesql.build_jemalloc>
<jvm.cpp_tests>${cpp_tests}</jvm.cpp_tests>
<jvm.build_arrow>OFF</jvm.build_arrow>
<jvm.static_arrow>${static_arrow}</jvm.static_arrow>
<jvm.arrow.bfs.install.dir>${project.basedir}/../tools/build/arrow_install</jvm.arrow.bfs.install.dir>
<jvm.arrow_root>${arrow_root}</jvm.arrow_root>
<jvm.build_protobuf>${build_protobuf}</jvm.build_protobuf>
<jvm.build_jemalloc>${build_jemalloc}</jvm.build_jemalloc>
<!-- protobuf paths -->
<protobuf.input.directory>${project.basedir}/src/main/java/com/intel/oap/substrait/binary</protobuf.input.directory>
<protobuf.output.directory>${project.build.directory}/generated-sources</protobuf.output.directory>
@@ -351,26 +351,43 @@
<groupId>org.codehaus.mojo</groupId>
<version>1.6.0</version>
<executions>
<execution>
<id>Build cpp</id>
<phase>generate-resources</phase>
<goals>
<goal>exec</goal>
</goals>
<configuration>
<executable>bash</executable>
<arguments>
<argument>${cpp.dir}/compile.sh</argument>
<argument>${nativesql.cpp_tests}</argument>
<argument>${nativesql.build_arrow}</argument>
<argument>${nativesql.static_arrow}</argument>
<argument>${nativesql.build_protobuf}</argument>
<argument>${nativesql.arrow_root}</argument>
<argument>${nativesql.arrow.bfs.install.dir}</argument>
<argument>${nativesql.build_jemalloc}</argument>
</arguments>
</configuration>
</execution>
<execution>
<id>Build arrow</id>
<phase>generate-resources</phase>
<goals>
<goal>exec</goal>
</goals>
<configuration>
<executable>bash</executable>
<arguments>
<argument>${arrow.script.dir}/build_arrow.sh</argument>
<argument>--tests=${cpp_tests}</argument>
<argument>--build_arrow=${build_arrow}</argument>
<argument>--static_arrow=${static_arrow}</argument>
<argument>--arrow_root=${arrow_root}</argument>
</arguments>
</configuration>
</execution>
<execution>
<id>Build cpp</id>
<phase>generate-resources</phase>
<goals>
<goal>exec</goal>
</goals>
<configuration>
<executable>bash</executable>
<arguments>
<argument>${cpp.dir}/compile.sh</argument>
<argument>${jvm.cpp_tests}</argument>
<argument>${jvm.build_arrow}</argument>
<argument>${jvm.static_arrow}</argument>
<argument>${jvm.build_protobuf}</argument>
<argument>${jvm.arrow_root}</argument>
<argument>${jvm.arrow.bfs.install.dir}</argument>
<argument>${jvm.build_jemalloc}</argument>
</arguments>
</configuration>
</execution>
</executions>
</plugin>
<!-- copy protoc binary into build directory -->
3 changes: 1 addition & 2 deletions pom.xml
@@ -118,11 +118,10 @@
<hadoop.version>${hadoop.version}</hadoop.version>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
<arrow.script.dir>${project.basedir}/script</arrow.script.dir>
<arrow.script.dir>${project.basedir}/../tools</arrow.script.dir>
<cpp_tests>OFF</cpp_tests>
<build_arrow>ON</build_arrow>
<static_arrow>OFF</static_arrow>
<arrow.install.dir>${arrow.script.dir}/build/arrow_install</arrow.install.dir>
<arrow_root>/usr/local</arrow_root>
<build_protobuf>ON</build_protobuf>
<build_jemalloc>ON</build_jemalloc>
