
Commit

[GJ-13] refine Arrow building process and update docs (facebookincubator#17)

* add Arrow build

* refine docs
rui-mo authored Dec 21, 2021
1 parent fa2a4ec commit f6fe2fd
Showing 7 changed files with 217 additions and 63 deletions.
8 changes: 5 additions & 3 deletions README.md
@@ -56,19 +56,21 @@ Ideally if all native library can return arrow record batch, we can share much f

# How to use OAP: Gazelle-Jni

There are two ways to use OAP: Gazelle-Jni
There are two ways to build OAP: Gazelle-Jni
1. Building by Conda Environment
2. Building by Yourself

### Building by Conda
### Building by Conda (Recommended)

If you already have a working Hadoop Spark Cluster, we provide a Conda package which will automatically install the dependencies needed by OAP. You can refer to [OAP-Installation-Guide](./docs/OAP-Installation-Guide.md) for more information.

### Building by yourself

If you prefer to build from the source code by yourself, please follow the steps in [Installation Guide](./docs/GazelleJni.md) to set up your environment.

After that, if you would like to build Gazelle-Jni with **Velox** calculation, please check out the branch [velox_dev](https://github.com/oap-project/gazelle-jni/tree/velox_dev) and follow the steps in [Build with Velox](./docs/Velox.md) to install the needed libraries, compile Velox and try out the TPC-H Q6 test.
### Notes for Building Gazelle-Jni with Velox

After Gazelle-Jni has been successfully deployed in your environment, if you would like to build Gazelle-Jni with **Velox** computing, please check out the branch [velox_dev](https://github.com/oap-project/gazelle-jni/tree/velox_dev) and follow the steps in [Build with Velox](./docs/Velox.md) to install the needed libraries, compile Velox, and try out the TPC-H Q6 test.

# Contact

4 changes: 2 additions & 2 deletions docs/GazelleJni.md
@@ -176,7 +176,7 @@ Please note: If you choose to use libhdfs3.so, there are some other dependency

### Intel Optimized Apache Arrow Installation

During the mvn compile command, it will launch a script(build_arrow.sh) to help install and compile a Intel custom Arrow library.
During the mvn compile command, a script [build_arrow.sh](../tools/build_arrow.sh) will be launched to help install and compile an Intel custom Arrow library.
If you wish to build Apache Arrow by yourself, please follow the guide [ArrowInstallation](./ApacheArrowInstallation.md) to build and install Apache Arrow.

### Gazelle Jni
@@ -200,7 +200,7 @@ Based on the different environment, there are some parameters that can be set via -D
| velox_home | When building Gazelle-Jni with Velox, the location of Velox should be set. | /root/velox |

When build_arrow is set to True, build_arrow.sh will be launched to compile a custom Arrow library from [OAP Arrow](https://github.com/oap-project/arrow/tree/arrow-4.0.0-oap).
If you wish to change any parameters from Arrow, you can change it from the `build_arrow.sh` script under `native-sql-engine/arrow-data-source/script/`.
If you wish to change any Arrow parameters, you can do so in the [build_arrow.sh](../tools/build_arrow.sh) script.
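As a sketch, the parameters from the table above can be passed to Maven as `-D` flags. Only the flag names and defaults come from the table; the `clean package` goals are an assumption for illustration, not taken from this commit:

```shell
# Hypothetical mvn invocation; only the -D flag names and their default
# values come from the parameter table above.
MVN_FLAGS="-Dcpp_tests=OFF -Dbuild_arrow=ON -Dstatic_arrow=OFF -Dbuild_protobuf=ON -Darrow_root=/usr/local"
echo "mvn clean package ${MVN_FLAGS}"
```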

##### Configure the compiled jar to Spark

26 changes: 13 additions & 13 deletions docs/OAP-Installation-Guide.md
@@ -31,8 +31,17 @@ Create a Conda environment and install the OAP Conda package.
$ conda create -n oapenv -c conda-forge -c intel -y oap=1.2.0
```

Once finished steps above, you have completed OAP dependencies installation and OAP building.
The dependencies below are required by OAP. All of them are included in the OAP Conda package and will be installed automatically in your cluster when you Conda-install OAP. Ensure you have activated the environment you created in the previous steps.

- [Arrow](https://github.com/oap-project/arrow/tree/v4.0.0-oap-1.2.0)
- [Plasma](http://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/)
- [Memkind](https://github.com/memkind/memkind/tree/v1.10.1)
- [Vmemcache](https://github.com/pmem/vmemcache.git)
- [HPNL](https://anaconda.org/intel/hpnl)
- [PMDK](https://github.com/pmem/pmdk)
- [OneAPI](https://software.intel.com/content/www/us/en/develop/tools/oneapi.html)

Once the above steps finish, you have completed the installation of OAP dependencies and the OAP build.
You can start to build Gazelle-Jni in this environment.

##### Compile Gazelle Jni jar
@@ -53,7 +62,9 @@ Based on the different environment, there are some parameters that can be set via -D
| build_protobuf | Build Protobuf from Source. If set to False, default library path will be used to find protobuf library. | True |
| velox_home | When building Gazelle-Jni with Velox, the location of Velox should be set. | /root/velox |

When build_arrow set to True, the build_arrow.sh will be launched and compile a custom arrow library from [OAP Arrow](https://github.com/oap-project/arrow/tree/arrow-4.0.0-oap)
When build_arrow is set to True, build_arrow.sh will be launched to compile a custom Arrow library from [OAP Arrow](https://github.com/oap-project/arrow/tree/arrow-4.0.0-oap).

The velox_home option is useful only on the branch [velox_dev](https://github.com/oap-project/gazelle-jni/tree/velox_dev). If you would like to build Gazelle-Jni with Velox computing, please refer to [Build with Velox](Velox.md) for more information.

##### Configure the compiled jar to Spark

Expand All @@ -62,17 +73,6 @@ spark.driver.extraClassPath ${GAZELLE_JNI_HOME}/jvm/target/gazelle-jni-jvm-<vers
spark.executor.extraClassPath ${GAZELLE_JNI_HOME}/jvm/target/gazelle-jni-jvm-<version>-snapshot-jar-with-dependencies.jar
```

Dependencies below are required by OAP and all of them are included in OAP Conda package, they will be automatically installed in your cluster when you Conda install OAP. Ensure you have activated environment which you created in the previous steps.

- [Arrow](https://github.com/oap-project/arrow/tree/v4.0.0-oap-1.2.0)
- [Plasma](http://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/)
- [Memkind](https://github.com/memkind/memkind/tree/v1.10.1)
- [Vmemcache](https://github.com/pmem/vmemcache.git)
- [HPNL](https://anaconda.org/intel/hpnl)
- [PMDK](https://github.com/pmem/pmdk)
- [OneAPI](https://software.intel.com/content/www/us/en/develop/tools/oneapi.html)


#### Extra Steps for Shuffle Remote PMem Extension

If you use one of the OAP features -- [PMem Shuffle](https://github.com/oap-project/pmem-shuffle) with **RDMA** -- you need to configure and validate RDMA. Please refer to [PMem Shuffle](https://github.com/oap-project/pmem-shuffle#4-configure-and-validate-rdma) for the details.
61 changes: 45 additions & 16 deletions docs/Velox.md
@@ -1,26 +1,43 @@
## Velox

Please refer to [Velox Installation](https://github.com/facebookincubator/velox/blob/main/scripts/setup-ubuntu.sh) to install all the dependencies.
In general, please refer to [Velox Installation](https://github.com/facebookincubator/velox/blob/main/scripts/setup-ubuntu.sh) to install all the dependencies and compile Velox.
In addition, there are several points that deserve attention when compiling Gazelle-Jni with Velox.

Please Note that all the dependent libraries should be compiled as position independent code. That means, for static libraries, "-fPIC" option should be added during compilation.
Firstly, please note that all the libraries required by Gazelle-Jni should be compiled as **position independent code**.
That means, for static libraries, the "-fPIC" option should be added when they are compiled.

For Gazelle-Jni compiling, the required static libraries include:
Currently, Gazelle-Jni with Velox depends on the libraries below:

Required static libraries are:

- fmt
- folly
- iberty

The required shared libraries include:
Required shared libraries are:

- glog
- double-conversion
- gtest
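As a quick sanity check, one can ask the dynamic loader whether these shared libraries are visible. This is only a sketch and assumes a Linux system with `ldconfig` available:

```shell
# Report, for each required shared library from the list above,
# whether the runtime loader's cache can see it.
for lib in glog double-conversion gtest; do
  if ldconfig -p 2>/dev/null | grep -q "lib${lib}"; then
    echo "found lib${lib}"
  else
    echo "missing lib${lib}"
  fi
done
```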

When compiling Velox, please note that Velox generated static libraries should also be compiled as position independent code. Also, some OBJECT settings should be removed. For these two changes, please refer to this commit [Velox Compiling](https://github.com/rui-mo/velox/commit/b436af6b942b18e7f9dbd15c1e8eea49397e164a).
Gazelle-Jni will try to find the above libraries in the system lib paths.
If they are not installed there, please copy them to the system lib paths,
or change the search paths specified in [CMakeLists.txt](https://github.com/oap-project/gazelle-jni/blob/velox_dev/cpp/src/CMakeLists.txt).

```cmake
set(SYSTEM_LIB_PATH "/usr/lib" CACHE PATH "System Lib dir")
set(SYSTEM_LIB64_PATH "/usr/lib64" CACHE PATH "System Lib64 dir")
set(SYSTEM_LOCAL_LIB_PATH "/usr/local/lib" CACHE PATH "System Local Lib dir")
set(SYSTEM_LOCAL_LIB64_PATH "/usr/local/lib64" CACHE PATH "System Local Lib64 dir")
```
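Since these are CMake cache variables, they can also be overridden from the command line instead of editing the file. A sketch, with the `/opt` paths as placeholders (only the variable names come from the snippet above):

```shell
# Hypothetical out-of-source build; -D sets a CMake cache entry, which
# takes precedence over the set(... CACHE ...) defaults shown above.
CMAKE_ARGS="-DSYSTEM_LIB_PATH=/opt/lib -DSYSTEM_LOCAL_LIB_PATH=/opt/local/lib"
echo "cmake ${CMAKE_ARGS} .."
```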

Secondly, when compiling Velox, please note that the static libraries generated by Velox should also be compiled as position independent code.
Also, some OBJECT settings in the CMakeLists are removed in order to obtain the static libraries.
For these two changes, please refer to this commit: [Velox Compiling](https://github.com/rui-mo/velox/commit/b436af6b942b18e7f9dbd15c1e8eea49397e164a).

### An example of Velox computing in Spark based on Gazelle-Jni

TPC-H Q6 is supported in Gazelle-Jni base on Velox computing. Current support has several limitations:
TPC-H Q6 is supported in Gazelle-Jni based on Velox computing. The current support still has several limitations:

- Only Double type is supported.
- Only single-thread is supported.
@@ -29,6 +46,8 @@ TPC-H Q6 is supported in Gazelle-Jni based on Velox computing. Current support ha

#### Build Gazelle Jni with Velox

Please specify velox_home when compiling Gazelle-Jni with Velox.

``` shell
git clone -b velox_dev https://github.com/oap-project/gazelle-jni.git
cd gazelle-jni
@@ -45,21 +64,22 @@ Based on the different environment, there are some parameters that can be set via -D
| build_protobuf | Build Protobuf from Source. If set to False, default library path will be used to find protobuf library. | True |
| velox_home | When building Gazelle-Jni with Velox, the location of Velox should be set. | /root/velox |

When build_arrow set to True, the build_arrow.sh will be launched and compile a custom arrow library from [OAP Arrow](https://github.com/oap-project/arrow/tree/arrow-4.0.0-oap)
If you wish to change any parameters from Arrow, you can change it from the `build_arrow.sh` script under `native-sql-engine/arrow-data-source/script/`.
When build_arrow is set to True, build_arrow.sh will be launched to compile a custom Arrow library from [OAP Arrow](https://github.com/oap-project/arrow/tree/arrow-4.0.0-oap).
If you wish to change any Arrow parameters, you can do so in the [build_arrow.sh](../tools/build_arrow.sh) script.

#### test TPC-H Q6 on Gazelle-Jni with Velox computing
#### Test TPC-H Q6 on Gazelle-Jni with Velox computing

##### Data preparation

The ORC table can be generated by conversion from the Parquet format.
- Considering only Hive LRE V1 is supported in Velox, below Spark option was adopted when generating ORC data.
Considering that only Hive RLE V1 is supported in Velox, the Spark option below was adopted when generating the ORC data.

```shell script
--conf spark.hive.exec.orc.write.format=0.11
```

- Considering Velox's support for Decimal, Date, Long types are not fully ready, the related columns of TPC-H Q6 were all transformed into Double type.
Considering that Velox's support for the Decimal, Date, and Long types is not fully ready, the related columns of TPC-H Q6 were all transformed into Double type.
The script below shows how to convert Parquet into ORC format and transform the TPC-H Q6 related columns into Double type.
To align with this data type change, the TPC-H Q6 query was changed accordingly.

```shell script
for (filePath <- fileLists) {
@@ -79,17 +99,26 @@ spark.executor.extraClassPath ${GAZELLE_JNI_HOME}/jvm/target/gazelle-jni-jvm-<ve

##### Submit the Spark SQL job

The modified TPC-H Q6 query is:

```sql
select sum(l_extendedprice * l_discount) as revenue from lineitem where l_shipdate_new >= 8766 and l_shipdate_new < 9131 and l_discount between .06 - 0.01 and .06 + 0.01 and l_quantity < 24
```
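The integer bounds in the predicate come from representing `l_shipdate` as days since the Unix epoch: the original Q6 filter is `l_shipdate >= date '1994-01-01' and l_shipdate < date '1995-01-01'`. A quick check of the constants, assuming GNU `date` is available:

```shell
# Days since 1970-01-01 (UTC) for a given calendar date; GNU date assumed.
days_since_epoch() { echo $(( $(date -u -d "$1" +%s) / 86400 )); }
days_since_epoch 1994-01-01   # 8766
days_since_epoch 1995-01-01   # 9131
```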
The script below, saved as `tpch_q6.scala`, shows how to read the ORC data and submit the modified TPC-H Q6 query.
```scala
val lineitem = spark.read.format("orc").load("file:///mnt/lineitem_orcs")
lineitem.createOrReplaceTempView("lineitem")
// The modified TPC-H Q6 query
time{spark.sql("select sum(l_extendedprice*l_discount) as revenue from lineitem where l_shipdate_new >= 8766 and l_shipdate_new < 9131 and l_discount between .06 - 0.01 and .06 + 0.01 and l_quantity < 24").show}
time{spark.sql("select sum(l_extendedprice * l_discount) as revenue from lineitem where l_shipdate_new >= 8766 and l_shipdate_new < 9131 and l_discount between .06 - 0.01 and .06 + 0.01 and l_quantity < 24").show}
```
Submit test script from spark-shell
Submit the test script from spark-shell. Please note that only single-threaded execution is currently supported.
```shell script
cat tpch_q6.scala | spark-shell --name tpch_col_q6 --num-executors 24 --driver-memory 20g --executor-memory 25g --executor-cores 6 --master local --deploy-mode client --conf spark.executor.memoryOverhead=5g --conf spark.plugins=com.intel.oap.GazellePlugin --conf spark.driver.extraClassPath=${dep_jar} --conf spark.executor.extraClassPath=${dep_jar} --conf spark.sql.join.preferSortMergeJoin=false --conf spark.sql.inMemoryColumnarStorage.batchSize=${batchsize} --conf spark.sql.execution.arrow.maxRecordsPerBatch=${batchsize} --conf spark.sql.parquet.columnarReaderBatchSize=${batchsize} --conf spark.sql.autoBroadcastJoinThreshold=30M --conf spark.sql.broadcastTimeout=300 --conf spark.sql.crossJoin.enabled=true --conf spark.driver.maxResultSize=32g --conf spark.sql.shuffle.partitions=200 --conf spark.memory.offHeap.enabled=false --conf spark.memory.offHeap.size=20g --conf spark.sql.adaptive.enabled=true --conf spark.kryoserializer.buffer.max=512m --conf spark.kryoserializer.buffer=128m --conf spark.oap.sql.columnar.preferColumnar=true --conf spark.sql.columnar.sort.broadcast.cache.timeout=300 --conf spark.oap.sql.columnar.shuffle.customizedCompression.codec=lz4 --conf spark.executorEnv.LIBARROW_DIR=/usr/local/lib64 --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager --conf spark.oap.sql.columnar.wholestagecodegen=true --conf spark.sql.files.maxPartitionBytes=1280000000
cat tpch_q6.scala | spark-shell --name tpch_velox_q6 --master local --conf spark.plugins=com.intel.oap.GazellePlugin --conf spark.driver.extraClassPath=${gazelle_jvm_jar} --conf spark.executor.extraClassPath=${gazelle_jvm_jar} --conf spark.oap.sql.columnar.preferColumnar=true --conf spark.oap.sql.columnar.wholestagecodegen=true --conf spark.memory.offHeap.size=20g
```
##### Result
@@ -98,7 +127,7 @@ cat tpch_q6.scala | spark-shell --name tpch_col_q6 --num-executors 24 --driver-m
##### Performance
Below show the TPC-H Q6 Performance comparison in this single-thread test.
The table below shows the TPC-H Q6 performance in this single-thread test for Velox and vanilla Spark.
| TPC-H Q6 Performance | Velox | Vanilla Spark on Parquet Data | Vanilla Spark on ORC Data |
| ---------- | ----------- | ------------- | ------------- |
71 changes: 44 additions & 27 deletions jvm/pom.xml
@@ -36,13 +36,13 @@
<hive.parquet.group>com.twitter</hive.parquet.group>
<parquet.deps.scope>provided</parquet.deps.scope>
<jars.target.dir>${project.build.directory}/scala-${scala.binary.version}/jars</jars.target.dir>
<nativesql.cpp_tests>${cpp_tests}</nativesql.cpp_tests>
<nativesql.build_arrow>OFF</nativesql.build_arrow>
<nativesql.static_arrow>${static_arrow}</nativesql.static_arrow>
<nativesql.arrow.bfs.install.dir>${project.basedir}/../../arrow-data-source/script/build/arrow_install</nativesql.arrow.bfs.install.dir>
<nativesql.arrow_root>${arrow_root}</nativesql.arrow_root>
<nativesql.build_protobuf>${build_protobuf}</nativesql.build_protobuf>
<nativesql.build_jemalloc>${build_jemalloc}</nativesql.build_jemalloc>
<jvm.cpp_tests>${cpp_tests}</jvm.cpp_tests>
<jvm.build_arrow>OFF</jvm.build_arrow>
<jvm.static_arrow>${static_arrow}</jvm.static_arrow>
<jvm.arrow.bfs.install.dir>${project.basedir}/../tools/build/arrow_install</jvm.arrow.bfs.install.dir>
<jvm.arrow_root>${arrow_root}</jvm.arrow_root>
<jvm.build_protobuf>${build_protobuf}</jvm.build_protobuf>
<jvm.build_jemalloc>${build_jemalloc}</jvm.build_jemalloc>
<!-- protobuf paths -->
<protobuf.input.directory>${project.basedir}/src/main/java/com/intel/oap/substrait/binary</protobuf.input.directory>
<protobuf.output.directory>${project.build.directory}/generated-sources</protobuf.output.directory>
@@ -351,26 +351,43 @@
<groupId>org.codehaus.mojo</groupId>
<version>1.6.0</version>
<executions>
<execution>
<id>Build cpp</id>
<phase>generate-resources</phase>
<goals>
<goal>exec</goal>
</goals>
<configuration>
<executable>bash</executable>
<arguments>
<argument>${cpp.dir}/compile.sh</argument>
<argument>${nativesql.cpp_tests}</argument>
<argument>${nativesql.build_arrow}</argument>
<argument>${nativesql.static_arrow}</argument>
<argument>${nativesql.build_protobuf}</argument>
<argument>${nativesql.arrow_root}</argument>
<argument>${nativesql.arrow.bfs.install.dir}</argument>
<argument>${nativesql.build_jemalloc}</argument>
</arguments>
</configuration>
</execution>
<execution>
<id>Build arrow</id>
<phase>generate-resources</phase>
<goals>
<goal>exec</goal>
</goals>
<configuration>
<executable>bash</executable>
<arguments>
<argument>${arrow.script.dir}/build_arrow.sh</argument>
<argument>--tests=${cpp_tests}</argument>
<argument>--build_arrow=${build_arrow}</argument>
<argument>--static_arrow=${static_arrow}</argument>
<argument>--arrow_root=${arrow_root}</argument>
</arguments>
</configuration>
</execution>
<execution>
<id>Build cpp</id>
<phase>generate-resources</phase>
<goals>
<goal>exec</goal>
</goals>
<configuration>
<executable>bash</executable>
<arguments>
<argument>${cpp.dir}/compile.sh</argument>
<argument>${jvm.cpp_tests}</argument>
<argument>${jvm.build_arrow}</argument>
<argument>${jvm.static_arrow}</argument>
<argument>${jvm.build_protobuf}</argument>
<argument>${jvm.arrow_root}</argument>
<argument>${jvm.arrow.bfs.install.dir}</argument>
<argument>${jvm.build_jemalloc}</argument>
</arguments>
</configuration>
</execution>
</executions>
</plugin>
<!-- copy protoc binary into build directory -->
3 changes: 1 addition & 2 deletions pom.xml
@@ -118,11 +118,10 @@
<hadoop.version>${hadoop.version}</hadoop.version>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
<arrow.script.dir>${project.basedir}/script</arrow.script.dir>
<arrow.script.dir>${project.basedir}/../tools</arrow.script.dir>
<cpp_tests>OFF</cpp_tests>
<build_arrow>ON</build_arrow>
<static_arrow>OFF</static_arrow>
<arrow.install.dir>${arrow.script.dir}/build/arrow_install</arrow.install.dir>
<arrow_root>/usr/local</arrow_root>
<build_protobuf>ON</build_protobuf>
<build_jemalloc>ON</build_jemalloc>
