diff --git a/README.md b/README.md
index 73d0e7e67..8ad99b936 100644
--- a/README.md
+++ b/README.md
@@ -42,7 +42,7 @@ We implemented columnar shuffle to improve the shuffle performance. With the col
 
 ### Building by Conda
 
-If you already have a working Hadoop Spark Cluster, we provide a Conda package which will automatically install dependencies needed by OAP, you can refer to [OAP-Installation-Guide](../docs/OAP-Installation-Guide.md) for more information. Once finished [OAP-Installation-Guide](../docs/OAP-Installation-Guide.md), you can find built `spark-columnar-core-1.0.0-jar-with-dependencies.jar` under `$HOME/miniconda2/envs/oapenv/oap_jars`.
+If you already have a working Hadoop Spark cluster, we provide a Conda package that automatically installs the dependencies needed by OAP; refer to the [OAP-Installation-Guide](./docs/OAP-Installation-Guide.md) for more information. Once you have finished the [OAP-Installation-Guide](./docs/OAP-Installation-Guide.md), you can find the built `spark-columnar-core--jar-with-dependencies.jar` under `$HOME/miniconda2/envs/oapenv/oap_jars`.
 Then you can just skip below steps and jump to Getting Started [Get Started](#get-started).
 
 ### Building by yourself
@@ -61,7 +61,7 @@ Please check the document [Installation Guide](./docs/Installation.md)
 Please check the document [Configuration Guide](./docs/Configuration.md)
 
 ## Get started
-To enable OAP NativeSQL Engine, the previous built jar `spark-columnar-core-1.0.0-jar-with-dependencies.jar` should be added to Spark configuration. We also recommend to use `spark-arrow-datasource-standard-1.0.0-jar-with-dependencies.jar`. We will demonstrate an example by using both jar files.
+To enable OAP NativeSQL Engine, the previously built jar `spark-columnar-core--jar-with-dependencies.jar` should be added to the Spark configuration. We also recommend using `spark-arrow-datasource-standard--jar-with-dependencies.jar`. We will demonstrate an example using both jar files.
 SPARK related options are:
 
 * `spark.driver.extraClassPath` : Set to load jar file to driver.
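The Get started hunks above and below only show how the jars are wired into the Spark configuration. As an illustration of the "simple projection on one parquet table" check that the README refers to, a minimal spark-shell sketch could look like the following; the HDFS path and the TPC-H `lineitem` column names are assumptions for illustration and are not part of this patch:

```scala
// Minimal verification sketch; adjust the path and columns to your own TPC-H Parquet data.
// Run inside a spark-shell started with the --conf and --jars options shown in the README.
val lineitem = spark.read.parquet("hdfs:///tpch/lineitem")
lineitem.createOrReplaceTempView("lineitem")

// A simple projection; with the native SQL engine enabled, explain() is expected to show
// columnar operators injected by com.intel.oap.ColumnarPlugin rather than the default row-based plan.
val df = spark.sql(
  "SELECT l_orderkey, l_extendedprice * (1 - l_discount) AS revenue FROM lineitem LIMIT 10")
df.explain()
df.show()
```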
@@ -79,8 +79,8 @@ ${SPARK_HOME}/bin/spark-shell \
   --verbose \
   --master yarn \
   --driver-memory 10G \
-  --conf spark.driver.extraClassPath=$PATH_TO_JAR/spark-arrow-datasource-standard-1.0.0-jar-with-dependencies.jar:$PATH_TO_JAR/spark-columnar-core-1.0.0-jar-with-dependencies.jar \
-  --conf spark.executor.extraClassPath=$PATH_TO_JAR/spark-arrow-datasource-standard-1.0.0-jar-with-dependencies.jar:$PATH_TO_JAR/spark-columnar-core-1.0.0-jar-with-dependencies.jar \
+  --conf spark.driver.extraClassPath=$PATH_TO_JAR/spark-arrow-datasource-standard--jar-with-dependencies.jar:$PATH_TO_JAR/spark-columnar-core--jar-with-dependencies.jar \
+  --conf spark.executor.extraClassPath=$PATH_TO_JAR/spark-arrow-datasource-standard--jar-with-dependencies.jar:$PATH_TO_JAR/spark-columnar-core--jar-with-dependencies.jar \
   --conf spark.driver.cores=1 \
   --conf spark.executor.instances=12 \
   --conf spark.executor.cores=6 \
@@ -91,7 +91,7 @@ ${SPARK_HOME}/bin/spark-shell \
   --conf spark.sql.shuffle.partitions=72 \
   --conf spark.executorEnv.ARROW_LIBHDFS3_DIR="$PATH_TO_LIBHDFS3_DIR/" \
   --conf spark.executorEnv.LD_LIBRARY_PATH="$PATH_TO_LIBHDFS3_DEPENDENCIES_DIR"
-  --jars $PATH_TO_JAR/spark-arrow-datasource-standard-1.0.0-jar-with-dependencies.jar,$PATH_TO_JAR/spark-columnar-core-1.0.0-jar-with-dependencies.jar
+  --jars $PATH_TO_JAR/spark-arrow-datasource-standard--jar-with-dependencies.jar,$PATH_TO_JAR/spark-columnar-core--jar-with-dependencies.jar
 ```
 
 Here is one example to verify if native sql engine works, make sure you have TPC-H dataset. We could do a simple projection on one parquet table. For detailed testing scripts, please refer to [Solution Guide](https://github.com/Intel-bigdata/Solution_navigator/tree/master/nativesql).
diff --git a/docs/Configuration.md b/docs/Configuration.md
index 6b519d3f0..b20b46f0e 100644
--- a/docs/Configuration.md
+++ b/docs/Configuration.md
@@ -11,8 +11,8 @@ spark.sql.extensions com.intel.oap.ColumnarPlugin
 spark.shuffle.manager org.apache.spark.shuffle.sort.ColumnarShuffleManager
 
 # note native sql engine depends on arrow data source
-spark.driver.extraClassPath $HOME/miniconda2/envs/oapenv/oap_jars/spark-columnar-core-1.0.0-jar-with-dependencies.jar:$HOME/miniconda2/envs/oapenv/oap_jars/spark-arrow-datasource-standard-1.0.0-jar-with-dependencies.jar
-spark.executor.extraClassPath $HOME/miniconda2/envs/oapenv/oap_jars/spark-columnar-core-1.0.0-jar-with-dependencies.jar:$HOME/miniconda2/envs/oapenv/oap_jars/spark-arrow-datasource-standard-1.0.0-jar-with-dependencies.jar
+spark.driver.extraClassPath $HOME/miniconda2/envs/oapenv/oap_jars/spark-columnar-core--jar-with-dependencies.jar:$HOME/miniconda2/envs/oapenv/oap_jars/spark-arrow-datasource-standard--jar-with-dependencies.jar
+spark.executor.extraClassPath $HOME/miniconda2/envs/oapenv/oap_jars/spark-columnar-core--jar-with-dependencies.jar:$HOME/miniconda2/envs/oapenv/oap_jars/spark-arrow-datasource-standard--jar-with-dependencies.jar
 
 spark.executorEnv.LIBARROW_DIR $HOME/miniconda2/envs/oapenv
 spark.executorEnv.CC $HOME/miniconda2/envs/oapenv/bin/gcc
@@ -20,9 +20,10 @@ spark.executorEnv.CC $HOME/miniconda2/envs/oapenv/bin/gcc
 ```
 
 Before you start spark, you must use below command to add some environment variables.
-```shell script
+
+```
 export CC=$HOME/miniconda2/envs/oapenv/bin/gcc
 export LIBARROW_DIR=$HOME/miniconda2/envs/oapenv/
 ```
 
-About spark-arrow-datasource.jar, you can refer [Unified Arrow Data Source ](https://oap-project.github.io/arrow-data-source/).
+For more information about arrow-data-source.jar, refer to [Unified Arrow Data Source](https://oap-project.github.io/arrow-data-source/).
diff --git a/docs/User-Guide.md b/docs/User-Guide.md
index f29151cd2..c3c05cebf 100644
--- a/docs/User-Guide.md
+++ b/docs/User-Guide.md
@@ -38,7 +38,7 @@ We implemented columnar shuffle to improve the shuffle performance. With the col
 
 ### Building by Conda
 
-If you already have a working Hadoop Spark Cluster, we provide a Conda package which will automatically install dependencies needed by OAP, you can refer to [OAP-Installation-Guide](./OAP-Installation-Guide.md) for more information. Once finished [OAP-Installation-Guide](./OAP-Installation-Guide.md), you can find built `spark-columnar-core-1.0.0-jar-with-dependencies.jar` under `$HOME/miniconda2/envs/oapenv/oap_jars`.
+If you already have a working Hadoop Spark cluster, we provide a Conda package that automatically installs the dependencies needed by OAP; refer to the [OAP-Installation-Guide](./OAP-Installation-Guide.md) for more information. Once you have finished the [OAP-Installation-Guide](./OAP-Installation-Guide.md), you can find the built `spark-columnar-core--jar-with-dependencies.jar` under `$HOME/miniconda2/envs/oapenv/oap_jars`.
 Then you can just skip below steps and jump to Getting Started [Get Started](#get-started).
 
 ### Building by yourself
@@ -57,7 +57,7 @@ Please check the document [Installation Guide](./Installation.md)
 Please check the document [Configuration Guide](./Configuration.md)
 
 ## Get started
-To enable OAP NativeSQL Engine, the previous built jar `spark-columnar-core-1.0.0-jar-with-dependencies.jar` should be added to Spark configuration. We also recommend to use `spark-arrow-datasource-standard-1.0.0-jar-with-dependencies.jar`. We will demonstrate an example by using both jar files.
+To enable OAP NativeSQL Engine, the previously built jar `spark-columnar-core--jar-with-dependencies.jar` should be added to the Spark configuration. We also recommend using `spark-arrow-datasource-standard--jar-with-dependencies.jar`. We will demonstrate an example using both jar files.
 SPARK related options are:
 
 * `spark.driver.extraClassPath` : Set to load jar file to driver.
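The spark-defaults.conf entries patched in docs/Configuration.md above are easy to get wrong silently. One way to confirm they actually reached a running session is to read them back from the SparkConf in spark-shell; this is only an illustrative sketch using standard Spark APIs, and the expected values in the comments simply mirror the patched Configuration.md:

```scala
// Read the settings from docs/Configuration.md back out of the live session.
// getOption returns None if a key never made it into the SparkConf.
val conf = spark.sparkContext.getConf
println(conf.getOption("spark.sql.extensions"))          // expect Some(com.intel.oap.ColumnarPlugin)
println(conf.getOption("spark.shuffle.manager"))         // expect Some(org.apache.spark.shuffle.sort.ColumnarShuffleManager)
println(conf.getOption("spark.executor.extraClassPath")) // expect both OAP jars, colon separated
```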
@@ -75,8 +75,8 @@ ${SPARK_HOME}/bin/spark-shell \
   --verbose \
   --master yarn \
   --driver-memory 10G \
-  --conf spark.driver.extraClassPath=$PATH_TO_JAR/spark-arrow-datasource-standard-1.0.0-jar-with-dependencies.jar:$PATH_TO_JAR/spark-columnar-core-1.0.0-jar-with-dependencies.jar \
-  --conf spark.executor.extraClassPath=$PATH_TO_JAR/spark-arrow-datasource-standard-1.0.0-jar-with-dependencies.jar:$PATH_TO_JAR/spark-columnar-core-1.0.0-jar-with-dependencies.jar \
+  --conf spark.driver.extraClassPath=$PATH_TO_JAR/spark-arrow-datasource-standard--jar-with-dependencies.jar:$PATH_TO_JAR/spark-columnar-core--jar-with-dependencies.jar \
+  --conf spark.executor.extraClassPath=$PATH_TO_JAR/spark-arrow-datasource-standard--jar-with-dependencies.jar:$PATH_TO_JAR/spark-columnar-core--jar-with-dependencies.jar \
   --conf spark.driver.cores=1 \
   --conf spark.executor.instances=12 \
   --conf spark.executor.cores=6 \
@@ -87,7 +87,7 @@ ${SPARK_HOME}/bin/spark-shell \
   --conf spark.sql.shuffle.partitions=72 \
   --conf spark.executorEnv.ARROW_LIBHDFS3_DIR="$PATH_TO_LIBHDFS3_DIR/" \
   --conf spark.executorEnv.LD_LIBRARY_PATH="$PATH_TO_LIBHDFS3_DEPENDENCIES_DIR"
-  --jars $PATH_TO_JAR/spark-arrow-datasource-standard-1.0.0-jar-with-dependencies.jar,$PATH_TO_JAR/spark-columnar-core-1.0.0-jar-with-dependencies.jar
+  --jars $PATH_TO_JAR/spark-arrow-datasource-standard--jar-with-dependencies.jar,$PATH_TO_JAR/spark-columnar-core--jar-with-dependencies.jar
 ```
 
 Here is one example to verify if native sql engine works, make sure you have TPC-H dataset. We could do a simple projection on one parquet table. For detailed testing scripts, please refer to [Solution Guide](https://github.com/Intel-bigdata/Solution_navigator/tree/master/nativesql).
diff --git a/docs/index.md b/docs/index.md
index f29151cd2..a0662883f 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -57,7 +57,7 @@ Please check the document [Installation Guide](./Installation.md)
 Please check the document [Configuration Guide](./Configuration.md)
 
 ## Get started
-To enable OAP NativeSQL Engine, the previous built jar `spark-columnar-core-1.0.0-jar-with-dependencies.jar` should be added to Spark configuration. We also recommend to use `spark-arrow-datasource-standard-1.0.0-jar-with-dependencies.jar`. We will demonstrate an example by using both jar files.
+To enable OAP NativeSQL Engine, the previously built jar `spark-columnar-core--jar-with-dependencies.jar` should be added to the Spark configuration. We also recommend using `spark-arrow-datasource-standard--jar-with-dependencies.jar`. We will demonstrate an example using both jar files.
 SPARK related options are:
 
 * `spark.driver.extraClassPath` : Set to load jar file to driver.
@@ -75,8 +75,8 @@ ${SPARK_HOME}/bin/spark-shell \ --verbose \ --master yarn \ --driver-memory 10G \ - --conf spark.driver.extraClassPath=$PATH_TO_JAR/spark-arrow-datasource-standard-1.0.0-jar-with-dependencies.jar:$PATH_TO_JAR/spark-columnar-core-1.0.0-jar-with-dependencies.jar \ - --conf spark.executor.extraClassPath=$PATH_TO_JAR/spark-arrow-datasource-standard-1.0.0-jar-with-dependencies.jar:$PATH_TO_JAR/spark-columnar-core-1.0.0-jar-with-dependencies.jar \ + --conf spark.driver.extraClassPath=$PATH_TO_JAR/spark-arrow-datasource-standard--jar-with-dependencies.jar:$PATH_TO_JAR/spark-columnar-core--jar-with-dependencies.jar \ + --conf spark.executor.extraClassPath=$PATH_TO_JAR/spark-arrow-datasource-standard--jar-with-dependencies.jar:$PATH_TO_JAR/spark-columnar-core--jar-with-dependencies.jar \ --conf spark.driver.cores=1 \ --conf spark.executor.instances=12 \ --conf spark.executor.cores=6 \ @@ -87,7 +87,7 @@ ${SPARK_HOME}/bin/spark-shell \ --conf spark.sql.shuffle.partitions=72 \ --conf spark.executorEnv.ARROW_LIBHDFS3_DIR="$PATH_TO_LIBHDFS3_DIR/" \ --conf spark.executorEnv.LD_LIBRARY_PATH="$PATH_TO_LIBHDFS3_DEPENDENCIES_DIR" - --jars $PATH_TO_JAR/spark-arrow-datasource-standard-1.0.0-jar-with-dependencies.jar,$PATH_TO_JAR/spark-columnar-core-1.0.0-jar-with-dependencies.jar + --jars $PATH_TO_JAR/spark-arrow-datasource-standard--jar-with-dependencies.jar,$PATH_TO_JAR/spark-columnar-core--jar-with-dependencies.jar ``` Here is one example to verify if native sql engine works, make sure you have TPC-H dataset. We could do a simple projection on one parquet table. For detailed testing scripts, please refer to [Solution Guide](https://github.com/Intel-bigdata/Solution_navigator/tree/master/nativesql). diff --git a/resource/ApacheArrowInstallation.md b/resource/ApacheArrowInstallation.md deleted file mode 100644 index 9d5cf3ec8..000000000 --- a/resource/ApacheArrowInstallation.md +++ /dev/null @@ -1,70 +0,0 @@ -# llvm-7.0: -Arrow Gandiva depends on LLVM, and I noticed current version strictly depends on llvm7.0 if you installed any other version rather than 7.0, it will fail. -``` shell -wget http://releases.llvm.org/7.0.1/llvm-7.0.1.src.tar.xz -tar xf llvm-7.0.1.src.tar.xz -cd llvm-7.0.1.src/ -cd tools -wget http://releases.llvm.org/7.0.1/cfe-7.0.1.src.tar.xz -tar xf cfe-7.0.1.src.tar.xz -mv cfe-7.0.1.src clang -cd .. -mkdir build -cd build -cmake .. -DCMAKE_BUILD_TYPE=Release -cmake --build . -j -cmake --build . --target install -# check if clang has also been compiled, if no -cd tools/clang -mkdir build -cd build -cmake .. -make -j -make install -``` - -# cmake: -Arrow will download package during compiling, in order to support SSL in cmake, build cmake is optional. -``` shell -wget https://github.com/Kitware/CMake/releases/download/v3.15.0-rc4/cmake-3.15.0-rc4.tar.gz -tar xf cmake-3.15.0-rc4.tar.gz -cd cmake-3.15.0-rc4/ -./bootstrap --system-curl --parallel=64 #parallel num depends on your server core number -make -j -make install -cmake --version -cmake version 3.15.0-rc4 -``` - -# Apache Arrow -``` shell -git clone https://github.com/Intel-bigdata/arrow.git -cd arrow && git checkout branch-0.17.0-oap-1.0 -mkdir -p arrow/cpp/release-build -cd arrow/cpp/release-build -cmake -DARROW_DEPENDENCY_SOURCE=BUNDLED -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_HDFS=ON -DARROW_BOOST_USE_SHARED=ON -DARROW_JNI=ON -DARROW_DATASET=ON -DARROW_WITH_PROTOBUF=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_LZ4=ON -DARROW_FILESYSTEM=ON -DARROW_JSON=ON .. 
-make -j -make install - -# build java -cd ../../java -# change property 'arrow.cpp.build.dir' to the relative path of cpp build dir in gandiva/pom.xml -mvn clean install -P arrow-jni -am -Darrow.cpp.build.dir=../cpp/release-build/release/ -DskipTests -# if you are behine proxy, please also add proxy for socks -mvn clean install -P arrow-jni -am -Darrow.cpp.build.dir=../cpp/release-build/release/ -DskipTests -DsocksProxyHost=${proxyHost} -DsocksProxyPort=1080 -``` - -run test -``` shell -mvn test -pl adapter/parquet -P arrow-jni -mvn test -pl gandiva -P arrow-jni -``` - -# Copy binary files to oap-native-sql resources directory -Because oap-native-sql plugin will build a stand-alone jar file with arrow dependency, if you choose to build Arrow by yourself, you have to copy below files as a replacement from the original one. -You can find those files in Apache Arrow installation directory or release directory. Below example assume Apache Arrow has been installed on /usr/local/lib64 -``` shell -cp /usr/local/lib64/libarrow.so.17 $oap-dir/oap-native-sql/cpp/src/resources -cp /usr/local/lib64/libgandiva.so.17 $oap-dir/oap-native-sql/cpp/src/resources -cp /usr/local/lib64/libparquet.so.17 $oap-dir/oap-native-sql/cpp/src/resources -``` diff --git a/resource/Configuration.md b/resource/Configuration.md deleted file mode 100644 index c90419df9..000000000 --- a/resource/Configuration.md +++ /dev/null @@ -1,28 +0,0 @@ -# Spark Configurations for Native SQL Engine - -Add below configuration to spark-defaults.conf - -``` -##### Columnar Process Configuration - -spark.sql.sources.useV1SourceList avro -spark.sql.join.preferSortMergeJoin false -spark.sql.extensions com.intel.oap.ColumnarPlugin -spark.shuffle.manager org.apache.spark.shuffle.sort.ColumnarShuffleManager - -# note native sql engine depends on arrow data source -spark.driver.extraClassPath $HOME/miniconda2/envs/oapenv/oap_jars/spark-columnar-core-1.0.0-jar-with-dependencies.jar:$HOME/miniconda2/envs/oapenv/oap_jars/spark-arrow-datasource-standard-1.0.0-jar-with-dependencies.jar -spark.executor.extraClassPath $HOME/miniconda2/envs/oapenv/oap_jars/spark-columnar-core-1.0.0-jar-with-dependencies.jar:$HOME/miniconda2/envs/oapenv/oap_jars/spark-arrow-datasource-standard-1.0.0-jar-with-dependencies.jar - -spark.executorEnv.LIBARROW_DIR $HOME/miniconda2/envs/oapenv -spark.executorEnv.CC $HOME/miniconda2/envs/oapenv/bin/gcc -###### -``` - -Before you start spark, you must use below command to add some environment variables. -```shell script -export CC=$HOME/miniconda2/envs/oapenv/bin/gcc -export LIBARROW_DIR=$HOME/miniconda2/envs/oapenv/ -``` - -About spark-arrow-datasource.jar, you can refer [Unified Arrow Data Source ](../../oap-data-source/arrow/README.md). diff --git a/resource/Installation.md b/resource/Installation.md deleted file mode 100644 index eab5ca5a0..000000000 --- a/resource/Installation.md +++ /dev/null @@ -1,31 +0,0 @@ -# Spark Native SQL Engine Installation - -For detailed testing scripts, please refer to [solution guide](https://github.com/Intel-bigdata/Solution_navigator/tree/master/nativesql) - -## Install Googletest and Googlemock - -``` shell -yum install gtest-devel -yum install gmock -``` - -## Build Native SQL Engine - -``` shell -git clone https://github.com/Intel-bigdata/OAP.git -cd OAP && git checkout branch-1.0-spark-3.x -cd oap-native-sql -cd cpp/ -mkdir build/ -cd build/ -cmake .. 
-DTESTS=ON -make -j -``` - -``` shell -cd ../../core/ -mvn clean package -DskipTests -``` - -### Additonal Notes -[Notes for Installation Issues](/oap-native-sql/resource/InstallationNotes.md) diff --git a/resource/InstallationNotes.md b/resource/InstallationNotes.md deleted file mode 100644 index cf7120be9..000000000 --- a/resource/InstallationNotes.md +++ /dev/null @@ -1,47 +0,0 @@ -### Notes for Installation Issues -* Before the Installation, if you have installed other version of oap-native-sql, remove all installed lib and include from system path: libarrow* libgandiva* libspark-columnar-jni* - -* libgandiva_jni.so was not found inside JAR - -change property 'arrow.cpp.build.dir' to $ARROW_DIR/cpp/release-build/release/ in gandiva/pom.xml. If you do not want to change the contents of pom.xml, specify it like this: - -``` -mvn clean install -P arrow-jni -am -Darrow.cpp.build.dir=/root/git/t/arrow/cpp/release-build/release/ -DskipTests -Dcheckstyle.skip -``` - -* No rule to make target '../src/protobuf_ep', needed by `src/proto/Exprs.pb.cc' - -remove the existing libprotobuf installation, then the script for find_package() will be able to download protobuf. - -* can't find the libprotobuf.so.13 in the shared lib - -copy the libprotobuf.so.13 from $OAP_DIR/oap-native-sql/cpp/src/resources to /usr/lib64/ - -* unable to load libhdfs: libgsasl.so.7: cannot open shared object file - -libgsasl is missing, run `yum install libgsasl` - -* CentOS 7.7 looks like didn't provide the glibc we required, so binaries packaged on F30 won't work. - -``` -20/04/21 17:46:17 WARN TaskSetManager: Lost task 0.1 in stage 1.0 (TID 2, 10.0.0.143, executor 6): java.lang.UnsatisfiedLinkError: /tmp/libgandiva_jni.sobe729912-3bbe-4bd0-bb96-4c7ce2e62336: /lib64/libm.so.6: version `GLIBC_2.29' not found (required by /tmp/libgandiva_jni.sobe729912-3bbe-4bd0-bb96-4c7ce2e62336) -``` - -* Missing symbols due to old GCC version. - -``` -[root@vsr243 release-build]# nm /usr/local/lib64/libparquet.so | grep ZN5boost16re_detail_10710012perl_matcherIN9__gnu_cxx17__normal_iteratorIPKcSsEESaINS_9sub_matchIS6_EEENS_12regex_traitsIcNS_16cpp_regex_traitsIcEEEEE14construct_initERKNS_11basic_regexIcSD_EENS_15regex_constants12_match_flagsE -_ZN5boost16re_detail_10710012perl_matcherIN9__gnu_cxx17__normal_iteratorIPKcSsEESaINS_9sub_matchIS6_EEENS_12regex_traitsIcNS_16cpp_regex_traitsIcEEEEE14construct_initERKNS_11basic_regexIcSD_EENS_15regex_constants12_match_flagsE -``` - -Need to compile all packags with newer GCC: - -``` -[root@vsr243 ~]# export CXX=/usr/local/bin/g++ -[root@vsr243 ~]# export CC=/usr/local/bin/gcc -``` - -* Can not connect to hdfs @sr602 - -vsr606, vsr243 are both not able to connect to hdfs @sr602, need to skipTests to generate the jar - diff --git a/resource/Prerequisite.md b/resource/Prerequisite.md deleted file mode 100644 index b4bbe1226..000000000 --- a/resource/Prerequisite.md +++ /dev/null @@ -1,136 +0,0 @@ -# Prerequite -There are some requirements before you build the project. -Please make sure you have already installed the software in your system. - -1. gcc 9.3 or higher version -2. java8 OpenJDK -> yum install java-1.8.0-openjdk -3. cmake 3.2 or higher version -4. maven 3.1.1 or higher version -5. Hadoop 2.7.5 or higher version -6. Spark 3.0.0 or higher version -7. Intel Optimized Arrow 0.17.0 - -## gcc installation - -// installing gcc 9.3 or higher version -Please notes for better performance support, gcc 9.3 is a minimal requirement with Intel Microarchitecture such as SKYLAKE, CASCADELAKE, ICELAKE. 
-https://gcc.gnu.org/install/index.html -Follow the above website to download gcc. -C++ library may ask a certain version, if you are using gcc 9.3 the version would be libstdc++.so.6.0.28. -You may have to launch ./contrib/download_prerequisites command to install all the prerequisites for gcc. -If you are facing downloading issue in download_prerequisites command, you can try to change ftp to http. - -//Follow the steps to configure gcc -https://gcc.gnu.org/install/configure.html -If you are facing a multilib issue, you can try to add --disable-multilib parameter in ../configure - -//Follow the steps to build gcc -https://gcc.gnu.org/install/build.html - -//Follow the steps to install gcc -https://gcc.gnu.org/install/finalinstall.html - -//Set up Environment for new gcc -export PATH=$YOUR_GCC_INSTALLATION_DIR/bin:$PATH -export LD_LIBRARY_PATH=$YOUR_GCC_INSTALLATION_DIR/lib64:$LD_LIBRARY_PATH -Please remember to add and source the setup in your environment files such as /etc/profile or /etc/bashrc - -//Verify if gcc has been installation -Use gcc -v command to verify if your gcc version is correct.(Must larger than 9.3) - -## cmake installation -If you are facing some trouble when installing cmake, please follow below steps to install cmake. - -``` -// installing cmake 3.2 -sudo yum install cmake3 - -// If you have an existing cmake, you can use below command to set it as an option within alternatives command -sudo alternatives --install /usr/local/bin/cmake cmake /usr/bin/cmake 10 --slave /usr/local/bin/ctest ctest /usr/bin/ctest --slave /usr/local/bin/cpack cpack /usr/bin/cpack --slave /usr/local/bin/ccmake ccmake /usr/bin/ccmake --family cmake - -// Set cmake3 as an option within alternatives command -sudo alternatives --install /usr/local/bin/cmake cmake /usr/bin/cmake3 20 --slave /usr/local/bin/ctest ctest /usr/bin/ctest3 --slave /usr/local/bin/cpack cpack /usr/bin/cpack3 --slave /usr/local/bin/ccmake ccmake /usr/bin/ccmake3 --family cmake - -// Use alternatives to choose cmake version -sudo alternatives --config cmake - -## maven installation - -If you are facing some trouble when installing maven, please follow below steps to install maven - -``` -// installing maven 3.6.3 -Go to https://maven.apache.org/download.cgi and download the specific version of maven - -// Below command use maven 3.6.3 as an example -wget htps://ftp.wayne.edu/apache/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz -wget https://ftp.wayne.edu/apache/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz -tar xzf apache-maven-3.6.3-bin.tar.gz -mkdir /usr/local/maven -mv apache-maven-3.6.3/ /usr/local/maven/ - -// Set maven 3.6.3 as an option within alternatives command -sudo alternatives --install /usr/bin/mvn mvn /usr/local/maven/apache-maven-3.6.3/bin/mvn 1 - -// Use alternatives to choose mvn version -sudo alternatives --config mvn -``` - -## HADOOP/SPARK Installation -If there is no existing Hadoop/Spark installed, Please follow the guide to install your Hadoop/Spark [SPARK/HADOOP Installation](/oap-native-sql/resource/SparkInstallation.md) - -### Hadoop Native Library(Default) - -Please make sure you have set up Hadoop directory properly with Hadoop Native Libraries -By default, Apache Arrow would scan `$HADOOP_HOME` and find the native Hadoop library `libhdfs.so`(under `$HADOOP_HOME/lib/native` directory) to be used for Hadoop client. 
- -You can also use `ARROW_LIBHDFS_DIR` to configure the location of `libhdfs.so` if it is installed in other directory than `$HADOOP_HOME/lib/native` - -If your SPARK and HADOOP are separated in different nodes, please find `libhdfs.so` in your Hadoop cluster and copy it to SPARK cluster, then use one of the above methods to set it properly. - -For more information, please check -Arrow HDFS interface [documentation](https://github.com/apache/arrow/blob/master/cpp/apidoc/HDFS.md) -Hadoop Native Library, please read the official Hadoop website [documentation](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/NativeLibraries.html) - -### Use libhdfs3 library for better performance(Optional) - -For better performance ArrowDataSource reads HDFS files using the third-party library `libhdfs3`. The library must be pre-installed on machines Spark Executor nodes are running on. - -To install the library, use of [Conda](https://docs.conda.io/en/latest/) is recommended. - -``` -// installing libhdfs3 -conda install -c conda-forge libhdfs3 - -// check the installed library file -ll ~/miniconda/envs/$(YOUR_ENV_NAME)/lib/libhdfs3.so -``` - -We also provide a libhdfs3 binary in cpp/src/resources directory. - -To set up libhdfs3, there are two different ways: -Option1: Overwrite the soft link for libhdfs.so -To install libhdfs3.so, you have to create a soft link for libhdfs.so in your Hadoop directory(`$HADOOP_HOME/lib/native` by default). - -``` -ln -f -s libhdfs3.so libhdfs.so -``` - -Option2: -Add env variable to the system -``` -export ARROW_LIBHDFS3_DIR="PATH_TO_LIBHDFS3_DIR/" -``` - -Add following Spark configuration options before running the DataSource to make the library to be recognized: -* `spark.executorEnv.ARROW_LIBHDFS3_DIR = "PATH_TO_LIBHDFS3_DIR/"` -* `spark.executorEnv.LD_LIBRARY_PATH = "PATH_TO_LIBHDFS3_DEPENDENCIES_DIR/"` - -Please notes: If you choose to use libhdfs3.so, there are some other dependency libraries you have to installed such as libprotobuf or libcrypto. - - -## Intel Optimized Apache Arrow Installation - -Intel Optimized Apache Arrow is MANDATORY to be used. However, we have a bundle a compiled arrow libraries(libarrow, libgandiva, libparquet) built by GCC9.3 included in the cpp/src/resources directory. -If you wish to build Apache Arrow by yourself, please follow the guide to build and install Apache Arrow [ArrowInstallation](/oap-native-sql/resource/ApacheArrowInstallation.md) - diff --git a/resource/SparkInstallation.md b/resource/SparkInstallation.md deleted file mode 100644 index 9d2a864ae..000000000 --- a/resource/SparkInstallation.md +++ /dev/null @@ -1,44 +0,0 @@ -### Download Spark 3.0.1 - -Currently Native SQL Engine works on the Spark 3.0.1 version. 
- -``` -wget http://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz -sudo mkdir -p /opt/spark && sudo mv spark-3.0.1-bin-hadoop3.2.tgz /opt/spark -sudo cd /opt/spark && sudo tar -xf spark-3.0.1-bin-hadoop3.2.tgz -export SPARK_HOME=/opt/spark/spark-3.0.1-bin-hadoop3.2/ -``` - -### [Or building Spark from source](https://spark.apache.org/docs/latest/building-spark.html) - -``` shell -git clone https://github.com/intel-bigdata/spark.git -cd spark && git checkout native-sql-engine-clean -# check spark supported hadoop version -grep \ -r pom.xml - 2.7.4 - 3.2.0 -# so we should build spark specifying hadoop version as 3.2 -./build/mvn -Pyarn -Phadoop-3.2 -Dhadoop.version=3.2.0 -DskipTests clean install -``` -Specify SPARK_HOME to spark path - -``` shell -export SPARK_HOME=${HADOOP_PATH} -``` - -### Hadoop building from source - -``` shell -git clone https://github.com/apache/hadoop.git -cd hadoop -git checkout rel/release-3.2.0 -# only build binary for hadoop -mvn clean install -Pdist -DskipTests -Dtar -# build binary and native library such as libhdfs.so for hadoop -# mvn clean install -Pdist,native -DskipTests -Dtar -``` - -``` shell -export HADOOP_HOME=${HADOOP_PATH}/hadoop-dist/target/hadoop-3.2.0/ -``` diff --git a/resource/columnar.png b/resource/columnar.png deleted file mode 100644 index d89074905..000000000 Binary files a/resource/columnar.png and /dev/null differ diff --git a/resource/core_arch.jpg b/resource/core_arch.jpg deleted file mode 100644 index 4f732a4ff..000000000 Binary files a/resource/core_arch.jpg and /dev/null differ diff --git a/resource/dataset.png b/resource/dataset.png deleted file mode 100644 index 5d3e607ab..000000000 Binary files a/resource/dataset.png and /dev/null differ diff --git a/resource/kernel.png b/resource/kernel.png deleted file mode 100644 index f88b002aa..000000000 Binary files a/resource/kernel.png and /dev/null differ diff --git a/resource/nativesql_arch.png b/resource/nativesql_arch.png deleted file mode 100644 index a8304f5af..000000000 Binary files a/resource/nativesql_arch.png and /dev/null differ diff --git a/resource/performance.png b/resource/performance.png deleted file mode 100644 index a4351cd9a..000000000 Binary files a/resource/performance.png and /dev/null differ diff --git a/resource/shuffle.png b/resource/shuffle.png deleted file mode 100644 index 504234536..000000000 Binary files a/resource/shuffle.png and /dev/null differ