Packaging

Steps to package and publish (also described in sarplus.yml):

  1. Package and publish the pip package. For Databricks to install a C++ extension properly, one must take a detour through PyPI. Use twine to upload the package to PyPI.

    # build dependencies
    python -m pip install -U build cibuildwheel pip twine
    
    cd python
    cp ../VERSION ./pysarplus/  # copy version file
    python -m build --sdist
    # build manylinux wheels for Python 3.6-3.10
    for MINOR_VERSION in {6..10}; do
      CIBW_BUILD="cp3${MINOR_VERSION}-manylinux_x86_64" python -m cibuildwheel --platform linux --output-dir dist
    done
    python -m twine upload dist/*
  2. Package the Scala package, which includes the Scala formatter and references the pip package.

    export SARPLUS_VERSION=$(cat VERSION)
    GPG_KEY="<gpg-private-key>"
    GPG_KEY_ID="<gpg-key-id>"
    cd scala
    
    # generate artifacts
    export SPARK_VERSION="3.1.2"
    export HADOOP_VERSION="2.7.4"
    export SCALA_VERSION="2.12.10"
    sbt ++${SCALA_VERSION}! package packageDoc packageSrc makePom
    
    # generate the artifact (sarplus-spark-3-2-plus*.jar) for Spark 3.2+
    export SPARK_VERSION="3.2.1"
    export HADOOP_VERSION="3.3.1"
    export SCALA_VERSION="2.12.14"
    sbt ++${SCALA_VERSION}! package packageDoc packageSrc makePom
    
    # sign with GPG
    cd target/scala-${SCALA_VERSION%.*}
    gpg --import <(cat <<< "${GPG_KEY}")
    for file in {*.jar,*.pom}; do gpg -ab -u "${GPG_KEY_ID}" "${file}"; done
    
    # bundle
    jar cvf sarplus-bundle_2.12-${SARPLUS_VERSION}.jar sarplus_*.jar sarplus_*.pom sarplus_*.asc
    jar cvf sarplus-spark-3.2-plus-bundle_2.12-${SARPLUS_VERSION}.jar sarplus-spark*.jar sarplus-spark*.pom sarplus-spark*.asc

    where SPARK_VERSION, HADOOP_VERSION and SCALA_VERSION should be customized as needed.

  3. Upload the zipped Scala package bundle to the Nexus Repository Manager through a browser (see the publish manual).
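
As an optional sanity check before the upload, the contents of each bundle produced in step 2 can be listed (run from scala/target/scala-2.12, where the bundles were created); this is only a suggested check, not part of the official publish flow:

# list the signed artifacts packed into each bundle
jar tf sarplus-bundle_2.12-${SARPLUS_VERSION}.jar
jar tf sarplus-spark-3.2-plus-bundle_2.12-${SARPLUS_VERSION}.jar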

Testing

To test the Python UDF + C++ backend

# dependencies
python -m pip install -U build pip twine
python -m pip install -U flake8 pytest pytest-cov scikit-learn

# build
cd python
cp ../VERSION ./pysarplus/  # version file
python -m build --sdist

# test
pytest ./tests
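
If pytest cannot import the compiled C++ extension, installing the freshly built package first usually helps (a hedged extra step, assuming the sdist produced above is in dist/):

# install the sdist so the C++ extension gets compiled into the active environment
python -m pip install dist/*.tar.gz
pytest ./tests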

To test the Scala formatter

export SPARK_VERSION=3.2.1
export HADOOP_VERSION=3.3.1
export SCALA_VERSION=2.12.14

cd scala
sbt ++${SCALA_VERSION}! test
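
The same test can also be run against the Spark 3.1 version matrix used in the packaging step (from the scala directory):

export SPARK_VERSION=3.1.2
export HADOOP_VERSION=2.7.4
export SCALA_VERSION=2.12.10
sbt ++${SCALA_VERSION}! test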

Notes for Spark 3.x

The code has now been modified to support Spark 3.x, and has been tested under the Azure Synapse Apache Spark 3.1 runtime and several versions of the Databricks Runtime (including 6.4 Extended Support, 7.3 LTS, 9.1 LTS and 10.4 LTS) on the Azure Databricks Service. However, Spark 3.2 introduces a breaking change in org.apache.spark.sql.execution.datasources.OutputWriter, which adds an extra method path(), so an additional package called Sarplus Spark 3.2 Plus (with a Maven coordinate such as com.microsoft.sarplus:sarplus-spark-3-2-plus_2.12:0.6.6) should be used when running on Spark 3.2+ instead of the regular Sarplus package (with a Maven coordinate such as com.microsoft.sarplus:sarplus_2.12:0.6.6).
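
For example, when attaching Sarplus to a local Spark shell via --packages (a hedged illustration; on Databricks or Azure Synapse the library is typically attached through the workspace UI instead), the coordinate is chosen according to the runtime's Spark version:

# Spark < 3.2
spark-shell --packages com.microsoft.sarplus:sarplus_2.12:0.6.6

# Spark 3.2+
spark-shell --packages com.microsoft.sarplus:sarplus-spark-3-2-plus_2.12:0.6.6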

In addition to spark.sql.crossJoin.enabled true, extra configurations are required when running on Spark 3.x:

spark.sql.sources.default parquet
spark.sql.legacy.createHiveTableByDefault true
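
A hedged example of passing these settings when starting a local PySpark shell (on Databricks or Azure Synapse they are normally set in the cluster's Spark configuration instead):

# all three settings, including the cross-join flag mentioned above
pyspark \
  --conf spark.sql.crossJoin.enabled=true \
  --conf spark.sql.sources.default=parquet \
  --conf spark.sql.legacy.createHiveTableByDefault=true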