Packaging

Steps to package and publish (also described in sarplus.yml):

  1. Package and publish the pip package. For Databricks to install a C++ extension properly, one must take a detour through PyPI. Use twine to upload the package to PyPI.

    # build dependencies
    python -m pip install -U build cibuildwheel pip twine
    
    cd python
    cp ../VERSION ./pysarplus/  # copy version file
    python -m build --sdist
    # build manylinux wheels for Python 3.6-3.10
    for MINOR_VERSION in {6..10}; do
      CIBW_BUILD="cp3${MINOR_VERSION}-manylinux_x86_64" python -m cibuildwheel --platform linux --output-dir dist
    done
    python -m twine upload dist/*
  2. Package the Scala package, which includes the Scala formatter and references the pip package.

    export SARPLUS_VERSION=$(cat VERSION)
    GPG_KEY="<gpg-private-key>"
    GPG_KEY_ID="<gpg-key-id>"
    cd scala
    
    # generate artifacts
    export SPARK_VERSION="3.1.2"
    export HADOOP_VERSION="2.7.4"
    export SCALA_VERSION="2.12.10"
    sbt ++${SCALA_VERSION}! package packageDoc packageSrc makePom
    
    # generate the artifact (sarplus-spark-3-2-plus*.jar) for Spark 3.2+
    export SPARK_VERSION="3.2.1"
    export HADOOP_VERSION="3.3.1"
    export SCALA_VERSION="2.12.14"
    sbt ++${SCALA_VERSION}! package packageDoc packageSrc makePom
    
    # sign with GPG
    cd target/scala-${SCALA_VERSION%.*}
    gpg --import <(cat <<< "${GPG_KEY}")
    for file in {*.jar,*.pom}; do gpg -ab -u "${GPG_KEY_ID}" "${file}"; done
    
    # bundle
    jar cvf sarplus-bundle_2.12-${SARPLUS_VERSION}.jar sarplus_*.jar sarplus_*.pom sarplus_*.asc
    jar cvf sarplus-spark-3.2-plus-bundle_2.12-${SARPLUS_VERSION}.jar sarplus-spark*.jar sarplus-spark*.pom sarplus-spark*.asc

    where SPARK_VERSION, HADOOP_VERSION and SCALA_VERSION should be customized as needed.

  3. Upload the zipped Scala package bundle to the Nexus Repository Manager through a browser (see the publish manual).
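
As an optional sanity check before the upload, the contents of each bundle produced in step 2 can be listed (run from scala/target/scala-2.12, where the bundles were created); this is only a suggested check, not part of the official publish flow:

# list the signed artifacts packed into each bundle
jar tf sarplus-bundle_2.12-${SARPLUS_VERSION}.jar
jar tf sarplus-spark-3.2-plus-bundle_2.12-${SARPLUS_VERSION}.jar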

Testing

To test the Python UDF + C++ backend

# dependencies
python -m pip install -U build pip twine
python -m pip install -U flake8 pytest pytest-cov scikit-learn

# build
cd python
cp ../VERSION ./pysarplus/  # version file
python -m build --sdist

# test
pytest ./tests
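
If pytest cannot import the compiled C++ extension, installing the freshly built package first usually helps (a hedged extra step, assuming the sdist produced above is in dist/):

# install the sdist so the C++ extension gets compiled into the active environment
python -m pip install dist/*.tar.gz
pytest ./tests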

To test the Scala formatter

export SPARK_VERSION=3.2.1
export HADOOP_VERSION=3.3.1
export SCALA_VERSION=2.12.14

cd scala
sbt ++${SCALA_VERSION}! test
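
The same test can also be run against the Spark 3.1 version matrix used in the packaging step (from the scala directory):

export SPARK_VERSION=3.1.2
export HADOOP_VERSION=2.7.4
export SCALA_VERSION=2.12.10
sbt ++${SCALA_VERSION}! test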

Notes for Spark 3.x

The code has now been modified to support Spark 3.x, and has been tested under the Azure Synapse Apache Spark 3.1 runtime and several versions of the Databricks Runtime (including 6.4 Extended Support, 7.3 LTS, 9.1 LTS and 10.4 LTS) on the Azure Databricks Service. However, Spark 3.2 introduces a breaking change in org.apache.spark.sql.execution.datasources.OutputWriter, which adds an extra method path(), so an additional package called Sarplus Spark 3.2 Plus (with a Maven coordinate such as com.microsoft.sarplus:sarplus-spark-3-2-plus_2.12:0.6.6) should be used when running on Spark 3.2+ instead of the regular Sarplus package (with a Maven coordinate such as com.microsoft.sarplus:sarplus_2.12:0.6.6).
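
For example, when attaching Sarplus to a local Spark shell via --packages (a hedged illustration; on Databricks or Azure Synapse the library is typically attached through the workspace UI instead), the coordinate is chosen according to the runtime's Spark version:

# Spark < 3.2
spark-shell --packages com.microsoft.sarplus:sarplus_2.12:0.6.6

# Spark 3.2+
spark-shell --packages com.microsoft.sarplus:sarplus-spark-3-2-plus_2.12:0.6.6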

In addition to spark.sql.crossJoin.enabled true, extra configurations are required when running on Spark 3.x:

spark.sql.sources.default parquet
spark.sql.legacy.createHiveTableByDefault true
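
A hedged example of passing these settings when starting a local PySpark shell (on Databricks or Azure Synapse they are normally set in the cluster's Spark configuration instead):

# all three settings, including the cross-join flag mentioned above
pyspark \
  --conf spark.sql.crossJoin.enabled=true \
  --conf spark.sql.sources.default=parquet \
  --conf spark.sql.legacy.createHiveTableByDefault=true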