Steps to package and publish (also described in `sarplus.yml`):
- Package and publish the pip package. For Databricks to properly install a C++ extension, one must take a detour through PyPI. Use `twine` to upload the package to PyPI (a sanity-check sketch follows this list).

  ```bash
  # build dependencies
  python -m pip install -U build cibuildwheel pip twine

  cd python
  cp ../VERSION ./pysarplus/  # copy version file
  python -m build --sdist

  # build manylinux wheels for CPython 3.6-3.10
  for MINOR_VERSION in {6..10}; do
    CIBW_BUILD="cp3${MINOR_VERSION}-manylinux_x86_64" python -m cibuildwheel --platform linux --output-dir dist
  done

  python -m twine upload dist/*
  ```
- Package the Scala package, which includes the Scala formatter and references the pip package (a sanity-check sketch follows this list).

  ```bash
  export SARPLUS_VERSION=$(cat VERSION)
  GPG_KEY="<gpg-private-key>"
  GPG_KEY_ID="<gpg-key-id>"

  cd scala

  # generate artifacts
  export SPARK_VERSION="3.1.2"
  export HADOOP_VERSION="2.7.4"
  export SCALA_VERSION="2.12.10"
  sbt ++${SCALA_VERSION}! package packageDoc packageSrc makePom

  # generate the artifact (sarplus-spark-3-2-plus*.jar) for Spark 3.2+
  export SPARK_VERSION="3.2.1"
  export HADOOP_VERSION="3.3.1"
  export SCALA_VERSION="2.12.14"
  sbt ++${SCALA_VERSION}! package packageDoc packageSrc makePom

  # sign with GPG
  cd target/scala-${SCALA_VERSION%.*}
  gpg --import <(cat <<< "${GPG_KEY}")
  for file in {*.jar,*.pom}; do gpg -ab -u "${GPG_KEY_ID}" "${file}"; done

  # bundle
  jar cvf sarplus-bundle_2.12-${SARPLUS_VERSION}.jar sarplus_*.jar sarplus_*.pom sarplus_*.asc
  jar cvf sarplus-spark-3.2-plus-bundle_2.12-${SARPLUS_VERSION}.jar sarplus-spark*.jar sarplus-spark*.pom sarplus-spark*.asc
  ```
  where `SPARK_VERSION`, `HADOOP_VERSION` and `SCALA_VERSION` should be customized as needed.

- Upload the zipped Scala package bundle to Nexus Repository Manager through a browser (see the publish manual).
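Before uploading, it can help to sanity-check the freshly built artifacts. The sketch below assumes both blocks above were run from the repository root, that `SARPLUS_VERSION` is still exported, and that a CPython 3.10 wheel landed in `python/dist`; the wildcard paths are illustrative, not part of the build scripts.

```bash
# install one of the freshly built wheels into a scratch environment and import it
python -m pip install python/dist/pysarplus-*cp310*manylinux*.whl
python -c "import pysarplus; print(pysarplus.__file__)"

# verify the detached GPG signatures and inspect a bundle before upload
cd scala/target/scala-2.12
for sig in *.asc; do gpg --verify "${sig}"; done
jar tf sarplus-bundle_2.12-${SARPLUS_VERSION}.jar
```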
To test the Python UDF + C++ backend:

```bash
# dependencies
python -m pip install -U build pip twine
python -m pip install -U flake8 pytest pytest-cov scikit-learn

# build
cd python
cp ../VERSION ./pysarplus/  # version file
python -m build --sdist

# test
pytest ./tests
```
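Note that the tests exercise the compiled C++ extension, so if `pytest` cannot import it, one option (a sketch, assuming the sdist from the build step landed in `dist/`) is to install the freshly built distribution first:

```bash
# install the just-built source distribution, compiling the extension in the process
python -m pip install dist/pysarplus-*.tar.gz
pytest ./tests
```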
To test the Scala formatter:

```bash
export SPARK_VERSION=3.2.1
export HADOOP_VERSION=3.3.1
export SCALA_VERSION=2.12.14

cd scala
sbt ++${SCALA_VERSION}! test
```
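The same suite can also be run against the pre-3.2 toolchain by swapping in the version triple used for the plain Sarplus artifact in the packaging step above:

```bash
# versions matching the non-"3.2 plus" artifact from the packaging step
export SPARK_VERSION=3.1.2
export HADOOP_VERSION=2.7.4
export SCALA_VERSION=2.12.10

cd scala
sbt ++${SCALA_VERSION}! test
```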
The code has been modified to support Spark 3.x, and has been tested under the Azure Synapse Apache Spark 3.1 runtime and different versions of the Databricks Runtime (including 6.4 Extended Support, 7.3 LTS, 9.1 LTS and 10.4 LTS) on the Azure Databricks Service. However, Spark 3.2 introduces a breaking change in `org.apache.spark.sql.execution.datasources.OutputWriter`, which adds an extra `path()` function. An additional package called Sarplus Spark 3.2 Plus (with a Maven coordinate such as `com.microsoft.sarplus:sarplus-spark-3-2-plus_2.12:0.6.6`) should therefore be used when running on Spark 3.2+, instead of Sarplus (with a Maven coordinate like `com.microsoft.sarplus:sarplus_2.12:0.6.6`).
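For example, when attaching the package to a session with `--packages`, the coordinate depends on the Spark version; a minimal sketch, using the 0.6.6 coordinates quoted above:

```bash
# Spark 3.1 and below
spark-shell --packages com.microsoft.sarplus:sarplus_2.12:0.6.6

# Spark 3.2+
spark-shell --packages com.microsoft.sarplus:sarplus-spark-3-2-plus_2.12:0.6.6
```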
In addition to `spark.sql.crossJoin.enabled true`, extra configurations are required when running on Spark 3.x:

```
spark.sql.sources.default parquet
spark.sql.legacy.createHiveTableByDefault true
```
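Where cluster-level configuration is not available, the same settings can be passed on the command line; a minimal sketch, in which `app.py` is a hypothetical application script:

```bash
spark-submit \
  --packages com.microsoft.sarplus:sarplus_2.12:0.6.6 \
  --conf spark.sql.crossJoin.enabled=true \
  --conf spark.sql.sources.default=parquet \
  --conf spark.sql.legacy.createHiveTableByDefault=true \
  app.py
```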