
Add PySparkModelArtifact to support Spark MLlib #957

Closed

Conversation

joshuacwnewton
Contributor

@joshuacwnewton joshuacwnewton commented Aug 5, 2020

Description

Adds:

  • PySparkModelArtifact in bentoml/artifact/pyspark_model_artifact.py
  • Corresponding entry to bentoml/artifact/__init__.py
  • Example service (PySparkClassifier) in tests/bento_service_examples
  • Integration tests for example service
    • On its own
    • After having been saved and loaded
    • As a REST API server
    • As a containerized Docker API server (Note: unsure how to handle Spark dependencies for now)
  • An update to .travis.yml to properly handle Spark dependencies

Requesting code review to discuss design choices. This is my first time using Spark/PySpark, so any comments about idiomatic Spark code are much appreciated! (Also, PySparkModelArtifact contains TODOs referencing some of the design details in #666 (comment).)

Thanks much! 😄
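For reviewers skimming the design discussion, the core save/load flow of a Spark model artifact can be sketched roughly as below. This is a hypothetical sketch rather than the code in this PR: the method names mirror BentoML's artifact API as I understand it, and the deferred PipelineModel import is an assumption about one way loading could work.

```python
import os


class PySparkModelArtifact:
    """Sketch: persists a fitted Spark MLlib model using the model's own
    save()/load() methods (MLWritable/MLReadable). Hypothetical, not the
    merged implementation."""

    def __init__(self, name):
        self.name = name
        self._model = None

    def _file_path(self, base_path):
        # Spark models are saved as directories, not single files
        return os.path.join(base_path, self.name)

    def pack(self, model):
        self._model = model
        return self

    def save(self, dst):
        # MLlib models expose .save(path) via MLWritable
        self._model.save(self._file_path(dst))

    def load(self, path):
        # Loading needs a concrete reader class; PipelineModel is one
        # assumption -- a bare estimator model would need its own class,
        # e.g. pyspark.ml.classification.LogisticRegressionModel.load(path)
        from pyspark.ml.pipeline import PipelineModel  # deferred import

        return self.pack(PipelineModel.load(self._file_path(path)))
```

One open design question this sketch surfaces is how to recover the concrete model class at load time, which is part of what the TODOs in the PR reference.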

Motivation and Context

Fixes #666.

How Has This Been Tested?

Tests included in this PR were run in the following environment:

Spark installation (console output):
mlh-dev@pop-os:~$ spark-shell
20/08/05 09:44:49 WARN Utils: Your hostname, pop-os resolves to a loopback address: 127.0.1.1; using 192.168.1.243 instead (on interface wlp2s0)
20/08/05 09:44:49 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.0.0.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
20/08/05 09:44:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://pop-os.lan:4040
Spark context available as 'sc' (master = local[*], app id = local-1596645894219).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0
      /_/
         
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.8)
Environment packages (conda list output):
(BentoML) mlh-dev@pop-os:~/PycharmProjects/BentoML$ conda list
# packages in environment at /home/mlh-dev/miniconda3/envs/BentoML:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
aiohttp                   3.6.2                    pypi_0    pypi
alembic                   1.4.2                    pypi_0    pypi
appdirs                   1.4.4                    pypi_0    pypi
arrow                     0.15.8                   pypi_0    pypi
astroid                   2.4.2                    pypi_0    pypi
async-timeout             3.0.1                    pypi_0    pypi
attrs                     19.3.0                   pypi_0    pypi
aws-lambda-builders       0.6.0                    pypi_0    pypi
aws-sam-cli               0.33.1                   pypi_0    pypi
aws-sam-translator        1.15.1                   pypi_0    pypi
aws-xray-sdk              2.6.0                    pypi_0    pypi
bentoml                   0.8.3+60.g5d10f19.dirty           dev_0    <develop>
binaryornot               0.4.4                    pypi_0    pypi
black                     19.10b0                  pypi_0    pypi
boto                      2.49.0                   pypi_0    pypi
boto3                     1.14.34                  pypi_0    pypi
botocore                  1.17.34                  pypi_0    pypi
ca-certificates           2020.6.24                     0  
cerberus                  1.3.2                    pypi_0    pypi
certifi                   2020.6.20                py37_0  
cffi                      1.14.1                   pypi_0    pypi
cfn-lint                  0.34.1                   pypi_0    pypi
chardet                   3.0.4                    pypi_0    pypi
chevron                   0.13.1                   pypi_0    pypi
click                     7.1.2                    pypi_0    pypi
codecov                   2.1.8                    pypi_0    pypi
configparser              5.0.0                    pypi_0    pypi
cookiecutter              1.6.0                    pypi_0    pypi
coverage                  5.2.1                    pypi_0    pypi
cryptography              3.0                      pypi_0    pypi
dateparser                0.7.6                    pypi_0    pypi
decorator                 4.4.2                    pypi_0    pypi
docker                    4.2.2                    pypi_0    pypi
docutils                  0.15.2                   pypi_0    pypi
ecdsa                     0.14.1                   pypi_0    pypi
findspark                 1.4.2                    pypi_0    pypi
flake8                    3.8.3                    pypi_0    pypi
flask                     1.1.2                    pypi_0    pypi
future                    0.18.2                   pypi_0    pypi
grpcio                    1.27.2                   pypi_0    pypi
gunicorn                  20.0.4                   pypi_0    pypi
humanfriendly             8.2                      pypi_0    pypi
idna                      2.10                     pypi_0    pypi
imageio                   2.9.0                    pypi_0    pypi
importlib-metadata        1.7.0                    pypi_0    pypi
iniconfig                 1.0.1                    pypi_0    pypi
isort                     4.3.21                   pypi_0    pypi
itsdangerous              1.1.0                    pypi_0    pypi
jinja2                    2.11.2                   pypi_0    pypi
jinja2-time               0.2.0                    pypi_0    pypi
jmespath                  0.10.0                   pypi_0    pypi
joblib                    0.16.0                   pypi_0    pypi
jsondiff                  1.1.2                    pypi_0    pypi
jsonpatch                 1.26                     pypi_0    pypi
jsonpickle                1.4.1                    pypi_0    pypi
jsonpointer               2.0                      pypi_0    pypi
jsonschema                3.2.0                    pypi_0    pypi
junit-xml                 1.9                      pypi_0    pypi
lazy-object-proxy         1.4.3                    pypi_0    pypi
ld_impl_linux-64          2.33.1               h53a641e_7  
libedit                   3.1.20191231         h14c3975_1  
libffi                    3.3                  he6710b0_2  
libgcc-ng                 9.1.0                hdf63c60_0  
libstdcxx-ng              9.1.0                hdf63c60_0  
mako                      1.1.3                    pypi_0    pypi
markupsafe                1.1.1                    pypi_0    pypi
mccabe                    0.6.1                    pypi_0    pypi
mock                      4.0.2                    pypi_0    pypi
more-itertools            8.4.0                    pypi_0    pypi
moto                      1.3.14                   pypi_0    pypi
multidict                 4.7.6                    pypi_0    pypi
ncurses                   6.2                  he6710b0_1  
networkx                  2.4                      pypi_0    pypi
numpy                     1.19.1                   pypi_0    pypi
openssl                   1.1.1g               h7b6447c_0  
packaging                 20.4                     pypi_0    pypi
pandas                    1.1.0                    pypi_0    pypi
pathspec                  0.8.0                    pypi_0    pypi
pillow                    7.2.0                    pypi_0    pypi
pip                       20.1.1                   py37_1  
pluggy                    0.13.1                   pypi_0    pypi
ply                       3.11                     pypi_0    pypi
poyo                      0.5.0                    pypi_0    pypi
prometheus-client         0.8.0                    pypi_0    pypi
protobuf                  3.12.4                   pypi_0    pypi
psutil                    5.7.2                    pypi_0    pypi
py                        1.9.0                    pypi_0    pypi
py-zipkin                 0.20.0                   pypi_0    pypi
py4j                      0.10.9                   pypi_0    pypi
pyasn1                    0.4.8                    pypi_0    pypi
pycodestyle               2.6.0                    pypi_0    pypi
pycparser                 2.20                     pypi_0    pypi
pyflakes                  2.2.0                    pypi_0    pypi
pylint                    2.5.3                    pypi_0    pypi
pyparsing                 2.4.7                    pypi_0    pypi
pyrsistent                0.16.0                   pypi_0    pypi
pyspark                   3.0.0                    pypi_0    pypi
pytest                    6.0.1                    pypi_0    pypi
pytest-asyncio            0.14.0                   pypi_0    pypi
pytest-cov                2.10.0                   pypi_0    pypi
pytest-spark              0.6.0                    pypi_0    pypi
python                    3.7.7                hcff3b4d_5  
python-dateutil           2.8.0                    pypi_0    pypi
python-editor             1.0.4                    pypi_0    pypi
python-jose               3.2.0                    pypi_0    pypi
python-json-logger        0.1.11                   pypi_0    pypi
pytz                      2020.1                   pypi_0    pypi
pyyaml                    5.3.1                    pypi_0    pypi
readline                  8.0                  h7b6447c_0  
regex                     2020.7.14                pypi_0    pypi
requests                  2.24.0                   pypi_0    pypi
responses                 0.10.15                  pypi_0    pypi
rsa                       4.6                      pypi_0    pypi
ruamel-yaml               0.16.10                  pypi_0    pypi
ruamel-yaml-clib          0.2.0                    pypi_0    pypi
s3transfer                0.3.3                    pypi_0    pypi
scikit-learn              0.23.2                   pypi_0    pypi
scipy                     1.5.2                    pypi_0    pypi
serverlessrepo            0.1.9                    pypi_0    pypi
setuptools                49.2.0                   py37_0  
six                       1.15.0                   pypi_0    pypi
sqlalchemy                1.3.18                   pypi_0    pypi
sqlalchemy-utils          0.36.8                   pypi_0    pypi
sqlite                    3.32.3               h62c20be_0  
sshpubkeys                3.1.0                    pypi_0    pypi
tabulate                  0.8.7                    pypi_0    pypi
threadpoolctl             2.1.0                    pypi_0    pypi
thriftpy2                 0.4.11                   pypi_0    pypi
tk                        8.6.10               hbc83047_0  
toml                      0.10.1                   pypi_0    pypi
tomlkit                   0.5.8                    pypi_0    pypi
typed-ast                 1.4.1                    pypi_0    pypi
typing-extensions         3.7.4.2                  pypi_0    pypi
tzlocal                   2.1                      pypi_0    pypi
urllib3                   1.25.10                  pypi_0    pypi
websocket-client          0.57.0                   pypi_0    pypi
werkzeug                  1.0.1                    pypi_0    pypi
wheel                     0.34.2                   py37_0  
whichcraft                0.6.1                    pypi_0    pypi
wrapt                     1.12.1                   pypi_0    pypi
xmltodict                 0.12.0                   pypi_0    pypi
xz                        5.2.5                h7b6447c_0  
yarl                      1.5.1                    pypi_0    pypi
zipp                      3.1.0                    pypi_0    pypi
zlib                      1.2.11               h7b6447c_3  

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature and improvements (non-breaking change which adds/improves functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Code Refactoring (internal change which is not user facing)
  • Documentation
  • Test, CI, or build

Component(s) if applicable

  • BentoService (service definition, dependency management, API input/output adapters)
  • Model Artifact (model serialization, multi-framework support)
  • Model Server (micro-batching, dockerisation, logging, OpenAPI, instrumentation)
  • YataiService gRPC server (model registry, cloud deployment automation)
  • YataiService web server (nodejs HTTP server and web UI)
  • Internal (BentoML's own configuration, logging, utility, exception handling)
  • BentoML CLI

Checklist:

  • My code follows the bentoml code style, both ./dev/format.sh and
    ./dev/lint.sh script have passed
    (instructions).
  • My change reduces project test coverage and requires unit tests to be added
  • I have added unit tests covering my code change
  • My change requires a change to the documentation
  • I have updated the documentation accordingly

Commit messages in this PR:

  • These tests are just demos to ensure that Spark has been installed correctly and PySpark can be used. They are not the proper tests for a PySparkSavedModelArtifact, and should be replaced when further code has been written.
  • Testing is far from complete (verifying prediction from PySpark model); PySparkModelArtifact has many TODOs that need to be addressed.
  • Were only used to sanity-check that PySpark itself was set up correctly.
  • Was presented in API proposal, but I'm unsure of its use. Will discuss with BentoML maintainers when I create a PR.
  • Used to return Pandas DF. Harder to assert values this way, so a NumPy array is returned instead. Also, adapt existing tests to work this way as well.
  • model_class -> ModelClass
  • Remove comment
@joshuacwnewton joshuacwnewton marked this pull request as draft August 6, 2020 23:36
@joshuacwnewton
Contributor Author

joshuacwnewton commented Aug 6, 2020

Disclaimer -- tests in PR only work locally when Spark is installed

PySpark does not work out of the box when installed via pip, because it has external Spark dependencies. I believe Java, Scala, and the Spark JAR files must be present. This affects:

  • Travis (although there are example workarounds, see below)
  • Docker integration (if I understand correctly, an image built from the Dockerfile that BentoML currently generates would be missing these dependencies)

Right now I am looking more closely at the Clipper example to understand how they handled this issue, but I would love to hear maintainer opinions about how dependencies have been handled in other cases (e.g., in relation to Docker image size concerns).
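While this is being sorted out, one stopgap is to fail fast with a clear message instead of a cryptic JVM error when the environment is incomplete. A minimal sketch; the helper name and the heuristics are my own assumptions, not BentoML API:

```python
import os
import shutil


def spark_prereqs_missing():
    """Heuristic check for the JVM-side dependencies PySpark needs.

    Returns a list of likely-missing prerequisites; an empty list means
    the environment looks usable. (Hypothetical helper, not BentoML API.)
    """
    missing = []
    # PySpark launches a JVM via py4j, so a Java runtime must be findable.
    if shutil.which("java") is None and "JAVA_HOME" not in os.environ:
        missing.append("java")
    # A Spark installation must be locatable, either via SPARK_HOME or a
    # spark-submit on PATH (the pip package bundles its own jars, which
    # may or may not suffice for a given deployment target).
    if "SPARK_HOME" not in os.environ and shutil.which("spark-submit") is None:
        missing.append("spark")
    return missing
```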

@Talador12

+1 keep at it @joshuacwnewton

Let me know if there is any way we can help this initiative.

@yubozhao
Contributor

yubozhao commented Aug 20, 2020

@joshuacwnewton one way to include Java (OpenJDK) is via Conda.

We can call env.add_conda_dependencies as part of the artifact's set_dependencies:

class PySparkModelArtifact(BentoServiceArtifact):
    def set_dependencies(self, env: BentoServiceEnv):
        # conda's openjdk package provides the JVM that PySpark requires
        env.add_conda_dependencies(['openjdk'])
        env.add_pip_dependencies_if_missing(['pyspark'])

I did a quick test and it works well.

@joshuacwnewton
Contributor Author

joshuacwnewton commented Aug 21, 2020

My apologies, but my personal situation has changed, and I'm no longer able to work on this feature. 🙁


@yubozhao Good point! That just leaves Spark (and possibly Scala).

@Talador12 The area that most needs help is figuring out how to provide the dependencies PySpark needs. The questions I'm left with are:

  • Scala
    • Is there a way to handle the Scala dependency programmatically (as @yubozhao has done with JDK)?
    • Do we even need to install Scala if we use a prebuilt binary package? See this Spark 3.0.0 installation guide that doesn't even bother with Scala.
  • Spark
    • For Spark, is it enough to simply download a prebuilt binary package, untar it, and set the necessary environment variables (as is done in travis-pytest-spark)? Or, is anything more complex needed?
    • How should we handle cases where a user already has Spark installed? Does the version of Spark change how we handle an existing installation?
    • Is there a lightweight way to incorporate PySpark's dependencies into a BentoML docker image?
  • Does the installation of these dependencies vary between platforms?
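On the travis-pytest-spark question above, a CI setup step along those lines might look like this. A sketch only: the Spark/Hadoop versions and the archive URL are assumptions to verify against the Apache mirrors, and the versions should be pinned to whatever the integration tests target.

```shell
# Sketch of a CI step that installs a prebuilt Spark binary package.
SPARK_VERSION=3.0.0
HADOOP_VERSION=2.7
SPARK_DIR="spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}"

# Download and unpack the prebuilt distribution into the working directory.
curl -sL "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/${SPARK_DIR}.tgz" | tar -xz

# PySpark locates the installation through SPARK_HOME.
export SPARK_HOME="$PWD/${SPARK_DIR}"
export PATH="$SPARK_HOME/bin:$PATH"
```

If this suffices, it would suggest no separate Scala install is needed in CI, since the prebuilt package bundles the Scala runtime jars.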

Sorry I can't be of more help! Best of luck on this.

@yubozhao
Contributor

@joshuacwnewton I am sorry to hear that your situation has changed.

I can't thank you enough for everything you did for this community. It was great to work with you. I am looking forward to your return in the future.

@yubozhao yubozhao added the help-wanted An issue currently lacks a contributor label Aug 24, 2020
@codecov

codecov bot commented Aug 26, 2020

Codecov Report

Merging #957 into master will decrease coverage by 0.25%.
The diff coverage is 59.57%.


@@            Coverage Diff             @@
##           master     #957      +/-   ##
==========================================
- Coverage   62.86%   62.61%   -0.26%     
==========================================
  Files         123      126       +3     
  Lines        8112     8175      +63     
==========================================
+ Hits         5100     5119      +19     
- Misses       3012     3056      +44     
Impacted Files                               Coverage Δ
bentoml/adapters/json_input.py               57.89% <ø> (ø)
bentoml/artifact/pyspark_model_artifact.py   29.62% <29.62%> (ø)
bentoml/artifact/__init__.py                 100.00% <100.00%> (ø)
bentoml/artifact/artifact.py                 92.78% <100.00%> (-2.40%) ⬇️
bentoml/saved_bundle/bundler.py              88.50% <100.00%> (-0.14%) ⬇️
bentoml/service.py                           88.01% <100.00%> (-0.29%) ⬇️
bentoml/saved_bundle/pip_pkg.py              81.29% <0.00%> (-13.24%) ⬇️
bentoml/yatai/deployment/operator.py         48.00% <0.00%> (-7.18%) ⬇️
bentoml/cli/config.py                        31.50% <0.00%> (-5.21%) ⬇️
... and 56 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3633032...3ab2175.

@parano parano mentioned this pull request Dec 23, 2020
@stale

stale bot commented Jan 2, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 2, 2021
@parano parano closed this Jan 2, 2021
Labels: help-wanted (An issue currently lacks a contributor)

Successfully merging this pull request may close these issues: Spark MLlib support

4 participants