
Add PySparkModelArtifact to support Spark MLlib #957

Closed

Conversation

joshuacwnewton
Contributor

@joshuacwnewton joshuacwnewton commented Aug 5, 2020

Description

Adds:

  • PySparkModelArtifact in bentoml/artifact/pyspark_model_artifact.py
  • Corresponding entry to bentoml/artifact/__init__.py
  • Example service (PySparkClassifier) in tests/bento_service_examples
  • Integration tests for example service
    • On its own
    • After having been saved and loaded
    • As a REST API server
    • As a containerized Docker API server (Note: unsure how to handle Spark dependencies for now)
  • An update to .travis.yml to properly handle Spark dependencies

Requesting code review to discuss design choices. This is my first time using Spark/PySpark, so any comments about idiomatic Spark code are much appreciated! (Also, PySparkModelArtifact contains TODOs referencing some of the design details in #666 (comment).)

Thanks much! 😄
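For reviewers skimming the design discussion, the core save/load flow of a Spark model artifact can be sketched roughly as below. This is a hypothetical sketch rather than the code in this PR: the method names mirror BentoML's artifact API as I understand it, and the deferred PipelineModel import is an assumption about one way loading could work.

```python
import os


class PySparkModelArtifact:
    """Sketch: persists a fitted Spark MLlib model using the model's own
    save()/load() methods (MLWritable/MLReadable). Hypothetical, not the
    merged implementation."""

    def __init__(self, name):
        self.name = name
        self._model = None

    def _file_path(self, base_path):
        # Spark models are saved as directories, not single files
        return os.path.join(base_path, self.name)

    def pack(self, model):
        self._model = model
        return self

    def save(self, dst):
        # MLlib models expose .save(path) via MLWritable
        self._model.save(self._file_path(dst))

    def load(self, path):
        # Loading needs a concrete reader class; PipelineModel is one
        # assumption -- a bare estimator model would need its own class,
        # e.g. pyspark.ml.classification.LogisticRegressionModel.load(path)
        from pyspark.ml.pipeline import PipelineModel  # deferred import

        return self.pack(PipelineModel.load(self._file_path(path)))
```

One open design question this sketch surfaces is how to recover the concrete model class at load time, which is part of what the TODOs in the PR reference.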

Motivation and Context

Fixes #666.

How Has This Been Tested?

Tests included in this PR were run in the following environment:

Spark installation (console output):
mlh-dev@pop-os:~$ spark-shell
20/08/05 09:44:49 WARN Utils: Your hostname, pop-os resolves to a loopback address: 127.0.1.1; using 192.168.1.243 instead (on interface wlp2s0)
20/08/05 09:44:49 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.0.0.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
20/08/05 09:44:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://pop-os.lan:4040
Spark context available as 'sc' (master = local[*], app id = local-1596645894219).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0
      /_/
         
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.8)
Environment packages (conda list output):
(BentoML) mlh-dev@pop-os:~/PycharmProjects/BentoML$ conda list
# packages in environment at /home/mlh-dev/miniconda3/envs/BentoML:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
aiohttp                   3.6.2                    pypi_0    pypi
alembic                   1.4.2                    pypi_0    pypi
appdirs                   1.4.4                    pypi_0    pypi
arrow                     0.15.8                   pypi_0    pypi
astroid                   2.4.2                    pypi_0    pypi
async-timeout             3.0.1                    pypi_0    pypi
attrs                     19.3.0                   pypi_0    pypi
aws-lambda-builders       0.6.0                    pypi_0    pypi
aws-sam-cli               0.33.1                   pypi_0    pypi
aws-sam-translator        1.15.1                   pypi_0    pypi
aws-xray-sdk              2.6.0                    pypi_0    pypi
bentoml                   0.8.3+60.g5d10f19.dirty           dev_0    <develop>
binaryornot               0.4.4                    pypi_0    pypi
black                     19.10b0                  pypi_0    pypi
boto                      2.49.0                   pypi_0    pypi
boto3                     1.14.34                  pypi_0    pypi
botocore                  1.17.34                  pypi_0    pypi
ca-certificates           2020.6.24                     0  
cerberus                  1.3.2                    pypi_0    pypi
certifi                   2020.6.20                py37_0  
cffi                      1.14.1                   pypi_0    pypi
cfn-lint                  0.34.1                   pypi_0    pypi
chardet                   3.0.4                    pypi_0    pypi
chevron                   0.13.1                   pypi_0    pypi
click                     7.1.2                    pypi_0    pypi
codecov                   2.1.8                    pypi_0    pypi
configparser              5.0.0                    pypi_0    pypi
cookiecutter              1.6.0                    pypi_0    pypi
coverage                  5.2.1                    pypi_0    pypi
cryptography              3.0                      pypi_0    pypi
dateparser                0.7.6                    pypi_0    pypi
decorator                 4.4.2                    pypi_0    pypi
docker                    4.2.2                    pypi_0    pypi
docutils                  0.15.2                   pypi_0    pypi
ecdsa                     0.14.1                   pypi_0    pypi
findspark                 1.4.2                    pypi_0    pypi
flake8                    3.8.3                    pypi_0    pypi
flask                     1.1.2                    pypi_0    pypi
future                    0.18.2                   pypi_0    pypi
grpcio                    1.27.2                   pypi_0    pypi
gunicorn                  20.0.4                   pypi_0    pypi
humanfriendly             8.2                      pypi_0    pypi
idna                      2.10                     pypi_0    pypi
imageio                   2.9.0                    pypi_0    pypi
importlib-metadata        1.7.0                    pypi_0    pypi
iniconfig                 1.0.1                    pypi_0    pypi
isort                     4.3.21                   pypi_0    pypi
itsdangerous              1.1.0                    pypi_0    pypi
jinja2                    2.11.2                   pypi_0    pypi
jinja2-time               0.2.0                    pypi_0    pypi
jmespath                  0.10.0                   pypi_0    pypi
joblib                    0.16.0                   pypi_0    pypi
jsondiff                  1.1.2                    pypi_0    pypi
jsonpatch                 1.26                     pypi_0    pypi
jsonpickle                1.4.1                    pypi_0    pypi
jsonpointer               2.0                      pypi_0    pypi
jsonschema                3.2.0                    pypi_0    pypi
junit-xml                 1.9                      pypi_0    pypi
lazy-object-proxy         1.4.3                    pypi_0    pypi
ld_impl_linux-64          2.33.1               h53a641e_7  
libedit                   3.1.20191231         h14c3975_1  
libffi                    3.3                  he6710b0_2  
libgcc-ng                 9.1.0                hdf63c60_0  
libstdcxx-ng              9.1.0                hdf63c60_0  
mako                      1.1.3                    pypi_0    pypi
markupsafe                1.1.1                    pypi_0    pypi
mccabe                    0.6.1                    pypi_0    pypi
mock                      4.0.2                    pypi_0    pypi
more-itertools            8.4.0                    pypi_0    pypi
moto                      1.3.14                   pypi_0    pypi
multidict                 4.7.6                    pypi_0    pypi
ncurses                   6.2                  he6710b0_1  
networkx                  2.4                      pypi_0    pypi
numpy                     1.19.1                   pypi_0    pypi
openssl                   1.1.1g               h7b6447c_0  
packaging                 20.4                     pypi_0    pypi
pandas                    1.1.0                    pypi_0    pypi
pathspec                  0.8.0                    pypi_0    pypi
pillow                    7.2.0                    pypi_0    pypi
pip                       20.1.1                   py37_1  
pluggy                    0.13.1                   pypi_0    pypi
ply                       3.11                     pypi_0    pypi
poyo                      0.5.0                    pypi_0    pypi
prometheus-client         0.8.0                    pypi_0    pypi
protobuf                  3.12.4                   pypi_0    pypi
psutil                    5.7.2                    pypi_0    pypi
py                        1.9.0                    pypi_0    pypi
py-zipkin                 0.20.0                   pypi_0    pypi
py4j                      0.10.9                   pypi_0    pypi
pyasn1                    0.4.8                    pypi_0    pypi
pycodestyle               2.6.0                    pypi_0    pypi
pycparser                 2.20                     pypi_0    pypi
pyflakes                  2.2.0                    pypi_0    pypi
pylint                    2.5.3                    pypi_0    pypi
pyparsing                 2.4.7                    pypi_0    pypi
pyrsistent                0.16.0                   pypi_0    pypi
pyspark                   3.0.0                    pypi_0    pypi
pytest                    6.0.1                    pypi_0    pypi
pytest-asyncio            0.14.0                   pypi_0    pypi
pytest-cov                2.10.0                   pypi_0    pypi
pytest-spark              0.6.0                    pypi_0    pypi
python                    3.7.7                hcff3b4d_5  
python-dateutil           2.8.0                    pypi_0    pypi
python-editor             1.0.4                    pypi_0    pypi
python-jose               3.2.0                    pypi_0    pypi
python-json-logger        0.1.11                   pypi_0    pypi
pytz                      2020.1                   pypi_0    pypi
pyyaml                    5.3.1                    pypi_0    pypi
readline                  8.0                  h7b6447c_0  
regex                     2020.7.14                pypi_0    pypi
requests                  2.24.0                   pypi_0    pypi
responses                 0.10.15                  pypi_0    pypi
rsa                       4.6                      pypi_0    pypi
ruamel-yaml               0.16.10                  pypi_0    pypi
ruamel-yaml-clib          0.2.0                    pypi_0    pypi
s3transfer                0.3.3                    pypi_0    pypi
scikit-learn              0.23.2                   pypi_0    pypi
scipy                     1.5.2                    pypi_0    pypi
serverlessrepo            0.1.9                    pypi_0    pypi
setuptools                49.2.0                   py37_0  
six                       1.15.0                   pypi_0    pypi
sqlalchemy                1.3.18                   pypi_0    pypi
sqlalchemy-utils          0.36.8                   pypi_0    pypi
sqlite                    3.32.3               h62c20be_0  
sshpubkeys                3.1.0                    pypi_0    pypi
tabulate                  0.8.7                    pypi_0    pypi
threadpoolctl             2.1.0                    pypi_0    pypi
thriftpy2                 0.4.11                   pypi_0    pypi
tk                        8.6.10               hbc83047_0  
toml                      0.10.1                   pypi_0    pypi
tomlkit                   0.5.8                    pypi_0    pypi
typed-ast                 1.4.1                    pypi_0    pypi
typing-extensions         3.7.4.2                  pypi_0    pypi
tzlocal                   2.1                      pypi_0    pypi
urllib3                   1.25.10                  pypi_0    pypi
websocket-client          0.57.0                   pypi_0    pypi
werkzeug                  1.0.1                    pypi_0    pypi
wheel                     0.34.2                   py37_0  
whichcraft                0.6.1                    pypi_0    pypi
wrapt                     1.12.1                   pypi_0    pypi
xmltodict                 0.12.0                   pypi_0    pypi
xz                        5.2.5                h7b6447c_0  
yarl                      1.5.1                    pypi_0    pypi
zipp                      3.1.0                    pypi_0    pypi
zlib                      1.2.11               h7b6447c_3  

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature and improvements (non-breaking change which adds/improves functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Code Refactoring (internal change which is not user facing)
  • Documentation
  • Test, CI, or build

Component(s) if applicable

  • BentoService (service definition, dependency management, API input/output adapters)
  • Model Artifact (model serialization, multi-framework support)
  • Model Server (micro-batching, dockerisation, logging, OpenAPI, instrumentation)
  • YataiService gRPC server (model registry, cloud deployment automation)
  • YataiService web server (nodejs HTTP server and web UI)
  • Internal (BentoML's own configuration, logging, utility, exception handling)
  • BentoML CLI

Checklist:

  • My code follows the bentoml code style, both ./dev/format.sh and
    ./dev/lint.sh script have passed
    (instructions).
  • My change reduces project test coverage and requires unit tests to be added
  • I have added unit tests covering my code change
  • My change requires a change to the documentation
  • I have updated the documentation accordingly

Commit messages in this PR:

  • These tests are just demos to ensure that Spark has been installed correctly and PySpark can be used. They are not the proper tests for a PySparkSavedModelArtifact, and should be replaced when further code has been written.
  • Testing is far from complete (verifying prediction from PySpark model); PySparkModelArtifact has many TODOs that need to be addressed.
  • Were only used to sanity-check that PySpark itself was set up correctly.
  • Was presented in API proposal, but I'm unsure of its use. Will discuss with BentoML maintainers when I create a PR.
  • Used to return Pandas DF. Harder to assert values this way, so a NumPy array is returned instead. Also, adapt existing tests to work this way as well.
  • model_class -> ModelClass
  • Remove comment
@joshuacwnewton joshuacwnewton marked this pull request as draft August 6, 2020 23:36
@joshuacwnewton
Contributor Author

joshuacwnewton commented Aug 6, 2020

Disclaimer -- tests in PR only work locally when Spark is installed

PySpark does not work out of the box when installed via pip, because it has external Spark dependencies. I believe Java, Scala, and the Spark JAR files must be present. This affects:

  • Travis (although there are example workarounds, see below)
  • Docker integration (if I understand correctly, an image built from the Dockerfile that BentoML currently generates would be missing these dependencies)

Right now I am looking more closely at the Clipper example to understand how they handled this issue, but I would love to hear maintainer opinions about how dependencies have been handled in other cases (e.g., in relation to Docker image size concerns).
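While this is being sorted out, one stopgap is to fail fast with a clear message instead of a cryptic JVM error when the environment is incomplete. A minimal sketch; the helper name and the heuristics are my own assumptions, not BentoML API:

```python
import os
import shutil


def spark_prereqs_missing():
    """Heuristic check for the JVM-side dependencies PySpark needs.

    Returns a list of likely-missing prerequisites; an empty list means
    the environment looks usable. (Hypothetical helper, not BentoML API.)
    """
    missing = []
    # PySpark launches a JVM via py4j, so a Java runtime must be findable.
    if shutil.which("java") is None and "JAVA_HOME" not in os.environ:
        missing.append("java")
    # A Spark installation must be locatable, either via SPARK_HOME or a
    # spark-submit on PATH (the pip package bundles its own jars, which
    # may or may not suffice for a given deployment target).
    if "SPARK_HOME" not in os.environ and shutil.which("spark-submit") is None:
        missing.append("spark")
    return missing
```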

@Talador12

+1 keep at it @joshuacwnewton

Let me know if there is any way we can help this initiative.

@yubozhao
Contributor

yubozhao commented Aug 20, 2020

@joshuacwnewton one way to include Java (OpenJDK) is via Conda.

We can call env.add_conda_dependencies as part of the artifact's set_dependencies:

class PySparkModelArtifact(BentoServiceArtifact):
    def set_dependencies(self, env: BentoServiceEnv):
        # conda's openjdk package provides the JVM that PySpark requires
        env.add_conda_dependencies(['openjdk'])
        env.add_pip_dependencies_if_missing(['pyspark'])

I did a quick test and it works well.

@joshuacwnewton
Contributor Author

joshuacwnewton commented Aug 21, 2020

My apologies, but my personal situation has changed, and I'm no longer able to work on this feature. 🙁


@yubozhao Good point! That just leaves Spark (and possibly Scala).

@Talador12 The area that most needs help is figuring out how to provide the dependencies PySpark needs. The questions I'm left with are:

  • Scala
    • Is there a way to handle the Scala dependency programmatically (as @yubozhao has done with JDK)?
    • Do we even need to install Scala if we use a prebuilt binary package? See this Spark 3.0.0 installation guide that doesn't even bother with Scala.
  • Spark
    • For Spark, is it enough to simply download a prebuilt binary package, untar it, and set the necessary environment variables (as is done in travis-pytest-spark)? Or, is anything more complex needed?
    • How should we handle cases where a user already has Spark installed? Does the version of Spark change how we handle an existing installation?
    • Is there a lightweight way to incorporate PySpark's dependencies into a BentoML docker image?
  • Does the installation of these dependencies vary between platforms?
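On the travis-pytest-spark question above, a CI setup step along those lines might look like this. A sketch only: the Spark/Hadoop versions and the archive URL are assumptions to verify against the Apache mirrors, and the versions should be pinned to whatever the integration tests target.

```shell
# Sketch of a CI step that installs a prebuilt Spark binary package.
SPARK_VERSION=3.0.0
HADOOP_VERSION=2.7
SPARK_DIR="spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}"

# Download and unpack the prebuilt distribution into the working directory.
curl -sL "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/${SPARK_DIR}.tgz" | tar -xz

# PySpark locates the installation through SPARK_HOME.
export SPARK_HOME="$PWD/${SPARK_DIR}"
export PATH="$SPARK_HOME/bin:$PATH"
```

If this suffices, it would suggest no separate Scala install is needed in CI, since the prebuilt package bundles the Scala runtime jars.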

Sorry I can't be of more help! Best of luck on this.

@yubozhao
Contributor

@joshuacwnewton I am sorry to hear that your situation has changed.

I can't thank you enough for everything you did for this community. It was great to work with you. I am looking forward to your return in the future.

@yubozhao yubozhao added the help-wanted An issue currently lacks a contributor label Aug 24, 2020
@codecov

codecov bot commented Aug 26, 2020

Codecov Report

Merging #957 into master will decrease coverage by 0.25%.
The diff coverage is 59.57%.


@@            Coverage Diff             @@
##           master     #957      +/-   ##
==========================================
- Coverage   62.86%   62.61%   -0.26%     
==========================================
  Files         123      126       +3     
  Lines        8112     8175      +63     
==========================================
+ Hits         5100     5119      +19     
- Misses       3012     3056      +44     
Impacted Files                               Coverage Δ
bentoml/adapters/json_input.py               57.89% <ø> (ø)
bentoml/artifact/pyspark_model_artifact.py   29.62% <29.62%> (ø)
bentoml/artifact/__init__.py                 100.00% <100.00%> (ø)
bentoml/artifact/artifact.py                 92.78% <100.00%> (-2.40%) ⬇️
bentoml/saved_bundle/bundler.py              88.50% <100.00%> (-0.14%) ⬇️
bentoml/service.py                           88.01% <100.00%> (-0.29%) ⬇️
bentoml/saved_bundle/pip_pkg.py              81.29% <0.00%> (-13.24%) ⬇️
bentoml/yatai/deployment/operator.py         48.00% <0.00%> (-7.18%) ⬇️
bentoml/cli/config.py                        31.50% <0.00%> (-5.21%) ⬇️
... and 56 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3633032...3ab2175.

@parano parano mentioned this pull request Dec 23, 2020
@stale

stale bot commented Jan 2, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 2, 2021
@parano parano closed this Jan 2, 2021
Labels: help-wanted (An issue currently lacks a contributor)

Successfully merging this pull request may close these issues: Spark MLlib support

4 participants