-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Separated pandas and numpy implementations of sklearn. #21803
Separated pandas and numpy implementations of sklearn. #21803
Conversation
R: @robertwb R: @yeandy R: @TheNeuralBit |
Should we remove the NOTE about backwards incompatibility on the numpy implementation? |
Can one of the admins verify this patch? |
1 similar comment
Can one of the admins verify this patch? |
yeah, let's remove the notes about backwards compatability here and everywhere, since the whole point of this is to try and set our API as supported. |
@@ -94,9 +91,6 @@ class SklearnModelHandlerPandas(ModelHandler[pandas.DataFrame, | |||
BaseEstimator]): | |||
""" Implementation of the ModelHandler interface for scikit-learn that | |||
supports pandas dataframes. | |||
|
|||
NOTE: This API and its implementation are under development and | |||
do not provide backward compatibility guarantees. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @robertwb wanted to keep the implementations with named inputs marked experimental
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. We're moving to marking those things that are experimental at a more fine-grained level, and the named inputs should fall into that class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I can leave this note here, but I don't really see support for pandas dataframes as something separate. Sklearn users will want it as much as numpy array support, since sklearn models are built on top of have named inputs I'm not really understanding why we would modify our support for numpy but not sklearn arrays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a question of what is likely to change. I am 99% we'll want to change the way we handle dataframes, not so sure about numpy. We could call it safe and mark both (as long as there's still enough meat in the "non-experimental" portions).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be more specific on why this might change - now that the batching DoFn infrastructure is in, I'd like to make the pandas sklearn implementation leverage it. We'd move to a model where the element type is Beam Row (with schema), and the batch type is a pandas DataFrame. As opposed to the current model where the batch type is a list of single element dataframes.
Once we do that we could pass data from the DataFrame API (under the hood a PCollection[pd.DataFrame]
) directly to RunInference, without having to unbatch it and then batch it back up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah. I agree with this change. Let's do that as part of a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Unfortunately there are issues with Jenkins right now, and it looks like this needs a rebase/merge.
Codecov Report
@@ Coverage Diff @@
## master #21803 +/- ##
==========================================
- Coverage 74.15% 74.13% -0.02%
==========================================
Files 698 698
Lines 92417 92400 -17
==========================================
- Hits 68530 68503 -27
- Misses 22636 22646 +10
Partials 1251 1251
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
Co-authored-by: Brian Hulette <hulettbh@gmail.com>
This ran well locally for me. Have. one more look and merge if it looks fine. |
Could you also run DirectRunner tests locally, since we don't have Jenkins running PostCommits? |
@ryanthompson591 indicated offfline that those tests ( |
Lint PreCommit is running now and failed:
|
Yep fixed. I'm not sure why my link hooks don't verify import order.
…On Mon, Jun 13, 2022, 2:53 PM Brian Hulette ***@***.***> wrote:
Lint PreCommit is running now and failed:
11:12:05 from apache_beam.ml.inference.base import PredictionResult
11:12:05 from apache_beam.ml.inference.base import RunInference
11:12:05 from apache_beam.ml.inference.sklearn_inference import ModelFileType
11:12:05 +from apache_beam.ml.inference.sklearn_inference import SklearnModelHandlerNumpy
11:12:05 from apache_beam.ml.inference.sklearn_inference import SklearnModelHandlerPandas
11:12:05 -from apache_beam.ml.inference.sklearn_inference import SklearnModelHandlerNumpy
11:12:05 from apache_beam.testing.test_pipeline import TestPipeline
11:12:05 from apache_beam.testing.util import assert_that
11:12:05 from apache_beam.testing.util import equal_to
—
Reply to this email directly, view it on GitHub
<#21803 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABS23CRJ4GTJBIBN3BQF2OLVO57SXANCNFSM5YOTLHWA>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
I can help trace the code so you know what to add to your hooks :) The precommit runs the Lines 107 to 120 in d2fb942
Which runs the run_pylint.sh script, which runs isort with this definition: beam/sdks/python/scripts/run_pylint.sh Lines 109 to 110 in d2fb942
|
* separated pandas and numpy implementations * separated pandas and numpy implementations * Update sdks/python/apache_beam/ml/inference/sklearn_inference.py Co-authored-by: Brian Hulette <hulettbh@gmail.com> * fixed unit test * merged to fix conflicts * fixed import order Co-authored-by: Brian Hulette <hulettbh@gmail.com>
Pandas and numpy implementations of sklearn runinference handler are now separate implementations.
For example, of pandas will now call the runinference in this fashion.
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username
).CHANGES.md
with noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.