[python-package] make scikit-learn estimator tags compatible with scikit-learn>=1.6.0dev
#6651
Conversation
Update: The change introduced in scikit-learn/scikit-learn#29677 makes it hard to subclass a sklearn estimator in a codebase while staying compatible with both sklearn < 1.6.0 and sklearn >= 1.6.0. The issue is discussed upstream in scikit-learn, and it looks like a relaxation of the impossibility of having both is being considered.
@vnherdeiro note that it's possible already to support both with this method (scikit-learn/scikit-learn#29677 (comment)); however, the version check and `@available_if` are going to be unnecessary once we merge scikit-learn/scikit-learn#29801.
Correct, I am waiting for that PR to go in to bring back `_more_tags`. Using `@available_if` would require another sklearn import and make the code less readable, I reckon.
Thanks for starting on this @vnherdeiro. I've documented it in an issue: #6653 (and added that to the PR description). Note there that I intentionally put the exact error messages in plain text instead of just referring to them.
Thanks for starting on this! Please see scikit-learn/scikit-learn#29801 (comment):

> The story becomes "If you want to support multiple scikit-learn versions, define both."

I think we should leave `_more_tags()` untouched and add `__sklearn_tags__()`, and have `self.__sklearn_tags__()` call `self._more_tags()` to get its data, so we don't define things like `_xfail_checks` twice.
Do you have time to do that in the next few days? We need to fix this to unblock CI here, so if you don't have time to fix it this week please let me know and I will work on this.
@jameslamb Have just pushed a `__sklearn_tags__` trying a conversion from `_more_tags`. I added a warning for arguments outside the currently-handled scope, to catch a change to the arguments in `_more_tags` (they don't seem to change much, though).
Suggested change: `scikit-learn>=1.16` → `scikit-learn>=1.6.0dev`
Not a maintainer here, but coming from sklearn side. Leaving thoughts hoping it'd help.
Thanks for this.
I've reviewed the dataclasses at https://github.com/scikit-learn/scikit-learn/blob/e2ee93156bd3692722a39130c011eea313628690/sklearn/utils/_tags.py and agree with the choices you've made about how to map the dictionary-formatted values from `_more_tags()` to the dataclass attributes scikit-learn now prefers.
Please see the other comments about simplifying this.
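One way that dict-to-dataclass mapping can look, sketched with stand-in dataclasses so it runs without scikit-learn >= 1.6 installed (the real `Tags` / `InputTags` live in `sklearn/utils/_tags.py` linked above and have more attributes than shown here):

```python
from dataclasses import dataclass, field


# Minimal stand-ins for scikit-learn 1.6's tag dataclasses.
@dataclass
class InputTags:
    allow_nan: bool = False
    sparse: bool = False


@dataclass
class Tags:
    input_tags: InputTags = field(default_factory=InputTags)


def update_tags_from_dict(tags: Tags, tags_dict: dict) -> Tags:
    # Translate old-style dict keys onto the new dataclass attributes.
    if "allow_nan" in tags_dict:
        tags.input_tags.allow_nan = tags_dict["allow_nan"]
    if "X_types" in tags_dict:
        tags.input_tags.sparse = "sparse" in tags_dict["X_types"]
    return tags


tags = update_tags_from_dict(Tags(), {"allow_nan": True, "X_types": ["2darray", "sparse"]})
print(tags.input_tags.allow_nan, tags.input_tags.sparse)  # True True
```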
Co-authored-by: James Lamb <jaylamb20@gmail.com>
@jameslamb have addressed your comments! Thanks for the review!
I'd probably include a test to make sure `X_types` is exactly as is here, so that when somebody changes it in the future in `_more_tags`, the corresponding tags in `__sklearn_tags__` are also changed (and the test itself).
I started looking into this. I'll push commits here adding this test and fixing that.
This is proving to be very challenging to get right, because of LightGBM's MRO:

```shell
python -c "import lightgbm; print(lightgbm.LGBMRegressor.__mro__)"
# (<class 'lightgbm.sklearn.LGBMRegressor'>,
#  <class 'sklearn.base.RegressorMixin'>,
#  <class 'lightgbm.sklearn.LGBMModel'>,
#  <class 'sklearn.base.BaseEstimator'>,
#  <class 'sklearn.utils._estimator_html_repr._HTMLDocumentationLinkMixin'>,
#  <class 'sklearn.utils._metadata_requests._MetadataRequester'>,
#  <class 'object'>)
```

(we do that intentionally, following the advice from "BaseEstimator and mixins" at https://scikit-learn.org/stable/developers/develop.html#rolling-your-own-estimator)

I'm finding it difficult to preserve the LightGBM-specific changes that we want (that @vnherdeiro has implemented here) without them being overwritten by the mixins' defaults.

Will come back to this tomorrow, when I can, and will try to put together a clear reproducible example. The amount of indirection here means that'll take a bit more time than I have today.
.ci/test.sh (Outdated)

```diff
@@ -103,6 +103,7 @@ if [[ $TASK == "lint" ]]; then
     'mypy>=1.11.1' \
     'pre-commit>=3.8.0' \
     'pyarrow-core>=17.0' \
+    'scikit-learn>=1.15.0' \
```
This is to ensure that `mypy` checks `scikit-learn` imports. Extra important now that I'm proposing adding an optional type hint on this new `sklearn.utils.Tags`.
```python
"check_n_features_in_after_fitting": (
    "validate_data() was first added in scikit-learn 1.6 and lightgbm "
    "supports much older versions than that"
),
```
On the `1.6.dev` nightlies, `scikit-learn` is raising this error:

```
E AssertionError: `LGBMRegressor.predict()` does not check for consistency between input number
E of features with `LGBMRegressor.fit()`, via the `n_features_in_` attribute.
E You might want to use `sklearn.utils.validation.validate_data` instead
E of `check_array` in `LGBMRegressor.fit()` and `LGBMRegressor.predict()`. This can be done
E like the following:
E from sklearn.utils.validation import validate_data
```
We should ignore this check here in LightGBM... `validate_data()` will be added for the first time in `scikit-learn` 1.6:

- https://github.com/scikit-learn/scikit-learn/blob/74a33757c8a8df84d227f28bbc9ec7ae2fb51dea/sklearn/utils/validation.py#L2790
- API: move `BaseEstimator._validate_data` to `utils.validation.validate_data` (scikit-learn/scikit-learn#29696)

We have other mechanisms further down in LightGBM to check shape mismatches between training data and the data provided at scoring time. I'd rather rely on those than take on the complexity of try-catching a call to this new-in-v1.6 `validate_data()` function.
Understand you don't want to use `validate_data` here, but you can still conform to the API with your own tools. You probably also want to make sure you store `n_features_in_` as well, to better imitate sklearn's behavior. I would personally go down the `fixes.py` path though.
> Understand you don't want to use validate_data here, but you can still conform to the API with your own tools.

How could we avoid the `check_n_features_in_after_fitting` check failing without calling `validate_data()`? Could you point to a doc I could reference?

> You probably also want to make sure you store `n_features_in_` as well, to better imitate sklearn's behavior.

We do.
LightGBM/python-package/lightgbm/sklearn.py, lines 1063 to 1068 in 41ba9e8:

```python
@property
def n_features_in_(self) -> int:
    """:obj:`int`: The number of features of fitted model."""
    if not self.__sklearn_is_fitted__():
        raise LGBMNotFittedError("No n_features_in found. Need to call fit beforehand.")
    return self._n_features_in
```
python-package/lightgbm/sklearn.py (Outdated)

```python
def _more_tags(self) -> Dict[str, Any]:
    # handle the case where ClassifierMixin possibly provides _more_tags()
    if callable(getattr(_LGBMClassifierBase, "_more_tags", None)):
        tags = _LGBMClassifierBase._more_tags(self)
```
Proposing all these uses of `{some_class}.{some_method}` instead of `super().{some_method}` because we follow this advice from `scikit-learn`'s docs (https://scikit-learn.org/stable/developers/develop.html#rolling-your-own-estimator):

> ...mixins should be "on the left" while the `BaseEstimator` should be "on the right" in the inheritance list for proper MRO.

Using `super()` would get the `_more_tags()` / `__sklearn_tags__()` from e.g. `sklearn.base.RegressorMixin`, but we want to use LightGBM's specific tags.
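A toy demonstration of that MRO point, with stand-in class names (`Mixin` playing the role of `RegressorMixin`, `Model` playing `LGBMModel`):

```python
# With the mixin "on the left", super() from the subclass resolves to
# the mixin's method, not Model's, so an explicit Model.tags(self)
# call is needed to reach Model's version.
class Mixin:
    def tags(self):
        return "mixin tags"


class Model:
    def tags(self):
        return "lightgbm tags"


class Estimator(Mixin, Model):
    def via_super(self):
        return super().tags()  # MRO picks Mixin first

    def via_explicit_class(self):
        return Model.tags(self)  # bypasses the MRO


est = Estimator()
print(est.via_super())           # mixin tags
print(est.via_explicit_class())  # lightgbm tags
```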
I've pushed commits here adding testing and ensuring that the LightGBM-specific tags are preserved. Since I've added so much code to this, my review should not count towards a merge. @StrikerRUS or @jmoralez could you please review whenever you have time? And of course @adrinjalali we'd welcome your feedback if you have time/interest. It's been great having you here helping us adapt so far!
Something that's happening here is that you're adding complexity in a few places to handle different dependency versions. This is quite a common pattern; we have it whenever we have dependencies and support multiple versions.

What we tend to do instead is to have a `utils/fixes.py` kind of thing, where we put all version-dependent code, and we only call those methods / import from there. That means we mostly have only one file to look at when we upgrade minimum dependency versions.

These are two examples:
```python
# _LGBMModelBase.__sklearn_tags__() cannot be called unconditionally,
# because that method isn't defined for scikit-learn<1.6
if not callable(getattr(_LGBMModelBase, "__sklearn_tags__", None)):
    return None
```
I would personally prefer using an `available_if` here, since this logic is no less complicated than having that one. But this works too. However, maybe raising an `AttributeError` would be better? This method doesn't need to exist in older sklearn versions.
Do you mean `sklearn.utils.available_if`?

I prefer this method with `getattr()` that doesn't take on another `sklearn` import that could possibly be moved or changed in future versions.
```python
return self._update_sklearn_tags_from_dict(
    tags=_LGBMModelBase.__sklearn_tags__(self),
    tags_dict=self._more_tags(),
)
```
Wondering why you're not getting it through `super()` to let the MRO decide?
I explained this here: #6651 (comment)
python-package/lightgbm/sklearn.py (Outdated)

```python
def _more_tags(self) -> Dict[str, Any]:
    # handle the case where ClassifierMixin possibly provides _more_tags()
    if callable(getattr(_LGBMClassifierBase, "_more_tags", None)):
        tags = _LGBMClassifierBase._more_tags(self)
```
`_more_tags` shouldn't care about other classes and the MRO; it should only return what it wants to add, so I'm not sure why this complexity here is needed.

Also, interesting that Classifier tags are needed in the Regressor class.
> _more_tags shouldn't care about other classes and the MRO

This is explained here: #6651 (comment)

I'm not confident that `RegressorMixin` / `ClassifierMixin` won't add a `_more_tags()`, and I don't want those to silently override LightGBM's preferred tags because we follow `scikit-learn`'s advice to put mixins first in the MRO.

> Also, interesting that Classifier tags are needed in the Regressor class

Thank you! This was a copy-paste mistake. Fixed in d1915c0. It wasn't caught by tests because the tags for `LGBMClassifier`, `LGBMRegressor`, and `LGBMRanker` happen to be the same today.
Thank you all for working on this PR!

Generally LGTM, except some quite minor comments below:
```python
# sklearn.utils.Tags can be imported unconditionally once
# lightgbm's minimum scikit-learn version is 1.6 or higher
try:
    from sklearn.utils import Tags as _sklearn_Tags
except ImportError:
    _sklearn_Tags = None
```
I think this piece of code should go to `compat.py`.
```python
"check_n_features_in_after_fitting": (
    "validate_data() was first added in scikit-learn 1.6 and lightgbm "
    "supports much older versions than that"
```
I think we should add in this comment that LightGBM supports `predict_disable_shape_check=True` and we won't call `validate_data()` even after the minimum sklearn version bump.
```python
def __sklearn_tags__(self) -> Optional["_sklearn_Tags"]:
    return LGBMModel.__sklearn_tags__(self)
```
We need `__sklearn_tags__()` in `LGBMRegressor` and `LGBMClassifier` due to MRO again, right?
Fixes #6653.

Trying to fix the latest CI job. Sklearn 1.6.0 dev deprecates `BaseEstimator._more_tags()` in favor of `__sklearn_tags__`; see https://scikit-learn.org/dev/whats_new/v1.6.html and scikit-learn/scikit-learn#29677.