fix: Use estimator whenever possible to detect the ML task #998

Merged 5 commits on Jan 3, 2025

glemaitre
Member

I think I saw this feedback somewhere and am trying to consolidate it here.
Using the estimator to find the type of ML task is more robust than using the target y.

Here, I change the logic to use the estimator as much as possible. However, if it is not fitted, I fall back on the target to detect which type of classification we are facing. If only y is provided, I use the previous approach, relying only on y.
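
The decision order described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the function name `detect_ml_task`, the task labels, and the `classes_`-based fitted check are assumptions made for the example. Only `is_classifier`, `is_regressor`, `is_clusterer` (scikit-learn 1.6+), and `type_of_target` are actual scikit-learn utilities.

```python
from sklearn.base import is_classifier, is_regressor
from sklearn.utils.multiclass import type_of_target

try:
    # Only available from scikit-learn 1.6.
    from sklearn.base import is_clusterer
except ImportError:
    # Assumed stand-in for older versions; the vendored compat file may differ.
    def is_clusterer(estimator):
        return getattr(estimator, "_estimator_type", None) == "clusterer"

# Hypothetical task labels, chosen for this sketch only.
_TARGET_TO_TASK = {
    "binary": "binary-classification",
    "multiclass": "multiclass-classification",
    "continuous": "regression",
}


def detect_ml_task(y=None, estimator=None):
    """Guess the ML task, trusting the estimator over the target."""
    target_kind = type_of_target(y) if y is not None else None

    # Only y is provided: rely on the target alone.
    if estimator is None:
        return _TARGET_TO_TASK.get(target_kind, "unknown")

    # The estimator type settles clustering and regression directly.
    if is_clusterer(estimator):
        return "clustering"
    if is_regressor(estimator):
        return "regression"

    if is_classifier(estimator):
        # Fitted single-output classifier: count the learned classes.
        if hasattr(estimator, "classes_"):
            n_classes = len(estimator.classes_)
            if n_classes == 2:
                return "binary-classification"
            return "multiclass-classification"
        # Not fitted: fall back on the target to pick the classification type.
        if target_kind in ("binary", "multiclass"):
            return _TARGET_TO_TASK[target_kind]

    return "unknown"
```

With this ordering, a regressor paired with an integer-valued target such as `[0, 1, 0, 1]` is still reported as regression, which is the kind of case where relying on y alone would be misleading.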

I added a test file.

NB: is_clusterer is only available from sklearn 1.6. I vendor a file, _sklearn_compat.py, that contains utilities to keep developer tools working from sklearn 1.2 to 1.6. While some of them are not useful right now, I prefer to vendor the file completely. The package itself is developed and tested here: https://github.com/sklearn-compat/sklearn-compat.

@glemaitre glemaitre mentioned this pull request Dec 23, 2024
19 tasks
@thomass-dev thomass-dev force-pushed the main branch 7 times, most recently from ad7922d to 469d1e5 Compare December 31, 2024 14:42
@thomass-dev thomass-dev self-requested a review January 3, 2025 14:19
Collaborator

@thomass-dev thomass-dev left a comment

Thanks for this improvement :)

skore/src/skore/sklearn/_sklearn_compat.py (outdated, resolved)
skore/src/skore/sklearn/find_ml_task.py (outdated, resolved)
skore/src/skore/sklearn/find_ml_task.py (resolved)
@thomass-dev thomass-dev merged commit b7bf3ea into probabl-ai:main Jan 3, 2025
19 checks passed
thomass-dev pushed a commit that referenced this pull request Jan 10, 2025
closes #834

Investigate an API for an `EstimatorReport`.

#### TODO

- [x] Metrics
  - [x] handle string metrics as specified in the accessor
  - [x] handle callable metrics
  - [x] handle scikit-learn scorers
  - [x] use the cache as efficiently as possible
  - [x] add testing for all of those features
- [x] allow passing a new validation set to functions instead of using the
internal validation set
  - [x] add a proper help and rich `__repr__`
- [x] Plots
  - [x] add the roc curve display
  - [x] add the precision recall curve display
  - [x] add prediction error display for regressor
  - [x] make proper testing for those displays
  - [x] add a proper `__repr__` for those displays
- [x] Documentation 
- [x] (done for the checked part) add an example to showcase all the
different features
- [x] find a way to show the accessors documentation in the page of
`EstimatorReport`. It could be a bit tricky because they are only
defined once the instance is created.
- We need to have a look at the `series.rst` page from pandas to see how
they document this sort of pattern.
- [x] check the autocompletion: when typing `report.metrics.->tab` it
should provide the autocompletion. **edit**: having a stub file
actually works. I prefer this to type hints directly in the file.
- Open questions
  - [x] we use hashing to retrieve external set.
- use the caching for the external validation set? To make it work we
need to compute the hash of potentially big arrays. This might be more
costly than making the model predict.

#### Notes

This PR builds upon:
- #962 to reuse the
`skore.console`
- #998 to be able to detect
clusterer in a consistent manner.
2 participants