Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: As a user, i want to retrieve a CrossValidationReporter from a project #1045

Closed
MarieS-WiMLDS opened this issue Jan 3, 2025 · 19 comments · Fixed by #1052
Closed

feat: As a user, i want to retrieve a CrossValidationReporter from a project #1045

MarieS-WiMLDS opened this issue Jan 3, 2025 · 19 comments · Fixed by #1052
Assignees
Labels
enhancement New feature or request
Milestone

Comments

@MarieS-WiMLDS
Copy link
Contributor

MarieS-WiMLDS commented Jan 3, 2025

Is your feature request related to a problem? Please describe.

(I know it's not really a bug, but I wanted to highlight how weird it is for the user).

# %%
from skore import CrossValidationReporter

# %% 
rep = CrossValidationReporter(clf, X, y)
project.put("rep_cv", rep)
rep # outputs screenshot 1

# %%
rep_get = project.get("rep_cv")
rep_get # outputs screenshot 2

# %% 
rep_get_item = project.get_item("rep_cv")
rep_get_item # outputs screenshot 3

Screen 1:
image

Screen 2:
image

Screen 3:
image

This is highly inconsistant, and weird to the user. How do I know how to manipulate each of these?

Describe the solution you'd like

The solution to me is two folds:

  • get and get_item should render a CrossValidationReporter. If I close my notebook, I want to be able to get my cv_reporter, and still use the interesting features such as cv_rep.plots.scores.
  • having the display in the notebook is really nice! Let's keep it, but with a different function. It could be something like skore.prettify(reporter)

Describe alternatives you've considered, if relevant

Remove completely the get_item function. It's in this discussion: #990. Actually, I still think we should do it :).

Additional context

The solution proposed would also have the advantage to not loose information, while today put/get does.

@MarieS-WiMLDS MarieS-WiMLDS added the bug Something isn't working label Jan 3, 2025
@thomass-dev
Copy link
Collaborator

thomass-dev commented Jan 3, 2025

It's not a bug (as you said), it's a question of design.

In the current:

  • get returns the original data when it is possible, or a transformed data close to it,
  • get_item returns the Item, i.e. the object that contains your (transformed) data that can be persisted.

Currently and in state, after you have put your object in a project, you have zero guarantee that you can programmatically retrieve your original data (for example, you can't retrieve your CrossValidationReporter).

I know it's an opened and actual question, we should think about it.
I removed the bug label and changed the title, because it's not a bug and involves more thinking.

@thomass-dev thomass-dev removed the bug Something isn't working label Jan 3, 2025
@thomass-dev thomass-dev changed the title bug: inconsistance feat: As a user, i want to retrieve a CrossValidationReporter from a project Jan 3, 2025
@thomass-dev thomass-dev added the enhancement New feature or request label Jan 3, 2025
@glemaitre
Copy link
Member

I think it is related with #949

With my user perspective, I would expect to write something like:

# %%
import skore
my_project = skore.create("my_project", working_dir="/tmp")

# %%
from sklearn import datasets, linear_model
X, y = datasets.load_diabetes(return_X_y=True)
lasso = linear_model.Lasso()

# %%
reporter = my_project.CrossValidationReporter(lasso, X, y, cv=3)

What the reasoning behind this code as a user:

  • I don't need to use put or get
  • I automatically get the output of the reporter
  • I expect the UI to have register (or put) the item reporter in the project for me
  • I expect the activity feed to show me progress
  • I expect the UI to allow to play with the item

@tuscland
Copy link
Member

tuscland commented Jan 7, 2025

@glemaitre if you expect side-effects when you call my_project.CrossValidationReporter, should it be a verb instead? It looks like a factory.

@glemaitre
Copy link
Member

For me, I'm comfortable having the symmetry:

skore.CrossValidationReporter(...)
my_project.CrossValidationReporter(...)

with the exact same API. I can easily understand that when using the project, everything should get in there magically.

@MarieS-WiMLDS
Copy link
Contributor Author

MarieS-WiMLDS commented Jan 7, 2025

Rationale

We should, as much as possible, have a symmetry between put and get. In particular for the objects we create ourselves.
However, we can't have symmetry for everything. For the other things, we use pickle.
We will improve on the way if we can to make extend the supported types.
It's still unclear if we really have to store everything: what we can't track & analyze, is there a purpose for us to store?

key takeaways

  • Officially supported types for optimized storage and presentation
  • What you put is what you get (WYPIWYG)
  • Adopt Arrow for DataFrame and Series format
  • String can be hinted with display_as for alternative displays
  • Items are now private

Implementation

Supported types by skore

  1. What you put is what you get
  2. If the object is not a supported type, save as pickle
  • Pillow image

    • internally saved as raster image
    • same for display
  • Plotly

    • internally saved as JSON
    • same for display
  • Matplotlib

    • internally saved as pickle
    • displayed as SVG
  • JSON

    • internally saved as JSON
    • supports via a strict mapping to Python

    Case for strings:

    • Accept a display_as kwarg to modulate how the string is processed at display time
      display_as="html"
      display_as="markdown"
      display_as="svg"
  • DataFrames and Series

    • internally saved as arrow
    • Supported implementations: Pandas, Polars
  • Numpy Arrays

    • internally saved as npy allow_picke=false
  • Scikit-learn "compatible" estimator

    • internally saved as skops
    • displayed a HTML
  • Skore types

    • a composition of supported types
  • Not above

    • fallback to pickle, and raise a warning
    • display as str(object)

TODO

  • change the way data are stored
  • make items API private
  • make views API private

Sorry, something went wrong.

@tuscland tuscland added this to the Skore 0.7 milestone Jan 7, 2025
augustebaum pushed a commit that referenced this issue Jan 14, 2025

Verified

This commit was signed with the committer’s verified signature.
auguste-probabl Auguste Baum
This is for several reasons:
- it is not as explicit as several simple `put` calls: the mechanic
is ambiguous with regards to atomicity (if a key-value pair is
invalid, does it make the whole operation fail?)
- the mechanic makes it complicated to add options to `put`,
e.g. the `note` option proposed in
#1041
or the `display_as` option proposed in
#1045 (comment).
augustebaum pushed a commit that referenced this issue Jan 14, 2025

Verified

This commit was signed with the committer’s verified signature.
auguste-probabl Auguste Baum
This is for several reasons:
- it is not as explicit as several simple `put` calls: the mechanic
is ambiguous with regards to atomicity (if a key-value pair is
invalid, does it make the whole operation fail?)
- the mechanic makes it complicated to add options to `put`,
e.g. the `note` option proposed in
#1041
or the `display_as` option proposed in
#1045 (comment).
augustebaum pushed a commit that referenced this issue Jan 14, 2025

Verified

This commit was signed with the committer’s verified signature.
auguste-probabl Auguste Baum
This is for several reasons:
- it is not as explicit as several simple `put` calls: the mechanic
is ambiguous with regards to atomicity (if a key-value pair is
invalid, does it make the whole operation fail?)
- the mechanic makes it complicated to add options to `put`,
e.g. the `note` option proposed in
#1041
or the `display_as` option proposed in
#1045 (comment).
thomass-dev pushed a commit that referenced this issue Jan 14, 2025

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
This is for several reasons:
- it is not as explicit as several simple `put` calls: the mechanic
is ambiguous with regards to atomicity (if a key-value pair is
invalid, does it make the whole operation fail?)
- the mechanic makes it complicated to add options to `put`,
e.g. the `note` option proposed in
#1041
or the `display_as` option proposed in
#1045 (comment).
@thomass-dev
Copy link
Collaborator

thomass-dev commented Jan 15, 2025

@marie @Sylvain

To track the evolution of an object programmatically, do you need a list with all the versions of an item and their metadata (date), or a sorted list of values by date is sufficient?

In other words:

# A
versions = ['a', 'b', 'c', 'd']
# B
versions = [('a', '2020-01-01'), ('b', '2020-01-02'), ('c', '2020-01-03'), ('d', '2020-01-04')]

(i'm modifying the get_item_versions function to hide the Item class).

@tuscland
Copy link
Member

Could we structure this a bit with a dictionary or an object so it is future-proof? There is more metadata one will want to access for a particular item version.

@MarieS-WiMLDS
Copy link
Contributor Author

Agree with @tuscland, we might want to add things such as how and by whom the object was created at some point for team cooperation.

@sylvaincom
Copy link
Contributor

Agreed with @tuscland: soon there will be text comments for each version of an item also, see #1041

@thomass-dev
Copy link
Collaborator

@sylvaincom the backend part is already merged. So you want a dict with

{
    "value": ...,
    "date": ...,
    "note": ...,
}

@sylvaincom
Copy link
Contributor

Maybe also add the version number? So that it's more easily accessible

In the future, as @MarieS-WiMLDS pointed out, we will also add the user name

@thomass-dev
Copy link
Collaborator

In the future, as @MarieS-WiMLDS pointed out, we will also add the user name

You talk about collaboration here, we are far from it ^^.

@thomass-dev
Copy link
Collaborator

thomass-dev commented Jan 15, 2025

I propose a new API:

def get(self, key: str, *, latest=True, metadata=False): ...

latest is used to return the latest version or all the history.
metadata is used to return the metadata in addition of the value.

get("<key>") -> "<object2>"
get("<key>", latest=False) -> ["<object1>", "<object2>"]
get("<key>", latest=False, metadata=True) -> [{"object': "<object1>", "date": "<date>", "note": "<note>"}, ...]
get("<key>", metadata=True) -> {"object': "<object2>", "date": "<date>", "note": "<note>"}

What do you think? It is okay or you want to keep get_item_versions ?

@tuscland tuscland modified the milestones: Skore 0.6, Test Jan 15, 2025
@MarieS-WiMLDS
Copy link
Contributor Author

MarieS-WiMLDS commented Jan 15, 2025

I'm not comfortable with the option latest, it gives only two possibilities to the user. What about version that could take latest as input, all, or really a version number?

@thomass-dev
Copy link
Collaborator

thomass-dev commented Jan 15, 2025

Getting a specific version is only project.get("<key>", latest=False)[the_version_number_i_want] no?
What i dislike with your proposition, is that you assume that the user knows the number of versions in advance.

@augustebaum
Copy link
Contributor

augustebaum commented Jan 15, 2025

I would rather a get_versions that returns all versions and all metadata, than adding a bunch of optional arguments to get

@glemaitre
Copy link
Member

glemaitre commented Jan 15, 2025

version="latest" when it comes to get the latest version.

Regarding versioning, I'll take an experience that I have with OpenML that version datasets: you have a version parameter that take an int. However, by setting it you can be quite in the dark to know what are you going to get.

So I think that having a utility as proposed by @augustebaum that can allow me quickly to know what are the metadata (supposing that the metadata is the entry point allowing me to filter what I need) and the corresponding version to know what I need to pick would be super useful, without the need to get the item themself. I could even imagine some filtering feature allowing to get a subset of the versions.

Also it means that it simplifies the "job" what get is supposed to do: return an item. Potentially, we can imagine to always return a list (with a single object) and allow for stuff like

# items is always a sequence (or maybe a dict to be able to have a richer indexing)
items = get("key", version="latest")
items = get(version=get_versions("key", filter="5 <= version <= 10")
items = get(version=get_versions("key", filter"version > 10 and date < @target_date")

The filter syntax with this text would be the one use by df.query since I could easily think that the output of get_version could have a sort of table representation.

@tuscland
Copy link
Member

Isn't it weird to query for metadata using an API called get_versions?

What about:

# When `version` is specified, `get` returns a sequence of dictionaries
items = get("key", version=42)
items = get("key", version="latest") # "latest" is a filter specification like the other
items = get("key", version="5 <= version <= 10") # can be implemented later
items = get("key", version="version > 10 and date < @target_date")

@glemaitre
Copy link
Member

I think that I would be happier to assess only the info without to load potentially the items. I take the parallel of the conda/pip way to request metadata: pip you need(ed) to get the entire artefacts and then the get the metadata while conda you can already get only the metadata. At the end, it is a huge benefit if I'm not interested in the item.

get_versions is potentially a weird name. get_info(version=..., metadata=True/False) returning dataclasses might be my next thought :). get would have a parameter that take the output of get_info or potentially expose the parameter of get_info as a shortcuts (it can become overwhelming but I don't have the full understanding).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
6 participants