
Exploration of more intuitive .save designs. #478

Open
yifanwu opened this issue Jan 7, 2022 · 1 comment
Labels
documentation, ux_design

Comments

@yifanwu
Contributor

yifanwu commented Jan 7, 2022

We've gotten some initial feedback from alpha users that in the demo (https://github.com/LineaLabs/lineapy/blob/main/examples/Demo_1_Preprocessing.ipynb), the following snippet is confusing:

... # some work omitted

cleaned_data.filter(
    regex="Neighborhood=.|Gr_Liv_Area|Garage_Area|SalePrice"
).to_csv("outputs/cleaned_data_housing.csv", index=False)

artifact = lineapy.save(lineapy.file_system, "cleaned_data_housing")

Specifically, lineapy.file_system is not intuitive.

I speculate that this is because, when the user is thinking about saving "cleaned_data_housing.csv" derived from cleaned_data, they have to map that activity onto a new concept, lineapy.file_system.

We initially went with lineapy.file_system because it's the most technically succinct way of describing the desired capture mechanism: capture via side effects (similar to the requirement for asserts in #449). Note the semantic difference compared to lineapy.save(cleaned_data, ...), where we save the value of cleaned_data and the slice would end at the last line that changed cleaned_data (in the notebook, cleaned_data = cleaned_data.drop(columns=Neighborhood_cats[0])). If we were to slice that into an Airflow job, the job would be a no-op, since no change is persisted or passed on to another job.
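
To make the semantic difference concrete, here is a minimal side-by-side sketch; the slices described in the comments are illustrative, based on the demo notebook above:

# Value capture: the slice would end at the last statement that modified cleaned_data.
cleaned_data = cleaned_data.drop(columns=Neighborhood_cats[0])
artifact = lineapy.save(cleaned_data, "cleaned_data")

# Side-effect capture: the slice also has to keep the .to_csv(...) call, since
# reproducing that write is the whole point of the generated pipeline step.
cleaned_data.filter(
    regex="Neighborhood=.|Gr_Liv_Area|Garage_Area|SalePrice"
).to_csv("outputs/cleaned_data_housing.csv", index=False)
artifact = lineapy.save(lineapy.file_system, "cleaned_data_housing")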

There are a few different options; all of them require some form of additional annotation beyond lineapy.save(cleaned_data, ...). (Hypothetical call-site sketches for all three follow the list.)

  1. lineapy.save(cleaned_data, include_side_effects=True). Here, we'd include all the code with side effects that uses cleaned_data.
  • We could even set include_side_effects to True by default to further reduce friction for this use case. It's hard to imagine when the user would want to exclude side effects, though one scenario could be writing a sample to disk and then writing the whole thing to a SQL database (but that seems unlikely).
  2. Another option is to have one API that saves the actual value of something (and implicitly extracts the process) vs. another that only extracts the process, e.g. lineapy.save_pipeline(cleaned_data, "") and lineapy.save_value(cleaned_data, ""). My sense is that this option might be too indirect and confusing.
  • The benefit here is that .get would be clearer: we cannot .get side effects, but we can .get saved values.
  3. We could also save the call itself, to_csv, via a decorator on the line (@linea.save). I'm not a fan, because the user would have to add the decoration before invoking the call, which doesn't fit our philosophy of deploying after the initial invocations.
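
For comparison, here is roughly what each option above might look like at the call site. None of this is existing API; include_side_effects, save_pipeline, save_value, and the decorator form are just the hypothetical names from the list, written out to make the shapes concrete (option 3 is shown decorating a small function, since Python decorators attach to functions rather than single lines):

# Option 1: a single save() with an explicit (or default-on) side-effect flag.
artifact = lineapy.save(cleaned_data, "cleaned_data_housing", include_side_effects=True)

# Option 2: split by intent -- capture the process vs. capture the value.
pipeline = lineapy.save_pipeline(cleaned_data, "cleaned_data_housing_pipeline")
value = lineapy.save_value(cleaned_data, "cleaned_data_housing")  # only this one supports .get

# Option 3: annotate the side-effecting call itself, before it runs.
@linea.save
def write_cleaned_csv(df):
    df.filter(
        regex="Neighborhood=.|Gr_Liv_Area|Garage_Area|SalePrice"
    ).to_csv("outputs/cleaned_data_housing.csv", index=False)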

Another consideration is fine-grained addressability: the downside of options 1 and 2 is that they don't allow more fine-grained side-effect slicing, whereas with lineapy.file_system we could change it to something like lineapy.file_system("outputs/cleaned_data_housing.csv") or lineapy.file_system(cleaned_data) to be more fine-grained. Option 3 allows the fine-grained annotation directly.
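
As a rough sketch of those fine-grained variants (again hypothetical; neither overload exists today):

# Address one specific write by its path...
artifact = lineapy.save(
    lineapy.file_system("outputs/cleaned_data_housing.csv"), "cleaned_data_housing"
)

# ...or address the side effects tied to a particular variable.
artifact = lineapy.save(lineapy.file_system(cleaned_data), "cleaned_data_housing")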

These are just initial thoughts; please help brainstorm! cc @dorx per your request.

yifanwu added the documentation and ux_design labels on Jan 7, 2022
@pd-t
Contributor

pd-t commented Jan 10, 2023

@yifanwu Maybe one more thought:

From my point of view as a user, it is very important to define not only how something is saved but also how it is loaded in the next pipeline step. For example, when a dataframe is saved to CSV, how do you know how the dataframe gets loaded again in the next step? The same argument applies to saving and loading ML models from different frameworks (PyTorch, ONNX, TensorFlow, etc.).

A decorator for this feature would be fine for me.
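
For illustration only, a decorator along those lines might pair the save logic with the matching load logic; lineapy.register_io and its path/save/load parameters are made-up names, not part of LineaPy:

import pandas as pd

# Hypothetical sketch: declare how the artifact is written and how the next
# pipeline step should read it back, so the generated code stays symmetric.
@lineapy.register_io(
    path="outputs/cleaned_data_housing.csv",
    save=lambda df, path: df.to_csv(path, index=False),
    load=lambda path: pd.read_csv(path),
)
def cleaned_data_housing(cleaned_data):
    return cleaned_data.filter(
        regex="Neighborhood=.|Gr_Liv_Area|Garage_Area|SalePrice"
    )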
