Exploration of more intuitive `.save` designs. #478

yifanwu · 2022-01-07T02:55:44Z

We've gotten some initial feedback from alpha users that, in the demo (https://github.com/LineaLabs/lineapy/blob/main/examples/Demo_1_Preprocessing.ipynb), the following lines are confusing:

... # some work omitted

cleaned_data.filter(
    regex="Neighborhood=.|Gr_Liv_Area|Garage_Area|SalePrice"
).to_csv("outputs/cleaned_data_housing.csv", index=False)

artifact = lineapy.save(lineapy.file_system, "cleaned_data_housing")

Specifically, lineapy.file_system is not intuitive.

I speculate that this is because when the user is thinking about saving the "cleaned_data_housing.csv" from cleaned_data, they have to map that activity to a new concept, lineapy.file_system.

We initially went with lineapy.file_system because it's the most technically succinct way of describing the desired capture mechanism: via side-effects (this is similar to the requirement for asserts #449). Note the semantic difference as compared to lineapy.save(cleaned_data, ....), where we save the value of cleaned_data, and the slice would end with the final line that last changed cleaned_data (specifically in the notebook, cleaned_data = cleaned_data.drop(columns=Neighborhood_cats[0])). If we were to slice this into Airflow, the job would be a no-op since no change is persisted or passed to another job.
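To make the semantic difference concrete, here is a toy sketch of the two slicing behaviors. It is not LineaPy's implementation: `Step`, `slice_for_value`, and `slice_for_side_effect` are hypothetical names, and the recorded steps only stand in for the notebook's lines.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Step:
    """One recorded line of user code (hypothetical model, not LineaPy's)."""
    code: str
    reads: FrozenSet[str] = frozenset()
    writes: FrozenSet[str] = frozenset()
    side_effect: Optional[str] = None  # e.g. "file_system"


def slice_for_value(steps: List[Step], var: str) -> List[int]:
    """Backward slice: indices of the steps needed to compute `var`."""
    needed = {var}
    kept = []
    for i in range(len(steps) - 1, -1, -1):
        step = steps[i]
        if step.writes & needed:
            kept.append(i)
            needed |= step.reads  # whatever this step reads is now needed too
    return sorted(kept)


def slice_for_side_effect(steps: List[Step], effect: str) -> List[int]:
    """Steps that cause `effect`, plus everything their inputs depend on."""
    kept = set()
    for i, step in enumerate(steps):
        if step.side_effect == effect:
            kept.add(i)
            for var in step.reads:
                kept |= set(slice_for_value(steps[:i], var))
    return sorted(kept)


cd = frozenset({"cleaned_data"})
steps = [
    Step("cleaned_data = load()", writes=cd),  # placeholder for earlier work
    Step("cleaned_data = cleaned_data.drop(...)", reads=cd, writes=cd),
    Step("cleaned_data.filter(...).to_csv(...)", reads=cd,
         side_effect="file_system"),
    Step("print(cleaned_data.shape)", reads=cd),
]

value_slice = slice_for_value(steps, "cleaned_data")        # [0, 1]
file_slice = slice_for_side_effect(steps, "file_system")    # [0, 1, 2]
```

The value slice stops at the last line that changed cleaned_data, so the to_csv call is dropped; only the side-effect slice keeps the line that actually persists anything.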

There are a few different options, all of which will require some form of additional annotation beyond lineapy.save(cleaned_data, ....).

1. lineapy.save(cleaned_data, include_side_effects=True). Here, we'll include all the code with side effects that uses cleaned_data.

   We can even set include_side_effects to True by default to further reduce friction for this use case. It's hard to imagine when the user would want to exclude side effects, though one scenario I can imagine is writing a sample to disk and then writing the whole thing to a SQL database (but that seems unlikely).

2. Another option is to have one API that saves the actual value of something (and implicitly extracts a process) and a different one that just extracts the process. So something like lineapy.save_pipeline(cleaned_data, "") and lineapy.save_value(cleaned_data, ""). My sense is that this option might be too indirect and confusing.

   The benefit here would be that the .get would be clearer: we cannot do a .get on side effects but can on saved values.

3. We can also save the call, to_csv, and we can do so as a decorator to the line (@linea.save). But I'm not a fan because the user would have to do the decoration before the invocation of the call, which doesn't go with our philosophy of deployment after the initial invocations.
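As a rough illustration of the save_pipeline / save_value split and why .get becomes clearer, here is a minimal in-memory sketch. `ArtifactStore` and its method signatures are hypothetical (including the extra `code` argument, which stands in for an already-computed slice); none of this is an existing LineaPy API.

```python
class ArtifactStore:
    """Toy registry separating saved values from saved processes."""

    def __init__(self):
        self._code = {}    # name -> code that reproduces the artifact
        self._values = {}  # name -> stored value (only for save_value)

    def save_value(self, value, name, code):
        """Store both the value and the process that produced it."""
        self._code[name] = code
        self._values[name] = value

    def save_pipeline(self, value, name, code):
        """Store only the process; the value is intentionally not kept."""
        self._code[name] = code

    def get_code(self, name):
        return self._code[name]

    def get(self, name):
        """Return a stored value, or fail loudly for process-only artifacts."""
        if name not in self._values:
            raise LookupError(
                f"{name!r} was saved as a pipeline; there is no value to get"
            )
        return self._values[name]


store = ArtifactStore()
store.save_value([1, 2, 3], "cleaned_data", "cleaned_data = ...")
store.save_pipeline(None, "cleaned_data_housing", "cleaned_data.to_csv(...)")

value = store.get("cleaned_data")                  # the stored value
code = store.get_code("cleaned_data_housing")      # code only, no value
```

With this split, calling get on "cleaned_data_housing" raises instead of returning something ambiguous, which is the clarity benefit described above.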

Another consideration is fine-grained addressability. The downside of options 1 and 2 (the side-effects flag and the separate save_pipeline/save_value APIs) is that they don't allow finer-grained side-effect slicing, whereas with lineapy.file_system we can change it to something like lineapy.file_system("outputs/cleaned_data_housing.csv") or lineapy.file_system(cleaned_data) to be more fine-grained. Option 3 (the decorator) allows the fine-grained annotation directly.
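A small sketch of what that finer-grained addressing could look like, assuming file writes are recorded along with their target path and source variable. `FileWrite`, the free function `file_system`, and the sample and report paths are all hypothetical; this only mirrors the idea, not an actual LineaPy signature.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FileWrite:
    """A recorded file-system side effect (hypothetical model)."""
    code: str
    path: str
    source_var: str


def file_system(
    writes: List[FileWrite],
    path: Optional[str] = None,
    source_var: Optional[str] = None,
) -> List[FileWrite]:
    """Select writes, optionally narrowed by target path or source variable."""
    return [
        w for w in writes
        if (path is None or w.path == path)
        and (source_var is None or w.source_var == source_var)
    ]


writes = [
    FileWrite("cleaned_data.head().to_csv(...)",
              "outputs/sample.csv", "cleaned_data"),        # made-up path
    FileWrite("cleaned_data.filter(...).to_csv(...)",
              "outputs/cleaned_data_housing.csv", "cleaned_data"),
    FileWrite("report.to_csv(...)",
              "outputs/report.csv", "report"),              # made-up path
]

everything = file_system(writes)                            # all three writes
by_path = file_system(writes, path="outputs/cleaned_data_housing.csv")
by_var = file_system(writes, source_var="cleaned_data")     # first two writes
```

An unqualified selector captures every write, which is the coarse behavior of today's lineapy.file_system; the path and variable filters are the two refinements suggested above.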

These are just initial thoughts, please help brainstorm! cc @dorx per your request.

pd-t · 2023-01-10T17:41:15Z

@yifanwu Maybe one more thought:

From my point of view as a user, it is very important to define not only how to save something, but also how to load it in the next pipeline step. For example, when a dataframe is saved to CSV, how do you know how the dataframe should be loaded again in the next step? The same argument applies to saving and loading ML models from different frameworks (PyTorch, ONNX, TensorFlow, etc.).

A decorator for this feature would be fine for me.

yifanwu added the documentation (Improvements or additions to documentation) and ux_design labels on Jan 7, 2022
