... # some work omitted
cleaned_data.filter(
    regex="Neighborhood=.|Gr_Liv_Area|Garage_Area|SalePrice"
).to_csv("outputs/cleaned_data_housing.csv", index=False)
artifact = lineapy.save(lineapy.file_system, "cleaned_data_housing")
Specifically, lineapy.file_system is not intuitive.
I speculate that this is because when the user is thinking about saving the "cleaned_data_housing.csv" from cleaned_data, they have to map that activity to a new concept, lineapy.file_system.
We initially went with lineapy.file_system because it's the most technically succinct way of describing the desired capture mechanism: via side-effects (this is similar to the requirement for asserts #449). Note the semantic difference as compared to lineapy.save(cleaned_data, ....), where we save the value of cleaned_data, and the slice would end with the final line that last changed cleaned_data (specifically in the notebook, cleaned_data = cleaned_data.drop(columns=Neighborhood_cats[0])). If we were to slice this into Airflow, the job would be a no-op since no change is persisted or passed to another job.
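To make the semantic difference concrete, here is a toy backward slicer. It is only a sketch and is not how lineapy is implemented: `Stmt`, `backward_slice`, the `defines`/`uses` bookkeeping, and the sample statements are all made up for illustration. It shows that slicing on the value of `cleaned_data` drops the `to_csv` line, while slicing on the file write keeps it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stmt:
    """Hypothetical record of one notebook line and its variable usage."""
    text: str
    defines: frozenset
    uses: frozenset
    writes_file: bool = False


def backward_slice(program, seed_vars=(), seed_indices=()):
    """Walk the program backwards, keeping seed lines and any line that
    defines a variable still needed by the lines already kept."""
    needed = set(seed_vars)
    kept = []
    for i in range(len(program) - 1, -1, -1):
        stmt = program[i]
        if i in seed_indices or (stmt.defines & needed):
            kept.append(i)
            needed -= stmt.defines
            needed |= stmt.uses
    return sorted(kept)


# Illustrative stand-in for the demo notebook, not its actual contents.
program = [
    Stmt("raw = pd.read_csv(...)", frozenset({"raw"}), frozenset()),
    Stmt("cleaned_data = raw.dropna()", frozenset({"cleaned_data"}), frozenset({"raw"})),
    Stmt("cleaned_data = cleaned_data.drop(columns=...)",
         frozenset({"cleaned_data"}), frozenset({"cleaned_data"})),
    Stmt("cleaned_data.filter(...).to_csv(...)",
         frozenset(), frozenset({"cleaned_data"}), writes_file=True),
    Stmt("preview = raw.head()", frozenset({"preview"}), frozenset({"raw"})),
]

# Slicing on the value: the slice ends at the last line that changed cleaned_data.
value_slice = backward_slice(program, seed_vars={"cleaned_data"})

# Slicing on the side effect: seed with every line that writes a file.
file_seeds = {i for i, s in enumerate(program) if s.writes_file}
file_slice = backward_slice(program, seed_indices=file_seeds)

print("value slice:", [program[i].text for i in value_slice])
print("file slice: ", [program[i].text for i in file_slice])
```

In this toy model the value slice never reaches the `to_csv` line because that line defines no variable, which is the no-op Airflow job described above.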
There are a few different options; all of them will require some form of additional annotation beyond lineapy.save(cleaned_data, ....).
1. lineapy.save(cleaned_data, include_side_effects=True). Here, we'll include all the code that has side effects and uses cleaned_data.
   We can even set include_side_effects to True by default to further reduce friction for this use case. It's hard to imagine when the user would want to exclude side effects, though one scenario I can imagine is writing a sample to disk and then writing the whole thing to a SQL database (but it's unlikely?).
2. Another option is to have a different API that saves the actual value of something (and implicitly extracts a process) vs. something that just extracts the process. So something like lineapy.save_pipeline(cleaned_data, "") and lineapy.save_value(cleaned_data, ""). My sense is that this option might be too indirect and confusing.
   The benefit here would be that the .get would be clearer: we cannot do a .get on side effects but can on saved values.
3. We can also save the call, to_csv, and we can do so as a decorator to the line (@linea.save). But I'm not a fan because the user would have to do the decoration before the invocation of the call, which doesn't go with our philosophy of deployment after the initial invocations.
Another consideration is fine-grained addressability: the downside of options 1 and 2 is that they don't allow more fine-grained side-effect slicing, whereas with lineapy.file_system we can change it to something like lineapy.file_system("outputs/cleaned_data_housing.csv") or lineapy.file_system(cleaned_data) to be more fine-grained. Option 3 allows the fine-grained annotation directly.
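As a rough illustration of what fine-grained addressing could mean, the sketch below filters a list of recorded file writes by path or by source variable. Everything here is hypothetical (`FileWrite`, `select_writes`, and the recorded statements, including the backup path); it only models the selection idea and does not call or describe any real lineapy API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileWrite:
    """Hypothetical record of one file-writing statement seen during a session."""
    statement: str
    source_var: str
    path: str


def select_writes(writes, path=None, source_var=None):
    """Return the recorded writes matching the given path and/or variable.
    With no filters, every write is selected (the coarse-grained behaviour)."""
    selected = []
    for write in writes:
        if path is not None and write.path != path:
            continue
        if source_var is not None and write.source_var != source_var:
            continue
        selected.append(write)
    return selected


# Made-up session: a sample and two copies of the cleaned data are written.
recorded = [
    FileWrite('sample.to_csv("outputs/sample.csv")',
              "sample", "outputs/sample.csv"),
    FileWrite('cleaned_data.to_csv("outputs/cleaned_data_housing.csv")',
              "cleaned_data", "outputs/cleaned_data_housing.csv"),
    FileWrite('cleaned_data.to_csv("outputs/cleaned_data_backup.csv")',
              "cleaned_data", "outputs/cleaned_data_backup.csv"),
]

everything = select_writes(recorded)
by_path = select_writes(recorded, path="outputs/cleaned_data_housing.csv")
by_var = select_writes(recorded, source_var="cleaned_data")

print(len(everything), len(by_path), len(by_var))
```

The unfiltered call corresponds to the current coarse-grained capture, while the path and variable filters correspond to the two narrower forms mentioned above.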
These are just initial thoughts, please help brainstorm! cc @dorx per your request.
From my point of view as a user, it is very important to define not only how to save something, but also how to load it in the next pipeline step. For example, when a dataframe is saved to CSV, how do you know how the dataframe is loaded again in the next step? The same argument is valid for saving and loading ML models from different frameworks (PyTorch, ONNX, TensorFlow, etc.).
A decorator for this feature would be fine for me.
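One way to picture the save/load concern is to record the format next to the path, so the next step can look up the matching loader. This is just a sketch using the standard library: `HANDLERS`, `save_artifact`, `load_artifact`, and the sample rows are invented for illustration and are not part of lineapy.

```python
import csv
import json
import tempfile
from pathlib import Path


def save_csv(rows, path):
    """Write a list of dicts as CSV with a header row."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)


def load_csv(path):
    """Read CSV back as a list of dicts (all values come back as strings)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def save_json(obj, path):
    Path(path).write_text(json.dumps(obj))


def load_json(path):
    return json.loads(Path(path).read_text())


# Each format pairs a saver with the loader that can read its output.
HANDLERS = {"csv": (save_csv, load_csv), "json": (save_json, load_json)}


def save_artifact(value, path, fmt):
    """Save the value and return a record that remembers how to load it."""
    saver, _ = HANDLERS[fmt]
    saver(value, path)
    return {"path": str(path), "format": fmt}


def load_artifact(record):
    """Load an artifact using the loader paired with its recorded format."""
    _, loader = HANDLERS[record["format"]]
    return loader(record["path"])


# Illustrative rows only; values are strings so the CSV round trip is exact.
rows = [{"Gr_Liv_Area": "1656", "SalePrice": "215000"}]

with tempfile.TemporaryDirectory() as tmp:
    record = save_artifact(rows, Path(tmp) / "cleaned_data_housing.csv", "csv")
    restored = load_artifact(record)

print(record["format"], restored == rows)
```

The same pairing idea would extend to model formats by registering a saver and loader per framework, though type fidelity (CSV returns strings here) would still need to be handled per format.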
We've gotten some initial feedback from alpha users that in the demo (https://github.com/LineaLabs/lineapy/blob/main/examples/Demo_1_Preprocessing.ipynb), the lineapy.save(lineapy.file_system, "cleaned_data_housing") line is confusing.