Restrict transform to columns? #81

Closed
antholzer opened this issue May 23, 2022 · 4 comments

@antholzer
Contributor

At the moment, to apply a Colwise transformation to selected columns, one has to use Select. Or is there another way?

I find that Select has some drawbacks when used this way: all the non-selected columns are put into the cache, which would not be necessary for a pipeline such as the following:

(Select(:a) → MinMax()) ⊔ (Select(:b) → ZScore())

With a large pipeline or table, the resulting large cache could become a nuisance. There is also an issue with revert (#80).

So would it make sense to introduce a wrapper transform (or something else) that takes a subset of columns and applies the (Colwise) transform only to that subset? For the pipeline above, this could look like the following:

Restrict(:a)(MinMax()) → Restrict(:b)(ZScore())
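
To make the idea concrete, here is a minimal sketch in plain Julia, independent of TableTransforms.jl. It assumes a table is a NamedTuple of vectors. The names `restrict`, `minmax_col` and `zscore_col` are hypothetical stand-ins and not part of the package API. The point is only that the cache ends up holding per-column statistics rather than copies of the untouched columns.

```julia
using Statistics: mean, std

# Hypothetical column-wise transforms: each returns the new column
# together with the cache that would be needed to revert it.
minmax_col(x) = ((x .- minimum(x)) ./ (maximum(x) - minimum(x)),
                 (min=minimum(x), max=maximum(x)))
zscore_col(x) = ((x .- mean(x)) ./ std(x), (μ=mean(x), σ=std(x)))

# Hypothetical wrapper: apply `f` only to the columns named in `cols`.
# All other columns pass through untouched, so no column data is cached.
function restrict(f, cols, table::NamedTuple)
    caches = Dict{Symbol,Any}()
    newcols = map(keys(table)) do name
        if name in cols
            col, cache = f(table[name])
            caches[name] = cache
            col
        else
            table[name]
        end
    end
    NamedTuple{keys(table)}(newcols), caches
end

t = (a=[1.0, 2.0, 3.0], b=[10.0, 20.0, 30.0], c=[7, 8, 9])

# Rough analogue of Restrict(:a)(MinMax()) → Restrict(:b)(ZScore())
t1, cache_a = restrict(minmax_col, (:a,), t)
t2, cache_b = restrict(zscore_col, (:b,), t1)
```

In this sketch column `c` is never copied or cached, and each cache contains only the statistics of the column that was actually transformed.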
@juliohm
Member

juliohm commented May 23, 2022

Can you elaborate on why the large cache is an issue in this scenario? Is it because you expected more performance? Or because the memory footprint is too high for the application?

Assuming that it is indeed an issue, would it make sense to "materialize" the pipeline at a specific step? For example, you can always drop the cache by evaluating the pipeline with pipeline(table) and then take the resulting table and insert it into a second pipeline. Would that work?

I am asking these questions because we could consider adding a column selection feature to all Colwise transforms, but this increased code complexity must be justified with a popular use case.

@antholzer
Contributor Author

It is not really an issue for my application since the amount of data is not large.

In my use case I save the cache of the pipeline to use it later on new data samples. Saving and reusing the cache of Select is a problem in this case, since revert would re-add the columns from the old data, which might not even have the same size as the new data.

Thus one needs to carefully split up the pipeline and only save and use the correct cache, which is not so nice.

Problem with revert and Select:

julia> using TableTransforms

julia> p = Select(:a) → MinMax()
Sequential(TableTransforms.Transform[Select{Tuple{Symbol}}((:a,)), Scale{Int64}(0, 1)])

julia> t = (a=rand(10), c=rand(10));

julia> _, cache = apply(p, t);

julia> z = reapply(p, (a=rand(3), c=rand(3)), cache);

julia> length(revert(p, z, cache).c)
10

@juliohm
Member

juliohm commented May 25, 2022

I think the issue we are facing here is more profound. It has to do with the fact that Select's revertibility is tied to a specific input table. When you reapply the pipeline to a new table the cache isn't modified and so you cannot "unselect" the columns of the new table that were never stored. I wonder what could be done differently?
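
The mechanism can be illustrated with a toy model in plain Julia. This is not the TableTransforms.jl implementation: the functions `select_apply`, `select_reapply` and `select_revert` are hypothetical and only mimic the behavior described in this thread, with tables assumed to be NamedTuples of vectors.

```julia
# Toy model: applying a selection stores the rejected columns in the cache
# so that reverting can put them back.
function select_apply(cols, table::NamedTuple)
    rejected = filter(n -> !(n in cols), keys(table))
    selected = NamedTuple{cols}(table)
    cache = NamedTuple{rejected}(table)   # column data from *this* table
    selected, cache
end

# Reapplying to a new table reuses the old cache unchanged.
select_reapply(cols, table::NamedTuple, cache) = NamedTuple{cols}(table)

# Reverting merges the cached columns back in, whatever table they came from.
select_revert(newtable::NamedTuple, cache) = merge(newtable, cache)

old = (a=rand(10), c=rand(10))
_, cache = select_apply((:a,), old)

z = select_reapply((:a,), (a=rand(3), c=rand(3)), cache)
r = select_revert(z, cache)

# Column `a` comes from the new table, column `c` from the old one.
(length(r.a), length(r.c))
```

Under these assumptions the reverted table mixes a 3-row column `a` with a 10-row column `c`, which is the same mismatch as in the REPL session above.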

What is your proposal for this new Restrict transform, and how would it address the issue above?

@juliohm
Member

juliohm commented May 30, 2022

@antholzer I will close this issue, but feel free to reopen it if you have a proposal that we could brainstorm further.

@juliohm juliohm closed this as completed May 30, 2022