Restrict transform to columns? #81

antholzer · 2022-05-23T08:43:09Z

At the moment, to apply a Colwise transform to selected columns only, one has to use Select. Or is there another way?

I find that Select has some drawbacks when used this way: all the non-selected columns are put into the cache, which would not be necessary for, e.g., the following pipeline:

(Select(:a) → MinMax()) ⊔ (Select(:b) → ZScore())

With a large pipeline or table, the resulting large cache could be annoying. There is also an issue with revert (#80).

So would it make sense to introduce a wrapper transform (or something else) that takes a subset of columns and applies the (Colwise) transform only to that subset? For the pipeline above, this could look like the following:

Restrict(:a)(MinMax()) → Restrict(:b)(ZScore())
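
To make the idea concrete, here is a minimal sketch of what such a wrapper could do. It is written against plain named tuples rather than the TableTransforms API, and every name in it (restrict_apply, restrict_revert, minmax_apply, minmax_revert) is hypothetical. The point is only that the cache would hold the statistics of the restricted columns and nothing else.

```julia
# Hypothetical sketch, not part of TableTransforms.
# Tables are modelled as plain NamedTuples of vectors.

# Min-max scaling of one column; the cache is just (lo, hi).
function minmax_apply(x)
    lo, hi = extrema(x)
    (x .- lo) ./ (hi - lo), (lo, hi)
end

minmax_revert(y, (lo, hi)) = y .* (hi - lo) .+ lo

# Apply `f` to the columns named in `cols`; pass the others through untouched.
function restrict_apply(f, table::NamedTuple, cols)
    caches = Dict{Symbol,Any}()
    newcols = map(keys(table)) do name
        if name in cols
            y, c = f(table[name])
            caches[name] = c      # only restricted columns contribute to the cache
            y
        else
            table[name]
        end
    end
    NamedTuple{keys(table)}(newcols), caches
end

# Undo `f` on the cached columns; all other columns come from the table being reverted.
function restrict_revert(g, table::NamedTuple, caches)
    newcols = map(keys(table)) do name
        haskey(caches, name) ? g(table[name], caches[name]) : table[name]
    end
    NamedTuple{keys(table)}(newcols)
end

t = (a = [1.0, 2.0, 3.0], c = [10.0, 20.0, 30.0])
u, cache = restrict_apply(minmax_apply, t, (:a,))   # cache has an entry for :a only
r = restrict_revert(minmax_revert, u, cache)        # column c is never stored or re-added
```

Since the untouched columns are taken from whatever table is passed in, nothing from the original table would have to be stored for them.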


juliohm · 2022-05-23T10:46:10Z

Can you elaborate on why the large cache is an issue in this scenario? Is it because you expected more performance? Or because the memory footprint is too high for the application?

Assuming that it is indeed an issue, would it make sense to "materialize" the pipeline at a specific step? For example, you can always drop the cache by evaluating the pipeline with pipeline(table) and then take the resulting table and insert it into a second pipeline. Would that work?

I am asking these questions because we could consider adding a column selection feature to all Colwise transforms, but this increased code complexity must be justified with a popular use case.

antholzer · 2022-05-24T15:00:54Z

It is not really an issue for my application since the amount of data is not large.

In my use case I save the cache of the pipeline to later use it on new data samples. Saving/Using the cache of Select is a problem in this case since revert would re-add the column from the old data, which might not even be the same size as the new data.

Thus one needs to carefully split up the pipeline and save/use only the correct cache, which is not so nice.

Problem with revert and Select:

julia> p = Select(:a) → MinMax()
Sequential(TableTransforms.Transform[Select{Tuple{Symbol}}((:a,)), Scale{Int64}(0, 1)])

julia> t = (a=rand(10), c=rand(10));

julia> _, cache = apply(p, t);

julia> z = reapply(p, (a=rand(3), c=rand(3)), cache);

julia> length(revert(p, z, cache).c)
10

juliohm · 2022-05-25T14:14:24Z

I think the issue we are facing here is more profound. It has to do with the fact that Select's revertibility is tied to a specific input table. When you reapply the pipeline to a new table the cache isn't modified and so you cannot "unselect" the columns of the new table that were never stored. I wonder what could be done differently?
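
The mechanism can be illustrated with a toy model that does not use the library at all. This is an assumption about how a selecting transform with a cache behaves, not the TableTransforms source, and the function names (select_apply, select_reapply, select_revert) are made up for the illustration.

```julia
# Toy model of a selecting transform; not the TableTransforms implementation.

# Keep `cols`; the cache is the set of dropped columns of *this* table.
function select_apply(table::NamedTuple, cols)
    dropped = filter(n -> !(n in cols), keys(table))
    NamedTuple{cols}(table), NamedTuple{dropped}(table)
end

# Reapplying on a new table selects the same columns and leaves the cache as it is.
select_reapply(table::NamedTuple, cols, cache) = NamedTuple{cols}(table)

# Reverting re-attaches whatever columns the cache holds.
select_revert(table::NamedTuple, cache) = merge(table, cache)

old = (a = rand(10), c = rand(10))
_, cache = select_apply(old, (:a,))

new = (a = rand(3), c = rand(3))
z = select_reapply(new, (:a,), cache)

reverted = select_revert(z, cache)
# reverted.a comes from the new table (3 rows);
# reverted.c comes from the old table (10 rows)
```

In this model the dropped column of the new table is never recorded anywhere, so revert can only hand back the column that was cached from the old table.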

What is your proposal for this new Restrict transform, and how would it address the issue above?

juliohm · 2022-05-30T21:32:43Z

@antholzer I will close this issue, but feel free to reopen it if you have a proposal that we could brainstorm further.

juliohm added the discussion label May 23, 2022

juliohm closed this as completed May 30, 2022
