Add fast reductions #2869

Merged
merged 54 commits into from
Oct 23, 2021
Conversation

bkamins
Member

@bkamins bkamins commented Sep 9, 2021

Fixes #2439 #2768 #2440

I have implemented a proof of concept for sum. Let us discuss whether we like this approach. If we do, we can extend the list of supported reductions (comments on which reductions we would like to support are welcome).

CC @nalimilan @pdeffebach

@bkamins bkamins added the feature label Sep 9, 2021
@bkamins bkamins added this to the 1.3 milestone Sep 9, 2021
@bkamins
Member Author

bkamins commented Sep 9, 2021

TODO: once we decide what to do, update documentation and NEWS.md.

@bkamins
Member Author

bkamins commented Sep 10, 2021

@nalimilan The thing to discuss is the following tension:
This is what we currently do:

julia> df = DataFrame(x1=[1,2,missing], x2=[4.0,missing,missing])
3×2 DataFrame
 Row │ x1       x2        
     │ Int64?   Float64?  
─────┼────────────────────
   1 │       1        4.0
   2 │       2  missing   
   3 │ missing  missing   

julia> select(df, AsTable(:) => ByRow(sum∘skipmissing))
3×1 DataFrame
 Row │ x1_x2_sum_skipmissing 
     │ Real
─────┼───────────────────────
   1 │                   5.0
   2 │                   2
   3 │                   0

julia> df = DataFrame(x1=Any[missing,2,3], x2=Any[4.0,missing,missing])
3×2 DataFrame
 Row │ x1       x2      
     │ Any      Any     
─────┼──────────────────
   1 │ missing  4.0     
   2 │ 2        missing 
   3 │ 3        missing 

julia> select(df, AsTable(:) => ByRow(sum∘skipmissing))
3×1 DataFrame
 Row │ x1_x2_sum_skipmissing 
     │ Real
─────┼───────────────────────
   1 │                   4.0 
   2 │                   2   
   3 │                   3   

Which I think is not very nice. I would rather prefer to take promote_type of the eltypes of all columns and use that to determine the eltype of the output. In the first case above this would nicely produce a vector of Float64, which I think is preferable.
However, the consequence is that vectors of Any would be problematic, as we would not be able to establish the zero element in that case. What I propose is to internally narrow down the eltype of such vectors. If it is still Any, error (as it means the column holds values that are neither numeric nor missing). Otherwise perform the calculation using the columns with the narrowed-down eltype. What do you think?
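For illustration, a minimal sketch of the proposed rule on the first example above (not the PR's actual code):

```julia
using DataFrames

# Promote the eltypes of all columns, then strip Missing, since
# skipmissing removes missing values before the reduction runs.
df = DataFrame(x1=[1, 2, missing], x2=[4.0, missing, missing])
T = mapreduce(eltype, promote_type, eachcol(df))  # Union{Missing, Float64}
nonmissingtype(T)                                 # Float64
```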

@nalimilan
Member

Using promote_type would indeed make sense. That would be a difference between ByRow and plain broadcasting, right?

Regarding vectors with eltype Any, I'd rather not make any particular effort to make them work. Ideally we would do the same thing as without the fast path, even if that involves taking the slow path for them.

@bkamins
Member Author

bkamins commented Sep 10, 2021

That would be a difference between ByRow and plain broadcasting, right?

It is unrelated to ByRow/broadcasting. Neither map nor broadcasting do promote_type the way I propose (as you can see in my examples, the problem is that each row is treated differently based on a mixture of information about both the eltypes of the columns and the values stored in a given row). What I propose is essentially to determine the result type based ONLY on the eltypes of the columns (which is superior to what we have now, except for the case of columns with Any eltype).

Ideally we would do the same thing as without the fast path, even if that involves taking the slow path for them.

The problem is that taking the slow path will not yield any result, as for wide inputs Julia crashes - this PR is meant to avoid these crashes in the first place.
Instead of narrowing the eltype we could also just error if the column has Any eltype and produce an informative error message. I think in general we should discourage users from using the Any eltype in columns.
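As an aside, such narrowing can be done with broadcasting, which recomputes the narrowest element type (a sketch, not a proposed implementation):

```julia
# Broadcasting identity over an Any vector rebuilds it with the
# narrowest eltype that covers the stored values.
v = Any[missing, 2, 3]
w = identity.(v)        # eltype(w) is Union{Missing, Int64}
```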

@bkamins
Member Author

bkamins commented Sep 11, 2021

I have pushed the proposal for summation with skipping missings. It works as follows:

julia> df = DataFrame(x1=[1,2,missing], x2=[4.0,missing,missing])
3×2 DataFrame
 Row │ x1       x2        
     │ Int64?   Float64?  
─────┼────────────────────
   1 │       1        4.0 
   2 │       2  missing   
   3 │ missing  missing   

julia> select(df, AsTable(:) => ByRow(sum∘skipmissing))
3×1 DataFrame
 Row │ x1_x2_sum_skipmissing 
     │ Float64
─────┼───────────────────────
   1 │                   5.0 
   2 │                   2.0 
   3 │                   0.0 

julia> df = DataFrame(x1=Any[missing,2,3], x2=Any[4.0,missing,missing])
3×2 DataFrame
 Row │ x1       x2      
     │ Any      Any     
─────┼──────────────────
   1 │ missing  4.0     
   2 │ 2        missing 
   3 │ 3        missing 

julia> select(df, AsTable(:) => ByRow(sum∘skipmissing))
ERROR: ArgumentError: The reduced element type Any does not support zero that is required to perform summation. Narrowing down element types of passed columns should be performed first.    

It is possible to try harder to deduce the summation result in the second case, but I am not sure it is useful (it would add a lot of complexity to the code for a case that I am not sure we want to support anyway).

@nalimilan
Member

I see two difficult points to address:

  • Changing the behavior of ByRow is breaking. Minor changes to the eltype are probably fine (especially if we make them more narrow). But throwing an error in cases that currently work is problematic, and that would be the case for Any: even though it crashes with a large number of columns, it works in other cases. That's why I think that falling back to the current slow path would make sense to avoid any regressions.
  • The fast path for aggregations should be an implementation detail. ByRow(f) should be equivalent to ByRow(x -> f(x)), modulo minor differences like rounding errors or possibly the former working in cases where the latter fails. So if we use promote_type for reductions, we should probably use it for all functions (even if that implies converting the vector returned by map manually).

@bkamins
Member Author

bkamins commented Sep 11, 2021

That's why I think that falling back to the current slow path would make sense to avoid any regressions.

OK - I will change it this way

The fast path for aggregations should be an implementation detail. ByRow(f) should be equivalent to ByRow(x -> f(x))

I do not think it is possible. This is exactly the same problem as with fast aggregations for GroupedDataFrame.

If I write AsTable(some_cols) => ByRow(f) I do not know how f would use the NamedTuple passed. E.g. it might rely on column names etc.

The other problem is that even if we did some way to signal that f is a reduction the general approach will be slower than what is proposed here as in e.g. sum or sum∘skipmissing we explicitly use the knowledge of the way the reduction should be performed to make it efficient.

@nalimilan
Member

I do not think it is possible. This is exactly the same problem with as fast aggegations for GroupedDataFrame.

If I write AsTable(some_cols) => ByRow(f) I do not know how f would use the NamedTuple passed. E.g. it might rely on column names etc.

The other problem is that even if we did some way to signal that f is a reduction the general approach will be slower than what is proposed here as in e.g. sum or sum∘skipmissing we explicitly use the knowledge of the way the reduction should be performed to make it efficient.

What do you mean with "not possible"? What I suggest is to only enable optimizations for known aggregation functions for which we know the equivalence holds -- just like for grouped reduction. Isn't that possible here, or do you just mean that it's not possible in general for any function?

@bkamins
Member Author

bkamins commented Sep 12, 2021

Ah, you mean with:

ByRow(f) should be equivalent to ByRow(x -> f(x))

That the output should be identical - not the code path. Then it is possible, but it will be slow, and I am not sure it is desirable. See the first example:

julia> df = DataFrame(x1=[1,2,missing], x2=[4.0,missing,missing])
3×2 DataFrame
 Row │ x1       x2        
     │ Int64?   Float64?  
─────┼────────────────────
   1 │       1        4.0
   2 │       2  missing   
   3 │ missing  missing   

julia> select(df, AsTable(:) => ByRow(sum∘skipmissing))
3×1 DataFrame
 Row │ x1_x2_sum_skipmissing 
     │ Real
─────┼───────────────────────
   1 │                   5.0
   2 │                   2
   3 │                   0

Retaining this behavior is possible, but it would be complex to achieve, and I would argue that the result is not something most people would expect or want.

@nalimilan
Member

I'd rather do the reverse actually: always choose the return type using promote_type, even if that implies making a copy with a more narrow type after the fact in the general case. That's doable, right?
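Something like the following sketch (hypothetical, with a hand-written result vector standing in for what the generic path would return):

```julia
# Narrow an abstractly typed result vector after the fact: promote the
# types of all stored values, then convert the vector to that type.
res = Real[5.0, 2, 0]                      # abstractly typed generic result
T = mapreduce(typeof, promote_type, res)   # Float64
narrowed = convert(Vector{T}, res)         # Vector{Float64}
```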

@bkamins
Member Author

bkamins commented Sep 12, 2021

What is doable is the following (and this is the correct way to do it in the fast path, as my current implementation is incorrect):

  • fast path:
    • try calling zero on the eltype of each column, ignoring columns whose eltype is Missing;
    • if it fails - fall back to the slow path;
    • if it works - sum these zeros and use the result as the init for the reduction;
  • slow path (i.e. some column eltype does not support zero): do what we do now (and accept that in corner cases it will error), then take promote_type over all values in the produced vector and convert the vector to the determined type.
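The fast-path init computation above can be sketched as follows (sum_init is a hypothetical helper, not the PR's actual function):

```julia
using DataFrames

function sum_init(df)
    # Ignore columns whose eltype is exactly Missing; for the rest take
    # zero of the nonmissing eltype. A MethodError here would mean some
    # eltype does not support zero, i.e. we must fall back to the slow path.
    cols = [c for c in eachcol(df) if eltype(c) !== Missing]
    return sum(zero(nonmissingtype(eltype(c))) for c in cols)
end

df = DataFrame(x1=[1, 2, missing], x2=[4.0, missing, missing])
sum_init(df)   # 0.0: zero(Int64) + zero(Float64) promotes to Float64
```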

I will now implement this to show how it will look (as I assume this is what you agree with 😄, but of course if you do not like it we can change it).

@bkamins
Member Author

bkamins commented Sep 12, 2021

OK - I have pushed some tentative implementation. I need to check its correctness and performance.

@bkamins
Member Author

bkamins commented Sep 12, 2021

It does the following now:

julia> df = DataFrame(x1=[1,2,missing], x2=[4.0,missing,missing])
3×2 DataFrame
 Row │ x1       x2        
     │ Int64?   Float64?  
─────┼────────────────────
   1 │       1        4.0 
   2 │       2  missing   
   3 │ missing  missing   

julia> select(df, AsTable(:) => ByRow(sum∘skipmissing))
3×1 DataFrame
 Row │ x1_x2_sum_skipmissing 
     │ Float64
─────┼───────────────────────
   1 │                   5.0 
   2 │                   2.0 
   3 │                   0.0 

julia> df = DataFrame(x1=Any[missing,2,3], x2=Any[4.0,missing,missing])
3×2 DataFrame
 Row │ x1       x2      
     │ Any      Any     
─────┼──────────────────
   1 │ missing  4.0     
   2 │ 2        missing 
   3 │ 3        missing 

julia> select(df, AsTable(:) => ByRow(sum∘skipmissing))
3×1 DataFrame
 Row │ x1_x2_sum_skipmissing 
     │ Float64
─────┼───────────────────────
   1 │                   4.0 
   2 │                   2.0 
   3 │                   3.0 

@bkamins
Member Author

bkamins commented Sep 12, 2021

And the timing is as follows:

julia> df = DataFrame(rand(10_000, 10_000), :auto);

julia> @time select(df, AsTable(:) => ByRow(sum∘skipmissing));
  0.082187 seconds (19.66 k allocations: 1.142 MiB)

in comparison to:

julia> @time sum(eachcol(df));
  0.461784 seconds (20.00 k allocations: 763.626 MiB, 41.38% gc time)

and

julia> x = collect(eachcol(df));

julia> @time sum(x);
  0.400074 seconds (20.00 k allocations: 763.626 MiB, 32.77% gc time)

julia> m = Matrix(df);

julia> @time sum(m, dims=2);
  0.054808 seconds (6 allocations: 78.281 KiB)

So we have some overhead over Matrix but I think it is acceptable.

@bkamins
Member Author

bkamins commented Sep 12, 2021

If I get an approval for this design I will update docs and tests. Then I will move forward with other reductions.
@nalimilan - can you please list which reductions you think are worth adding? (I would prefer to focus on selected reductions that are really worth adding rather than implement every conceivable one.)

@nalimilan
Member

Sounds good. I'd implement the same reductions as in grouping. Actually we could maybe reuse the same wrapper type, which could in theory allow users to implement their own reductions.

bkamins and others added 4 commits September 15, 2021 20:37
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@@ -62,6 +62,9 @@ const TRANSFORMATION_COMMON_RULES =
is small or a very large number of columns are processed
(in which case `SubDataFrame` avoids excessive compilation)

In order to improve the performance of the operations some transformations
invoke optimized implementation, see [`table_transformation`](@ref) for details.

Note! If the expression of the form `x => y` is passed then except for the special
Member

BTW, was this supposed to use the !!! note syntax?

Member Author

It is a docstring so I would leave it as it is now.

Comment on lines 65 to 66
In order to improve the performance of the operations some transformations
invoke optimized implementation, see [`table_transformation`](@ref) for details.
Member

Maybe put this at the end just after info about parallel operation? This is probably the last thing users need to know.

Member Author

OK

reduce(+, collect(eachcol(df)))
@test combine(df, All() => ByRow(min) => :min).min == minimum.(eachrow(m))
@test combine(df, All() => ByRow(max) => :max).max == maximum.(eachrow(m))
@test combine(df, All() => (+) => :sum).sum isa Vector{BigFloat}
Member

Are you confident that the eltype of the result is tested systematically for all code paths? It could make sense to use a custom Unicode equality operator that checks both value and type, and use it systematically for all == checks here.

Member Author

OK. I will add it (though I am pretty confident that I made the checks where they were relevant).
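Such an operator could look like this (≅ is a hypothetical name; the actual test helper may differ):

```julia
# Strict equality: values must compare equal AND have identical types,
# so eltype regressions are caught alongside value regressions.
≅(x, y) = typeof(x) == typeof(y) && x == y

[1.0, 2.0] ≅ [1.0, 2.0]   # true
[1.0, 2.0] ≅ [1, 2]       # false: Vector{Float64} vs Vector{Int64}
```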

bkamins and others added 3 commits October 19, 2021 22:07
@bkamins
Member Author

bkamins commented Oct 20, 2021

This should be good for another round of reviews.

Contributor

@pdeffebach pdeffebach left a comment

Thanks! I have a much better understanding of what I need to add to DataFramesMeta.jl to support this.


When `AsTable` is used as source column selector in the
Contributor

Is table_transformation a public facing API? Unclear from these docs.

Member Author

@bkamins
Member Author

bkamins commented Oct 22, 2021

@nalimilan - when reviewing before merging I thought to make the handling of 0-length selections, which are always tricky, more precise. I have pushed a commit that handles them more consistently, plus more tests making sure everything works as expected. Can you please have a quick look at it? Thank you!

@bkamins bkamins merged commit 8287ba7 into main Oct 23, 2021
@bkamins bkamins deleted the bk/fast_sum branch October 23, 2021 21:02
@bkamins
Member Author

bkamins commented Oct 23, 2021

Thank you!
