Add fast reductions #2869

Merged
merged 54 commits into from
Oct 23, 2021
Conversation

bkamins
Member

@bkamins bkamins commented Sep 9, 2021

Fixes #2439 #2768 #2440

I have implemented a proof of concept for sum. Let us discuss whether we like this approach. If we do, we can extend the list of supported reductions (comments on which reductions we would like to support are welcome).

CC @nalimilan @pdeffebach

@bkamins bkamins added the feature label Sep 9, 2021
@bkamins bkamins added this to the 1.3 milestone Sep 9, 2021
@bkamins
Member Author

bkamins commented Sep 9, 2021

TODO: once we decide what to do, update documentation and NEWS.md.

@bkamins
Member Author

bkamins commented Sep 10, 2021

@nalimilan The thing to discuss is the following tension:
This is what we currently do:

julia> df = DataFrame(x1=[1,2,missing], x2=[4.0,missing,missing])
3×2 DataFrame
 Row │ x1       x2        
     │ Int64?   Float64?  
─────┼────────────────────
   1 │       1        4.0
   2 │       2  missing   
   3 │ missing  missing   

julia> select(df, AsTable(:) => ByRow(sum∘skipmissing))
3×1 DataFrame
 Row │ x1_x2_sum_skipmissing 
     │ Real
─────┼───────────────────────
   1 │                   5.0
   2 │                   2
   3 │                   0

julia> df = DataFrame(x1=Any[missing,2,3], x2=Any[4.0,missing,missing])
3×2 DataFrame
 Row │ x1       x2      
     │ Any      Any     
─────┼──────────────────
   1 │ missing  4.0     
   2 │ 2        missing 
   3 │ 3        missing 

julia> select(df, AsTable(:) => ByRow(sum∘skipmissing))
3×1 DataFrame
 Row │ x1_x2_sum_skipmissing 
     │ Real
─────┼───────────────────────
   1 │                   4.0 
   2 │                   2   
   3 │                   3   

Which I think is not very nice. I would rather prefer to take promote_type of the eltypes of all columns and use that to determine the eltype of the output. In the first case above this would nicely produce a vector of Float64, which I think is preferable.
However, the consequence is that vectors of Any would be problematic, as we would not be able to establish the zero element in that case. What I propose is to internally narrow down the eltype of such vectors. If it is still Any, error (as it means the column holds values that are neither numeric nor missing). Otherwise perform the calculation using the columns with the narrowed-down eltype. What do you think?
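For illustration, a minimal sketch of the proposed rule on the first example above (not the PR's actual code):

```julia
using DataFrames

# Promote the eltypes of all columns, then strip Missing, since
# skipmissing removes missing values before the reduction runs.
df = DataFrame(x1=[1, 2, missing], x2=[4.0, missing, missing])
T = mapreduce(eltype, promote_type, eachcol(df))  # Union{Missing, Float64}
nonmissingtype(T)                                 # Float64
```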

@nalimilan
Member

Using promote_type would indeed make sense. That would be a difference between ByRow and plain broadcasting, right?

Regarding vectors with eltype Any, I'd rather not make any particular effort to make them work. Ideally we would do the same thing as without the fast path, even if that involves taking the slow path for them.

@bkamins
Member Author

bkamins commented Sep 10, 2021

That would be a difference between ByRow and plain broadcasting, right?

It is unrelated to ByRow/broadcasting. Neither map nor broadcasting do promote_type the way I propose (as you can see in my examples, the problem is that each row is treated differently based on a mixture of information about both the eltypes of the columns and the values stored in a given row). What I propose is essentially to determine the result type based ONLY on the eltypes of the columns (which is superior to what we have now, except for the case of columns with Any eltype).

Ideally we would do the same thing as without the fast path, even if that involves taking the slow path for them.

The problem is that taking the slow path will not yield any result, as for wide inputs Julia crashes - this PR is meant to avoid these crashes in the first place.
Instead of narrowing the eltype we could also just error if the column has Any eltype and produce an informative error message. I think in general we should discourage users from using the Any eltype in columns.
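As an aside, such narrowing can be done with broadcasting, which recomputes the narrowest element type (a sketch, not a proposed implementation):

```julia
# Broadcasting identity over an Any vector rebuilds it with the
# narrowest eltype that covers the stored values.
v = Any[missing, 2, 3]
w = identity.(v)        # eltype(w) is Union{Missing, Int64}
```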

@bkamins
Member Author

bkamins commented Sep 11, 2021

I have pushed the proposal for summation with skipping missings. It works as follows:

julia> df = DataFrame(x1=[1,2,missing], x2=[4.0,missing,missing])
3×2 DataFrame
 Row │ x1       x2        
     │ Int64?   Float64?  
─────┼────────────────────
   1 │       1        4.0 
   2 │       2  missing   
   3 │ missing  missing   

julia> select(df, AsTable(:) => ByRow(sum∘skipmissing))
3×1 DataFrame
 Row │ x1_x2_sum_skipmissing 
     │ Float64
─────┼───────────────────────
   1 │                   5.0 
   2 │                   2.0 
   3 │                   0.0 

julia> df = DataFrame(x1=Any[missing,2,3], x2=Any[4.0,missing,missing])
3×2 DataFrame
 Row │ x1       x2      
     │ Any      Any     
─────┼──────────────────
   1 │ missing  4.0     
   2 │ 2        missing 
   3 │ 3        missing 

julia> select(df, AsTable(:) => ByRow(sum∘skipmissing))
ERROR: ArgumentError: The reduced element type Any does not support zero that is required to perform summation. Narrowing down element types of passed columns should be performed first.    

It is possible to try harder to deduce the summation result in the second case, but I am not sure it is useful (it would add a lot of complexity to the code for a case that I am not sure we want to support anyway).

@nalimilan
Member

I see two difficult points to address:

  • Changing the behavior of ByRow is breaking. Minor changes to the eltype are probably fine (especially if we make them more narrow). But throwing an error in cases that currently work is problematic, and that would be the case for Any: even though it crashes with a large number of columns, it works in other cases. That's why I think that falling back to the current slow path would make sense to avoid any regressions.
  • The fast path for aggregations should be an implementation detail. ByRow(f) should be equivalent to ByRow(x -> f(x)), modulo minor differences like rounding errors or possibly the former working in cases where the latter fails. So if we use promote_type for reductions, we should probably use it for all functions (even if that implies converting the vector returned by map manually).

@bkamins
Member Author

bkamins commented Sep 11, 2021

That's why I think that falling back to the current slow path would make sense to avoid any regressions.

OK - I will change it this way

The fast path for aggregations should be an implementation detail. ByRow(f) should be equivalent to ByRow(x -> f(x))

I do not think it is possible. This is exactly the same problem as with fast aggregations for GroupedDataFrame.

If I write AsTable(some_cols) => ByRow(f) I do not know how f would use the NamedTuple passed. E.g. it might rely on column names etc.

The other problem is that even if we did some way to signal that f is a reduction the general approach will be slower than what is proposed here as in e.g. sum or sum∘skipmissing we explicitly use the knowledge of the way the reduction should be performed to make it efficient.

@nalimilan
Member

I do not think it is possible. This is exactly the same problem with as fast aggegations for GroupedDataFrame.

If I write AsTable(some_cols) => ByRow(f) I do not know how f would use the NamedTuple passed. E.g. it might rely on column names etc.

The other problem is that even if we did some way to signal that f is a reduction the general approach will be slower than what is proposed here as in e.g. sum or sum∘skipmissing we explicitly use the knowledge of the way the reduction should be performed to make it efficient.

What do you mean with "not possible"? What I suggest is to only enable optimizations for known aggregation functions for which we know the equivalence holds -- just like for grouped reduction. Isn't that possible here, or do you just mean that it's not possible in general for any function?

@bkamins
Member Author

bkamins commented Sep 12, 2021

Ah, you mean with:

ByRow(f) should be equivalent to ByRow(x -> f(x))

That the output should be identical - not the code path. Then it is possible, but it will be slow, and I am not sure it is desirable. See the first example:

julia> df = DataFrame(x1=[1,2,missing], x2=[4.0,missing,missing])
3×2 DataFrame
 Row │ x1       x2        
     │ Int64?   Float64?  
─────┼────────────────────
   1 │       1        4.0
   2 │       2  missing   
   3 │ missing  missing   

julia> select(df, AsTable(:) => ByRow(sum∘skipmissing))
3×1 DataFrame
 Row │ x1_x2_sum_skipmissing 
     │ Real
─────┼───────────────────────
   1 │                   5.0
   2 │                   2
   3 │                   0

Retaining this behavior is possible, but it would be complex to achieve, and I would argue that the result is not something most people would expect or want.

@nalimilan
Member

I'd rather do the reverse actually: always choose the return type using promote_type, even if that implies making a copy with a more narrow type after the fact in the general case. That's doable, right?
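Something like the following sketch (hypothetical, with a hand-written result vector standing in for what the generic path would return):

```julia
# Narrow an abstractly typed result vector after the fact: promote the
# types of all stored values, then convert the vector to that type.
res = Real[5.0, 2, 0]                      # abstractly typed generic result
T = mapreduce(typeof, promote_type, res)   # Float64
narrowed = convert(Vector{T}, res)         # Vector{Float64}
```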

@bkamins
Member Author

bkamins commented Sep 12, 2021

What is doable is the following (and this is the correct way to do it in the fast path, as my current implementation is incorrect):

  • fast path:
    • try calling zero on the eltype of each column, ignoring columns whose eltype is Missing;
    • if it fails - fall back to the slow path;
    • if it works - sum these zeros and use the result as the init for the reduction;
  • slow path (i.e. some column eltype does not support zero): do what we do now (and accept that in corner cases it will error), then take promote_type over all values in the produced vector and convert the vector to the determined type.
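The fast-path init computation above can be sketched as follows (sum_init is a hypothetical helper, not the PR's actual function):

```julia
using DataFrames

function sum_init(df)
    # Ignore columns whose eltype is exactly Missing; for the rest take
    # zero of the nonmissing eltype. A MethodError here would mean some
    # eltype does not support zero, i.e. we must fall back to the slow path.
    cols = [c for c in eachcol(df) if eltype(c) !== Missing]
    return sum(zero(nonmissingtype(eltype(c))) for c in cols)
end

df = DataFrame(x1=[1, 2, missing], x2=[4.0, missing, missing])
sum_init(df)   # 0.0: zero(Int64) + zero(Float64) promotes to Float64
```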

I will now implement this to show how it will look (as I assume this is what you agree with 😄, but of course if you do not like it we can change it).

@bkamins
Member Author

bkamins commented Sep 12, 2021

OK - I have pushed some tentative implementation. I need to check its correctness and performance.

@bkamins
Member Author

bkamins commented Sep 12, 2021

It does the following now:

julia> df = DataFrame(x1=[1,2,missing], x2=[4.0,missing,missing])
3×2 DataFrame
 Row │ x1       x2        
     │ Int64?   Float64?  
─────┼────────────────────
   1 │       1        4.0 
   2 │       2  missing   
   3 │ missing  missing   

julia> select(df, AsTable(:) => ByRow(sum∘skipmissing))
3×1 DataFrame
 Row │ x1_x2_sum_skipmissing 
     │ Float64
─────┼───────────────────────
   1 │                   5.0 
   2 │                   2.0 
   3 │                   0.0 

julia> df = DataFrame(x1=Any[missing,2,3], x2=Any[4.0,missing,missing])
3×2 DataFrame
 Row │ x1       x2      
     │ Any      Any     
─────┼──────────────────
   1 │ missing  4.0     
   2 │ 2        missing 
   3 │ 3        missing 

julia> select(df, AsTable(:) => ByRow(sum∘skipmissing))
3×1 DataFrame
 Row │ x1_x2_sum_skipmissing 
     │ Float64
─────┼───────────────────────
   1 │                   4.0 
   2 │                   2.0 
   3 │                   3.0 

@bkamins
Member Author

bkamins commented Sep 12, 2021

And the timing is as follows:

julia> df = DataFrame(rand(10_000, 10_000), :auto);

julia> @time select(df, AsTable(:) => ByRow(sum∘skipmissing));
  0.082187 seconds (19.66 k allocations: 1.142 MiB)

in comparison to:

julia> @time sum(eachcol(df));
  0.461784 seconds (20.00 k allocations: 763.626 MiB, 41.38% gc time)

and

julia> x = collect(eachcol(df));

julia> @time sum(x);
  0.400074 seconds (20.00 k allocations: 763.626 MiB, 32.77% gc time)

julia> m = Matrix(df);

julia> @time sum(m, dims=2);
  0.054808 seconds (6 allocations: 78.281 KiB)

So we have some overhead over Matrix but I think it is acceptable.

@bkamins
Member Author

bkamins commented Sep 12, 2021

If I get an approval for this design I will update docs and tests. Then I will move forward with other reductions.
@nalimilan - can you please list which reductions you think are worth adding? (I would prefer to focus on selected reductions that are really worth adding rather than implement every conceivable one.)

@nalimilan
Member

Sounds good. I'd implement the same reductions as in grouping. Actually we could maybe reuse the same wrapper type, which could in theory allow users to implement their own reductions.

bkamins and others added 4 commits September 15, 2021 20:37
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@@ -62,6 +62,9 @@ const TRANSFORMATION_COMMON_RULES =
is small or a very large number of columns are processed
(in which case `SubDataFrame` avoids excessive compilation)

In order to improve the performance of the operations some transformations
invoke optimized implementation, see [`table_transformation`](@ref) for details.

Note! If the expression of the form `x => y` is passed then except for the special
Member

BTW, was this supposed to use the !!! note syntax?

Member Author

It is a docstring so I would leave it as it is now.

Comment on lines 65 to 66
In order to improve the performance of the operations some transformations
invoke optimized implementation, see [`table_transformation`](@ref) for details.
Member

Maybe put this at the end just after info about parallel operation? This is probably the last thing users need to know.

Member Author

OK

reduce(+, collect(eachcol(df)))
@test combine(df, All() => ByRow(min) => :min).min == minimum.(eachrow(m))
@test combine(df, All() => ByRow(max) => :max).max == maximum.(eachrow(m))
@test combine(df, All() => (+) => :sum).sum isa Vector{BigFloat}
Member

Are you confident that the eltype of the result is tested systematically for all code paths? It could make sense to use a custom Unicode equality operator that checks both value and type, and use it systematically for all == checks here.

Member Author

OK. I will add it (though I am pretty confident that I made the checks where they were relevant).
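Such an operator could look like this (≅ is a hypothetical name; the actual test helper may differ):

```julia
# Strict equality: values must compare equal AND have identical types,
# so eltype regressions are caught alongside value regressions.
≅(x, y) = typeof(x) == typeof(y) && x == y

[1.0, 2.0] ≅ [1.0, 2.0]   # true
[1.0, 2.0] ≅ [1, 2]       # false: Vector{Float64} vs Vector{Int64}
```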

bkamins and others added 3 commits October 19, 2021 22:07
@bkamins
Member Author

bkamins commented Oct 20, 2021

This should be good for another round of reviews.

Contributor

@pdeffebach pdeffebach left a comment

Thanks! I have a much better understanding of what I need to add to DataFramesMeta.jl to support this.


When `AsTable` is used as source column selector in the
Contributor

Is table_transformation a public facing API? Unclear from these docs.

Member Author

@bkamins
Member Author

bkamins commented Oct 22, 2021

@nalimilan - when reviewing before merging I thought to make the handling of 0-length selections, which are always tricky, more precise. I have pushed a commit that handles them more consistently, plus more tests making sure everything works as expected. Can you please have a quick look at it? Thank you!

@bkamins bkamins merged commit 8287ba7 into main Oct 23, 2021
@bkamins bkamins deleted the bk/fast_sum branch October 23, 2021 21:02
@bkamins
Member Author

bkamins commented Oct 23, 2021

Thank you!
