-
Notifications
You must be signed in to change notification settings - Fork 371
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add fast reductions #2869
Add fast reductions #2869
Conversation
TODO: when we decide on what we do update documentation and NEWS.md |
@nalimilan The thing to discuss is the following tension:
Which I think is not very nice. I would rather prefer to do |
Use Regarding vector with eltype |
It is unrelated to
The problem is that taking a slow path will not yield any result as for wide inputs Julia crashes - this PR is meant to avoid these crashes in the first place. |
I have pushed the proposal for summation with skipping missings. It works as follows:
It is possible to try harder to deduce the summation result in the second case, but I am not sure it is useful (it would add a lot of complexity to the code in the case that I am not sure we want to support anyway). |
I see two difficult points to address:
|
OK - I will change it this way
I do not think it is possible. This is exactly the same problem with as fast aggegations for If I write The other problem is that even if we did some way to signal that |
What do you mean with "not possible"? What I suggest is to only enable optimizations for known aggregation functions for which we know the equivalence holds -- just like for grouped reduction. Isn't that possible here, or do you just mean that it's not possible in general for any function? |
Ah, you mean with:
That the output should be identical - not the code path. Then it is possible but it will be slow and I am not sure if it is desirable. See the first example:
Retaining this behavior is possible, but will be complex to achieve and I would argue that the result is not something I think most people would expect or want. |
I'd rather do the reverse actually: always choose the return type using |
What is doable is the following (and this is the correct way to do it in fast path, as my current implementation is incorrect):
I will now implement this to show how it will look (as I assume this is what you agree with 😄, but of course if you do not like it we can change it). |
OK - I have pushed some tentative implementation. I need to check its correctness and performance. |
It does the following now:
|
And the timing is as follows:
in comparison to:
and
So we have some overhead over |
If I get an approval for this design I will update docs and tests. Then I will move forward with other reductions. |
Sounds good. I'd implement the same reductions as in grouping. Actually we could maybe reuse the same wrapper type, which could in theory allow users to implement their own reductions. |
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
src/abstractdataframe/selection.jl
Outdated
@@ -62,6 +62,9 @@ const TRANSFORMATION_COMMON_RULES = | |||
is small or a very large number of columns are processed | |||
(in which case `SubDataFrame` avoids excessive compilation) | |||
|
|||
In order to improve the performance of the operations some transformations | |||
invoke optimized implementation, see [`table_transformation`](@ref) for details. | |||
|
|||
Note! If the expression of the form `x => y` is passed then except for the special |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, was this supposed to use the !!! note
syntax?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a docstring so I would leave it as it is now.
src/abstractdataframe/selection.jl
Outdated
In order to improve the performance of the operations some transformations | ||
invoke optimized implementation, see [`table_transformation`](@ref) for details. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe put this at the end just after info about parallel operation? This is probably the last thing users need to know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
reduce(+, collect(eachcol(df))) | ||
@test combine(df, All() => ByRow(min) => :min).min == minimum.(eachrow(m)) | ||
@test combine(df, All() => ByRow(max) => :max).max == maximum.(eachrow(m)) | ||
@test combine(df, All() => (+) => :sum).sum isa Vector{BigFloat} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you confident that the eltype of the result is tested systematically for all code paths? It could make sense to use a custom Unicode equality operator to check both value and type and use is systematically for all ==
checks here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I will add it (though I am pretty confident that I made the checks where they were relevant).
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
This should be good for another round of reviews. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I have a much better understanding of what I need to add to DataFramesMeta.jl to support this
``` | ||
|
||
When `AsTable` is used as source column selector in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is table_transformation
a public facing API? Unclear from these docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nalimilan - when reviewing before merging I thought to make more precise handling of 0-length selections which are always tricky. I have pushed a commit that handles this more consistently + more tests making sure all works as expected. Can you please have a quick look at it? Thank you! |
Thank you! |
Fixes #2439 #2768 #2440
I have implemented a proof of concept for
sum
. Let us discuss if we like this approach. If we do we can extend the list of supported reductions (and comments what reductions we would like to support are welcome).CC @nalimilan @pdeffebach