Add fast reductions #2869

Merged on Oct 23, 2021 (54 commits)
Showing changes from 46 commits:
b6ca433
add fast reduction for sum
bkamins Sep 9, 2021
b58ced8
first approach to summation with missings
bkamins Sep 11, 2021
1e5cb69
improve error message
bkamins Sep 11, 2021
748e609
implement safer reduction
bkamins Sep 12, 2021
d6339c0
add conversion
bkamins Sep 12, 2021
d1d6354
Update src/abstractdataframe/selection.jl
bkamins Sep 15, 2021
0d90feb
Update src/abstractdataframe/selection.jl
bkamins Sep 15, 2021
8f64633
Update src/abstractdataframe/selection.jl
bkamins Sep 15, 2021
97a98b9
Update src/abstractdataframe/selection.jl
bkamins Sep 15, 2021
5b88db8
Update src/abstractdataframe/selection.jl
bkamins Sep 15, 2021
adeb553
better handle sumz
bkamins Sep 15, 2021
59aef38
Merge branch 'bk/fast_sum' of https://github.com/JuliaData/DataFrames…
bkamins Sep 15, 2021
b972658
changes after code review
bkamins Sep 15, 2021
9d11641
fix typos
bkamins Sep 15, 2021
7924736
refactor code
bkamins Sep 15, 2021
98c48f2
improve implementation to make it use dispatch
bkamins Sep 15, 2021
54735f6
fix typo
bkamins Sep 15, 2021
f1b631c
add length and length with skipmissing
bkamins Sep 17, 2021
4c55191
add mean
bkamins Sep 18, 2021
7f0b152
split fast path to a separate file
bkamins Sep 18, 2021
7390702
add minium, maximum, min and max
bkamins Sep 18, 2021
77a3424
Apply suggestions from code review
bkamins Sep 20, 2021
0dcf6f5
use Base.add_sum + fix @noinline
bkamins Sep 20, 2021
9f56c00
Merge branch 'main' into bk/fast_sum
bkamins Sep 28, 2021
9b91f72
Merge branch 'bk/fast_sum' of https://github.com/JuliaData/DataFrames…
bkamins Sep 28, 2021
618832b
fix implementations
bkamins Sep 28, 2021
36450bc
finished design
bkamins Oct 16, 2021
903e57f
add tests of positional reductions
bkamins Oct 16, 2021
286fca0
added length tests
bkamins Oct 16, 2021
0473cd2
done sum testing
bkamins Oct 16, 2021
e06bc32
additional sum tests
bkamins Oct 16, 2021
a62b73f
finish mean et al. tests
bkamins Oct 16, 2021
e4536c6
add minimum and maximum tests
bkamins Oct 16, 2021
083a07f
Merge branch 'main' into bk/fast_sum
bkamins Oct 16, 2021
ba0208e
remove @show
bkamins Oct 16, 2021
a39aefd
update tests and docstring
bkamins Oct 17, 2021
1c0022b
fixes of x86 arch and Julia 1.0 problems
bkamins Oct 17, 2021
277bb24
fix 32-bit Julia issue
bkamins Oct 18, 2021
a766a92
fix more Julia 1.0.5 errors
bkamins Oct 18, 2021
a649925
Apply suggestions from code review
bkamins Oct 18, 2021
e53a7a2
improve docs
bkamins Oct 18, 2021
cc086d7
Merge branch 'bk/fast_sum' of https://github.com/JuliaData/DataFrames…
bkamins Oct 18, 2021
5680902
fix typo
bkamins Oct 18, 2021
cd5acdf
Fix code and add tests for Int32
bkamins Oct 18, 2021
54eed61
additional tests
bkamins Oct 19, 2021
4c46bca
Update src/abstractdataframe/selectionfast.jl
bkamins Oct 19, 2021
39790da
Update docs/src/lib/internals.md
bkamins Oct 19, 2021
0bfbc4a
Merge branch 'main' into bk/fast_sum
bkamins Oct 19, 2021
99d459e
update tests
bkamins Oct 19, 2021
05e6031
add NEWS.md
bkamins Oct 20, 2021
bb59dd7
Apply suggestions from code review
bkamins Oct 22, 2021
2fa0e01
0-length selection corner cases handling
bkamins Oct 22, 2021
8c83f36
Merge branch 'bk/fast_sum' of https://github.com/JuliaData/DataFrames…
bkamins Oct 22, 2021
07e47a1
fix Julia 1.0 and nightly
bkamins Oct 22, 2021
1 change: 1 addition & 0 deletions docs/src/lib/functions.md
@@ -82,6 +82,7 @@ repeat
 repeat!
 select
 select!
+table_transformation
 transform
 transform!
 vcat
11 changes: 11 additions & 0 deletions docs/src/lib/internals.md
@@ -16,4 +16,15 @@ getmaxwidths
 ourshow
 ourstrwidth
 @spawn_for_chunks
+default_table_transformation
 ```
+
+Note! When `AsTable` is used as the source column selector in the
+`source => function => target` mini-language supported by `select` and related
+functions, it is possible to override the default processing performed by
+function `function` by adding a [`table_transformation`](@ref) method for this
+function. This is most useful for custom reductions over columns of the `NamedTuple`
+created by `AsTable`, especially in cases when the user expects that very many
+columns (over 1000 as a rule of thumb) would be selected by the `AsTable` selector, in which
+case avoiding creation of the `NamedTuple` object significantly reduces compilation
+time (which is often longer than computation time in such cases).
4 changes: 2 additions & 2 deletions docs/src/man/split_apply_combine.md
@@ -79,8 +79,8 @@ as following arguments. The second type of signature is when a `Function` or a `
 is passed as the first argument and a `GroupedDataFrame` as the second argument
 (similar to `map`).

-As a special rule, with the `cols => function` and `cols => function =>
-target_cols` syntaxes, if `cols` is wrapped in an `AsTable`
+As a special rule, with the `cols => function` and
+`cols => function => target_cols` syntaxes, if `cols` is wrapped in an `AsTable`
 object then a `NamedTuple` containing columns selected by `cols` is passed to
 `function`.

1 change: 1 addition & 0 deletions src/DataFrames.jl
@@ -129,6 +129,7 @@ include("groupeddataframe/utils.jl")
 include("other/broadcasting.jl")

 include("abstractdataframe/selection.jl")
+include("abstractdataframe/selectionfast.jl")
 include("abstractdataframe/subset.jl")
 include("abstractdataframe/iteration.jl")
 include("abstractdataframe/reshape.jl")
25 changes: 19 additions & 6 deletions src/abstractdataframe/selection.jl
@@ -48,8 +48,8 @@ const TRANSFORMATION_COMMON_RULES =
 the target column or columns, which must be a single name (as a `Symbol` or a string),
 a vector of names or `AsTable`. Additionally it can be a `Function` which
 takes a string or a vector of strings as an argument containing names of columns
-selected by `cols`, and returns the target columns names (all accepted types
-except `AsTable` are allowed).
+selected by `cols`, and returns the target columns names (all accepted types
+except `AsTable` are allowed).
 4. a `col => target_cols` pair, which renames the column `col` to `target_cols`, which
 must be single name (as a `Symbol` or a string), a vector of names or `AsTable`.
 5. a `nrow` or `nrow => target_cols` form which efficiently computes the number of rows
@@ -62,6 +62,9 @@ const TRANSFORMATION_COMMON_RULES =
 is small or a very large number of columns are processed
 (in which case `SubDataFrame` avoids excessive compilation)

+In order to improve the performance of the operations some transformations
+invoke an optimized implementation; see [`table_transformation`](@ref) for details.
Review comment (Member): Maybe put this at the end, just after the info about parallel operation? This is probably the last thing users need to know.

Reply (Member Author): OK

Note! If the expression of the form `x => y` is passed then except for the special
Review comment (Member): BTW, was this supposed to use the `!!! note` syntax?

Reply (Member Author): It is a docstring so I would leave it as it is now.

convenience form `nrow => target_cols` it is always interpreted as
`cols => function`. In particular the following expression `function => target_cols`
@@ -400,11 +403,11 @@ _transformation_helper(df::AbstractDataFrame, col_idx::Int, (fun,)::Ref{Any}) =
 _empty_astable_helper(fun, len) = [fun(NamedTuple()) for _ in 1:len]

 function _transformation_helper(df::AbstractDataFrame, col_idx::AsTable, (fun,)::Ref{Any})
-    tbl = Tables.columntable(select(df, col_idx.cols, copycols=false))
-    if isempty(tbl) && fun isa ByRow
+    df_sel = select(df, col_idx.cols, copycols=false)
+    if ncol(df_sel) == 0 && fun isa ByRow
         return _empty_astable_helper(fun.fun, nrow(df))
     else
-        return fun(tbl)
+        return table_transformation(df_sel, fun)
     end
 end

@@ -415,7 +418,17 @@ function _transformation_helper(df::AbstractDataFrame, col_idx::AbstractVector{I
         return _empty_selector_helper(fun.fun, nrow(df))
     else
         cdf = eachcol(df)
-        return fun(map(c -> cdf[c], col_idx)...)
+        cols = map(c -> cdf[c], col_idx)
+        if (fun === +) || fun === ByRow(+) # removing parentheses leads to a parsing error
+            isempty(cols) && return +() # to make sure we produce a consistent error
+            return reduce(+, cols)
+        elseif fun === ByRow(min)
+            return _minmax_row_fast(cols, min)
+        elseif fun === ByRow(max)
+            return _minmax_row_fast(cols, max)
+        else
+            return fun(cols...)
+        end
     end
 end
