Add fast reductions (#2869)
bkamins authored Oct 23, 2021
1 parent 95d752b commit 8287ba7
Showing 8 changed files with 814 additions and 38 deletions.
7 changes: 7 additions & 0 deletions NEWS.md
@@ -85,6 +85,13 @@
 * fix a problem with not specialized `Pair` arguments passed as transformations
 ([#2889](https://github.com/JuliaData/DataFrames.jl/issues/2889))

+## Performance improvements
+
+* for selected common transformation specifications, e.g.
+`AsTable(...) => ByRow(sum)`, use custom implementations that
+lead to lower compilation latency and faster computation
+([#2869](https://github.com/JuliaData/DataFrames.jl/pull/2869))
+
 # DataFrames.jl v1.2.2 Patch Release Notes

 ## Bug fixes
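
To illustrate the kind of specification the performance entry in this NEWS hunk refers to, here is a minimal sketch of `AsTable(...) => ByRow(sum)`. The data frame and the target column name `:total` are made up for the example. The call itself is ordinary `select` usage, so it should return the same values whether or not the optimized path is taken.

```julia
using DataFrames

# Illustrative data; column names and values are assumptions for this example.
df = DataFrame(a = [1, 2, 3], b = [10, 20, 30], c = [100, 200, 300])

# Sum all selected columns row by row and store the result in :total.
res = select(df, AsTable(:) => ByRow(sum) => :total)
```

With many columns the difference should show up mainly in compilation latency rather than in the result.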
1 change: 1 addition & 0 deletions docs/src/lib/functions.md
@@ -82,6 +82,7 @@ repeat
 repeat!
 select
 select!
+table_transformation
 transform
 transform!
 vcat
11 changes: 11 additions & 0 deletions docs/src/lib/internals.md
@@ -16,4 +16,15 @@ getmaxwidths
 ourshow
 ourstrwidth
 @spawn_for_chunks
+default_table_transformation
 ```
+
+When `AsTable` is used as the source column selector in the
+`source => function => target` mini-language supported by `select` and related
+functions, it is possible to override the default processing performed by
+function `function` by adding a [`table_transformation`](@ref) method for this
+function. This is most useful for custom reductions over columns of the
+`NamedTuple` created by `AsTable`, especially when the user expects the `AsTable`
+selector to pick very many columns (over 1000 as a rule of thumb). In such cases
+avoiding creation of the `NamedTuple` object significantly reduces compilation
+time (which is often longer than computation time).
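
The override described in this paragraph could look roughly like the sketch below. It assumes a DataFrames.jl version that includes this commit, and that `table_transformation` takes the selected data frame and the function, as the call `table_transformation(df_sel, fun)` in `selection.jl` suggests. The reduction `rowspan` and the column names are hypothetical.

```julia
using DataFrames

# Hypothetical user-defined row reduction: spread between the largest and
# smallest value in a row. It receives a NamedTuple when used with AsTable.
rowspan(nt) = maximum(nt) - minimum(nt)

# Assumed extension point: a method specialized on ByRow{typeof(rowspan)}.
# It works on whole columns, so no per-row NamedTuple is created.
function DataFrames.table_transformation(df_sel::AbstractDataFrame,
                                         ::ByRow{typeof(rowspan)})
    cols = collect(eachcol(df_sel))           # vector of column vectors
    lo = reduce((a, b) -> min.(a, b), cols)   # elementwise minimum over columns
    hi = reduce((a, b) -> max.(a, b), cols)   # elementwise maximum over columns
    return hi .- lo
end

df = DataFrame(a = [1, 2], b = [5, 0], c = [3, 3])
res = select(df, AsTable(:) => ByRow(rowspan) => :span)
```

The specialized method is expected to return the same values as the generic `NamedTuple`-based path, so it changes only how the result is computed.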
4 changes: 2 additions & 2 deletions docs/src/man/split_apply_combine.md
@@ -79,8 +79,8 @@ as following arguments. The second type of signature is when a `Function` or a `
 is passed as the first argument and a `GroupedDataFrame` as the second argument
 (similar to `map`).

-As a special rule, with the `cols => function` and `cols => function =>
-target_cols` syntaxes, if `cols` is wrapped in an `AsTable`
+As a special rule, with the `cols => function` and
+`cols => function => target_cols` syntaxes, if `cols` is wrapped in an `AsTable`
 object then a `NamedTuple` containing columns selected by `cols` is passed to
 `function`.
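
A small sketch of the `AsTable` rule stated in this hunk, using `combine` on a `GroupedDataFrame`. The data, the anonymous function, and the target name `:total` are illustrative assumptions. The function receives one `NamedTuple` per group whose fields are the selected columns.

```julia
using DataFrames

# Illustrative data; names and values are assumptions for this example.
df = DataFrame(g = [1, 1, 2], x = [1, 2, 3], y = [10, 20, 40])
gdf = groupby(df, :g)

# `nt` is a NamedTuple with fields :x and :y holding the group's columns.
res = combine(gdf, AsTable([:x, :y]) => (nt -> sum(nt.x) + sum(nt.y)) => :total)
```

Without the `AsTable` wrapper the same columns would be passed to the function as separate positional arguments instead.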

1 change: 1 addition & 0 deletions src/DataFrames.jl
@@ -129,6 +129,7 @@ include("groupeddataframe/utils.jl")
 include("other/broadcasting.jl")

 include("abstractdataframe/selection.jl")
+include("abstractdataframe/selectionfast.jl")
 include("abstractdataframe/subset.jl")
 include("abstractdataframe/iteration.jl")
 include("abstractdataframe/reshape.jl")
43 changes: 34 additions & 9 deletions src/abstractdataframe/selection.jl
@@ -48,8 +48,8 @@ const TRANSFORMATION_COMMON_RULES =
 the target column or columns, which must be a single name (as a `Symbol` or a string),
 a vector of names or `AsTable`. Additionally it can be a `Function` which
 takes a string or a vector of strings as an argument containing names of columns
-selected by `cols`, and returns the target columns names (all accepted types
-except `AsTable` are allowed).
+selected by `cols`, and returns the target columns names (all accepted types
+except `AsTable` are allowed).
 4. a `col => target_cols` pair, which renames the column `col` to `target_cols`, which
 must be single name (as a `Symbol` or a string), a vector of names or `AsTable`.
 5. a `nrow` or `nrow => target_cols` form which efficiently computes the number of rows
@@ -166,6 +166,9 @@ const TRANSFORMATION_COMMON_RULES =
 variables (i.e. they should be pure), or use locks to control parallel accesses.
 In the future, parallelism may be extended to other cases, so this requirement
 also holds for `DataFrame` inputs.
+
+In order to improve the performance of the operations some transformations
+invoke an optimized implementation, see [`table_transformation`](@ref) for details.
 """

 """
@@ -400,22 +403,44 @@ _transformation_helper(df::AbstractDataFrame, col_idx::Int, (fun,)::Ref{Any}) =
 _empty_astable_helper(fun, len) = [fun(NamedTuple()) for _ in 1:len]

 function _transformation_helper(df::AbstractDataFrame, col_idx::AsTable, (fun,)::Ref{Any})
-    tbl = Tables.columntable(select(df, col_idx.cols, copycols=false))
-    if isempty(tbl) && fun isa ByRow
-        return _empty_astable_helper(fun.fun, nrow(df))
+    df_sel = select(df, col_idx.cols, copycols=false)
+    if ncol(df_sel) == 0
+        if fun isa ByRow
+            # work around fact that length∘skipmissing is not supported in Julia Base yet
+            if fun === ByRow(length∘skipmissing)
+                return _empty_astable_helper(length, nrow(df))
+            else
+                return _empty_astable_helper(fun.fun, nrow(df))
+            end
+        else
+            return fun(NamedTuple())
+        end
     else
-        return fun(tbl)
+        return table_transformation(df_sel, fun)
     end
 end

 _empty_selector_helper(fun, len) = [fun() for _ in 1:len]

 function _transformation_helper(df::AbstractDataFrame, col_idx::AbstractVector{Int}, (fun,)::Ref{Any})
-    if isempty(col_idx) && fun isa ByRow
-        return _empty_selector_helper(fun.fun, nrow(df))
+    if isempty(col_idx)
+        if fun isa ByRow
+            return _empty_selector_helper(fun.fun, nrow(df))
+        else
+            return fun()
+        end
     else
         cdf = eachcol(df)
-        return fun(map(c -> cdf[c], col_idx)...)
+        cols = map(c -> cdf[c], col_idx)
+        if (fun === +) || fun === ByRow(+) # removing parentheses leads to a parsing error
+            return reduce(+, cols)
+        elseif fun === ByRow(min)
+            return _minmax_row_fast(cols, min)
+        elseif fun === ByRow(max)
+            return _minmax_row_fast(cols, max)
+        else
+            return fun(cols...)
+        end
     end
 end
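
From the caller's side, the special cases in this hunk (`+`, `ByRow(min)` and `ByRow(max)` applied to a vector of columns) correspond to specifications like the ones in the sketch below. The data and target column names are assumptions for illustration. The calls are plain `select` usage and do not depend on the internal helper `_minmax_row_fast`.

```julia
using DataFrames

# Illustrative data; names and values are assumptions for this example.
df = DataFrame(a = [1, 5, 3], b = [4, 2, 6])

res = select(df,
             [:a, :b] => (+) => :sum,        # whole-column addition
             [:a, :b] => ByRow(min) => :lo,  # row-wise minimum
             [:a, :b] => ByRow(max) => :hi)  # row-wise maximum
```

Note that `+` is written as `(+)` inside the pair, which is valid Julia syntax for referring to the operator as a function.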
