Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix float grouping #2791

Merged
merged 15 commits into from
Jun 28, 2021
2 changes: 2 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,8 @@

## Bug fixes

* fix bug in how `groupby` handles grouping of float columns
([#2791](https://github.com/JuliaData/DataFrames.jl/pull/2791))
* fix bug in how `issorted` handles custom orderings and improve performance
of sorting when complex custom orderings are passed
([#2746](https://github.com/JuliaData/DataFrames.jl/pull/2746))
Expand Down
26 changes: 22 additions & 4 deletions src/groupeddataframe/utils.jl
Original file line number Diff line number Diff line change
Expand Up @@ -143,10 +143,28 @@ function refpool_and_array(x::AbstractArray)
else
return nothing, nothing
end
elseif x isa AbstractArray{<:Union{Real, Missing}} &&
all(v -> ismissing(v) | isinteger(v), x) &&
!isempty(skipmissing(x))
minval, maxval = extrema(skipmissing(x))
elseif x isa AbstractArray{<:Union{Real, Missing}} && length(x) > 0
if !(x isa AbstractArray{<:Union{Integer, Missing}})
if !all(v -> (ismissing(v) | isinteger(v)) & (!isequal(v, -0.0)), x)
return nothing, nothing
end
end
if Missing <: eltype(x)
smx = skipmissing(x)
y = iterate(smx)
y === nothing && return nothing, nothing
(v, s) = y
minval = maxval = v
while true
y = iterate(smx, s)
y === nothing && break
(v, s) = y
maxval = max(v, maxval)
minval = min(v, minval)
end
else
minval, maxval = extrema(x)
end
ngroups = big(maxval) - big(minval) + 1
# Threshold chosen with the same rationale as the row_group_slots refpool method:
# refpool approach is faster but we should not allocate too much memory either
Expand Down
8 changes: 8 additions & 0 deletions test/grouping.jl
Original file line number Diff line number Diff line change
Expand Up @@ -3710,6 +3710,14 @@ end
end
end

@testset "grouping floats" begin
@test length(groupby_checked(DataFrame(a=[0.0, -0.0]), :a)) == 2
@test getindex.(keys(groupby_checked(DataFrame(a=[3.0, 2.0, 0.0]), :a)), 1) ==
[0, 2, 3]
@test getindex.(keys(groupby_checked(DataFrame(a=[3.0, 2.0, -0.0]), :a)), 1) ==
[3, 2, 0]
end

@testset "aggregation with matrix of Pair" begin
df = DataFrame(a=["a", "b","a", "b"], x=1:4, y=11:14)
gdf = groupby_checked(df, :a)
Expand Down