-
Notifications
You must be signed in to change notification settings - Fork 371
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Run findall(rows)
only if rows
are not all true
#2727
Conversation
Addition of _findall(rows::AbstractVector{Bool}) to not run `findall(rows)` if `all(rows .== true)`
Tests for this function added.
I think this is more what you were talking about :) In the first In the second I've tried to comment code a lot as operations are not obvious. In general this solution is sometimes a bit slower in case of random Bellow I attach some benchmarks: BD = OrderedDict(
"T Big" => trues(100000),
"F Big" => falses(100000),
"T64 F64" => [trues(64); falses(64)],
"F64 T64" => [falses(64); trues(64)],
"F80 T100" => [falses(85); trues(100)],
"F256 T32" => [falses(256); trues(32)],
"F260 T32" => [falses(260); trues(32)],
"TF Big" => [trues(100000); falses(100000)],
"FT Big" => [falses(100000);trues(100000)] ,
# some edge cases
"TFT small" => [trues(85); falses(100); trues(85)],
"FTFFT small" => [falses(64 + 32); trues(32); falses(128); trues(32)],
"TFTF small" => [falses(64); trues(64); falses(64); trues(64)],
"TFT small" => [trues(64); falses(10); trues(100)],
"FTF Big" => [falses(8500); trues(100000); falses(65000)],
"TFT Big" => [trues(8500); falses(100000); trues(65000)],
"FTFTFTF Big" => [falses(65000); trues(65000); falses(65000); trues(65000); falses(65000); trues(65000); falses(65000)],
"FTFR small" => [falses(85); trues(100); falses(65); rand([true, false], 20)],
"R Big" => BitVector(rand([true, false], 200000)),
"RF Big" => [BitVector(rand([true, false], 100000)) ; falses(100000)],
"RT Big" => [BitVector(rand([true, false], 100000)) ; trues(100000)],
"FTFR Big" => [falses(65000); trues(65000); falses(65000); rand([true, false], 20000)],
"T256 R100" => [trues(256); rand([true, false], 100)],
"F256 R100" => [falses(256); rand([true, false], 100)],
)
for (l, B) in BD
println("Processing $l")
@show Base.findall(B) == DataFrames._findall(B)
@btime Base.findall($B)
@btime DataFrames._findall($B)
end
|
Random tests made bigger
Could you please also resolve merge conflicts so that we can run the CI? Thank you! |
With small number of random true/false generated, there is probability of getting `UnitRange{Int}` insted of `Vector{Int}`
|
I don't know how it works but I've checked coverage of line 223 and it is covered in multiple tests with test |
Yes - I have checked this also :). It seems that probably the compiler somehow optimizes these lines out (not sure though). |
@@ -96,7 +96,7 @@ Base.@propagate_inbounds function SubDataFrame(parent::DataFrame, rows::Abstract | |||
throw(ArgumentError("invalid length of `AbstractVector{Bool}` row index " * | |||
"(got $(length(rows)), expected $(nrow(parent)))")) | |||
end | |||
return SubDataFrame(parent, findall(rows), cols) | |||
return SubDataFrame(parent, _findall(rows), cols) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also test performance and corrected of view
(it should be OK, but just to be sure as this is the point of the change we make so I prefer to double check). Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll check it with some more scenario than in the beginning of this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fortunately it is faster :)
df = DataFrame(
a = rand(1000000),
b = allowmissing(rand(1000000)),
c = [allowmissing(rand(1000000 - 1)); missing],
d = repeat([1:9; missing], 100000),
e = repeat([rand(50000); missings(50000)], 10),
f = [missings(300000); ones(400000); missings(300000)],
g = missings(1000000)
)
Main:
julia> @btime dropmissing(df, :a, view=true);
1.872 ms (9 allocations: 7.75 MiB)
julia> @btime dropmissing(df, :b, view=true);
2.075 ms (14 allocations: 7.75 MiB)
julia> @btime dropmissing(df, :c, view=true);
2.111 ms (14 allocations: 7.75 MiB)
julia> @btime dropmissing(df, :d, view=true);
1.913 ms (14 allocations: 6.99 MiB)
julia> @btime dropmissing(df, :e, view=true);
1.117 ms (14 allocations: 3.94 MiB)
julia> @btime dropmissing(df, :f, view=true);
909.713 μs (14 allocations: 3.18 MiB)
julia> @btime dropmissing(df, :g, view=true);
152.216 μs (13 allocations: 126.67 KiB)
julia> @btime dropmissing(df, view=true);
1.432 ms (38 allocations: 1.63 MiB)
Main + this PR:
julia> @btime dropmissing(df, :a, view=true);
17.093 μs (7 allocations: 122.34 KiB)
julia> @btime dropmissing(df, :b, view=true);
162.656 μs (12 allocations: 126.59 KiB)
julia> @btime dropmissing(df, :c, view=true);
168.988 μs (12 allocations: 126.59 KiB)
julia> @btime dropmissing(df, :d, view=true);
1.890 ms (14 allocations: 6.99 MiB)
julia> @btime dropmissing(df, :e, view=true);
1.100 ms (14 allocations: 3.94 MiB)
julia> @btime dropmissing(df, :f, view=true);
166.613 μs (12 allocations: 126.59 KiB)
julia> @btime dropmissing(df, :g, view=true);
188.653 μs (13 allocations: 126.67 KiB)
julia> @btime dropmissing(df, view=true);
1.410 ms (38 allocations: 1.63 MiB)
Co-authored-by: Bogumił Kamiński <bkamins@sgh.waw.pl>
@nalimilan - after all my comments are resolved and CI passes could you please have a look (I have checked the logic of the whole code in detail and we have good test coverage so it is rather safe, but you always have a keen eye for problems). I am OK to merge the PR. @pstorozenko - a very good PR |
Co-authored-by: Bogumił Kamiński <bkamins@sgh.waw.pl>
Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I haven't checked the logic as the body of the function is scary. Can you just fix the uncovered line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missclick
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@nalimilan What do you mean by fix the uncovered line? |
I have also checked the logic that the important lines are covered. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Thank you! |
PR that resulted from #2724 disussion.
Addition of
_findall(rows::AbstractVector{Bool})
to not run
findall(rows)
ifall(rows .== true)
infunction SubDataFrame(parent::DataFrame, rows::AbstractVector{Bool}, cols)
I'm not sure if
src/other/utils.jl
is a good location for this function.Benchmarks:
Main branch
Main + this fix
After
completecases
fix #2726After
completecases
fix #2726 + this fixSummary:
completecases
to process only missingable columns #2726 Reduces time ofdropmissing(df, view=true)
dropmissing(df, :a, view=true)
. In the worst case scenario (as with:c
column) where there is only one missing in the end it is a bit slower, but I think that it is very rarely case. Usually we'll find missings/falses earlier.