Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run findall(rows) only if rows are not all true #2727

Merged
merged 17 commits into from
May 16, 2021

Conversation

pstorozenko
Copy link
Contributor

PR that resulted from #2724 disussion.
Addition of _findall(rows::AbstractVector{Bool})
to not run findall(rows) if all(rows .== true) in
function SubDataFrame(parent::DataFrame, rows::AbstractVector{Bool}, cols)

I'm not sure if src/other/utils.jl is a good location for this function.


Benchmarks:

df = DataFrame(
    a = rand(1000000),
    b = allowmissing(rand(1000000)),
    c = [allowmissing(rand(1000000 - 1)); missing],
    d = repeat([1:9; missing], 100000)
)

Main branch

 julia> @btime dropmissing(df, :a, view=true);
  2.040 ms (15 allocations: 7.75 MiB)

julia> @btime dropmissing(df, :b, view=true);
  2.071 ms (15 allocations: 7.75 MiB)

julia> @btime dropmissing(df, :c, view=true);
  2.098 ms (15 allocations: 7.75 MiB)

julia> @btime dropmissing(df, :d, view=true);
  1.911 ms (15 allocations: 6.99 MiB)

julia> @btime dropmissing(df, view=true);
  7.677 ms (37 allocations: 7.00 MiB)

Main + this fix

julia> @btime dropmissing(df, :a, view=true);
  388.342 μs (13 allocations: 126.64 KiB)

julia> @btime dropmissing(df, :b, view=true);
  399.634 μs (13 allocations: 126.64 KiB)

julia> @btime dropmissing(df, :c, view=true);
  2.103 ms (15 allocations: 7.75 MiB)

julia> @btime dropmissing(df, :d, view=true);
  1.897 ms (15 allocations: 6.99 MiB)

julia> @btime dropmissing(df, view=true);
  7.679 ms (37 allocations: 7.00 MiB)

After completecases fix #2726

julia> @btime dropmissing(df, :a, view=true);
  1.883 ms (10 allocations: 7.75 MiB)

julia> @btime dropmissing(df, :b, view=true);
  2.075 ms (15 allocations: 7.75 MiB)

julia> @btime dropmissing(df, :c, view=true);
  2.095 ms (15 allocations: 7.75 MiB)

julia> @btime dropmissing(df, :d, view=true);
  1.905 ms (15 allocations: 6.99 MiB)

julia> @btime dropmissing(df, view=true);
  2.351 ms (28 allocations: 7.12 MiB)

After completecases fix #2726 + this fix

julia> @btime dropmissing(df, :a, view=true);
  251.012 μs (8 allocations: 122.38 KiB)

julia> @btime dropmissing(df, :b, view=true);
  399.373 μs (13 allocations: 126.64 KiB)

julia> @btime dropmissing(df, :c, view=true);
  2.109 ms (15 allocations: 7.75 MiB)

julia> @btime dropmissing(df, :d, view=true);
  1.903 ms (15 allocations: 6.99 MiB)

julia> @btime dropmissing(df, view=true);
  2.355 ms (28 allocations: 7.12 MiB)

Summary:

  • Optimize completecases to process only missingable columns #2726 Reduces time of dropmissing(df, view=true)
  • This fix reduces average time for dropmissing(df, :a, view=true). In the worst case scenario (as with :c column) where there is only one missing in the end it is a bit slower, but I think that it is very rarely case. Usually we'll find missings/falses earlier.

Addition of _findall(rows::AbstractVector{Bool})
to not run `findall(rows)` if `all(rows .== true)`
@bkamins bkamins added this to the 1.x milestone Apr 18, 2021
Tests for this function added.
@pstorozenko
Copy link
Contributor Author

I think this is more what you were talking about :)
I took Base.findall(B::BitArray) function and rewrote/adapted it for our needs.

In the first while true we process B in the optimized way.
After it we either returned solution or materialized result I.

In the second while true we process B in almost the same way as in Base.findall(B::BitArray).

I've tried to comment code a lot as operations are not obvious.
I've used one line ifs like Bi == length(Bc) && return start:length(B) because it makes code much cleaner, but I don't know if you like it.
Tell me what do you think about this whole approach.

In general this solution is sometimes a bit slower in case of random B.
This is because Base.findall(B::BitArray) does not try to jump over blocks with trues only.
In case of bigger blocks of trues it is faster.
Of course in case of single, continuous block of trues it is much, much faster.

Bellow I attach some benchmarks:

BD = OrderedDict(
    "T Big" => trues(100000),
    "F Big" => falses(100000),
    "T64 F64" => [trues(64); falses(64)],
    "F64 T64" => [falses(64); trues(64)],
    "F80 T100" => [falses(85); trues(100)],
    "F256 T32" => [falses(256); trues(32)],
    "F260 T32" => [falses(260); trues(32)],
    "TF Big" => [trues(100000); falses(100000)],
    "FT Big" => [falses(100000);trues(100000)] ,

    # some edge cases
    "TFT small" => [trues(85); falses(100); trues(85)],
    "FTFFT small" => [falses(64 + 32); trues(32); falses(128); trues(32)],
    "TFTF small" => [falses(64); trues(64); falses(64); trues(64)],
    "TFT small" => [trues(64); falses(10); trues(100)],

    "FTF Big" => [falses(8500); trues(100000); falses(65000)],
    "TFT Big" => [trues(8500); falses(100000); trues(65000)],
    "FTFTFTF Big" => [falses(65000); trues(65000); falses(65000); trues(65000); falses(65000); trues(65000); falses(65000)],

    "FTFR small" => [falses(85); trues(100); falses(65); rand([true, false], 20)],
    "R Big" => BitVector(rand([true, false], 200000)),
    "RF Big" => [BitVector(rand([true, false], 100000)) ; falses(100000)],
    "RT Big" => [BitVector(rand([true, false], 100000)) ; trues(100000)],
    "FTFR Big" => [falses(65000);  trues(65000);  falses(65000); rand([true, false], 20000)],
    "T256 R100" => [trues(256);  rand([true, false], 100)],
    "F256 R100" => [falses(256); rand([true, false], 100)],
)

for (l, B) in BD
    println("Processing $l")
    @show Base.findall(B) == DataFrames._findall(B)
    @btime Base.findall($B)
    @btime DataFrames._findall($B)
end
Processing T Big
Base.findall(B) == DataFrames._findall(B) = true
  109.098 μs (2 allocations: 781.33 KiB)
  398.537 ns (0 allocations: 0 bytes)
Processing F Big
Base.findall(B) == DataFrames._findall(B) = true
  430.095 ns (1 allocation: 80 bytes)
  421.095 ns (1 allocation: 80 bytes)
Processing T64 F64
Base.findall(B) == DataFrames._findall(B) = true
  104.586 ns (1 allocation: 624 bytes)
  41.959 ns (1 allocation: 80 bytes)
Processing F64 T64
Base.findall(B) == DataFrames._findall(B) = true
  104.734 ns (1 allocation: 624 bytes)
  41.498 ns (1 allocation: 80 bytes)
Processing F80 T100
Base.findall(B) == DataFrames._findall(B) = true
  143.548 ns (1 allocation: 896 bytes)
  46.941 ns (1 allocation: 80 bytes)
Processing F256 T32
Base.findall(B) == DataFrames._findall(B) = true
  73.466 ns (1 allocation: 336 bytes)
  45.835 ns (1 allocation: 80 bytes)
Processing F260 T32
Base.findall(B) == DataFrames._findall(B) = true
  73.387 ns (1 allocation: 336 bytes)
  45.847 ns (1 allocation: 80 bytes)
Processing TF Big
Base.findall(B) == DataFrames._findall(B) = true
  105.403 μs (2 allocations: 781.33 KiB)
  1.467 μs (1 allocation: 80 bytes)
Processing FT Big
Base.findall(B) == DataFrames._findall(B) = true
  105.608 μs (2 allocations: 781.33 KiB)
  1.987 μs (1 allocation: 80 bytes)
Processing TFT small
Base.findall(B) == DataFrames._findall(B) = true
  217.728 ns (1 allocation: 1.45 KiB)
  188.086 ns (2 allocations: 1.41 KiB)
Processing FTFFT small
Base.findall(B) == DataFrames._findall(B) = true
  106.897 ns (1 allocation: 624 bytes)
  112.246 ns (2 allocations: 624 bytes)
Processing TFTF small
Base.findall(B) == DataFrames._findall(B) = true
  215.025 ns (1 allocation: 1.14 KiB)
  160.212 ns (2 allocations: 1.14 KiB)
Processing FTF Big
Base.findall(B) == DataFrames._findall(B) = true
  127.988 μs (2 allocations: 781.33 KiB)
  1.767 μs (1 allocation: 80 bytes)
Processing TFT Big
Base.findall(B) == DataFrames._findall(B) = true
  95.470 μs (2 allocations: 574.33 KiB)
  66.360 μs (2 allocations: 574.39 KiB)
Processing FTFTFTF Big
Base.findall(B) == DataFrames._findall(B) = true
  253.018 μs (2 allocations: 1.49 MiB)
  194.054 μs (2 allocations: 1.49 MiB)
Processing FTFR small
Base.findall(B) == DataFrames._findall(B) = true
  189.834 ns (1 allocation: 1008 bytes)
  153.568 ns (2 allocations: 976 bytes)
Processing R Big
Base.findall(B) == DataFrames._findall(B) = true
  153.825 μs (2 allocations: 778.89 KiB)
  119.013 μs (2 allocations: 778.95 KiB)
Processing RF Big
Base.findall(B) == DataFrames._findall(B) = true
  78.821 μs (2 allocations: 389.64 KiB)
  59.672 μs (2 allocations: 389.70 KiB)
Processing RT Big
Base.findall(B) == DataFrames._findall(B) = true
  204.643 μs (2 allocations: 1.15 MiB)
  165.110 μs (2 allocations: 1.15 MiB)
Processing FTFR Big
Base.findall(B) == DataFrames._findall(B) = true
  100.556 μs (2 allocations: 586.64 KiB)
  72.579 μs (2 allocations: 586.70 KiB)
Processing T256 R100
Base.findall(B) == DataFrames._findall(B) = true
  659.301 ns (1 allocation: 2.56 KiB)
  519.550 ns (2 allocations: 2.58 KiB)
Processing F256 R100
Base.findall(B) == DataFrames._findall(B) = true
  120.717 ns (1 allocation: 544 bytes)
  131.407 ns (2 allocations: 528 bytes)

Random tests made bigger
@bkamins
Copy link
Member

bkamins commented May 9, 2021

Could you please also resolve merge conflicts so that we can run the CI? Thank you!

@pstorozenko
Copy link
Contributor Author

B.chunks is always a Vector{UInt64} in both 32bit and 64bit julia

@pstorozenko
Copy link
Contributor Author

I don't know how it works but I've checked coverage of line 223 and it is covered in multiple tests with test T64 F64 being the prime example.

@bkamins
Copy link
Member

bkamins commented May 10, 2021

Yes - I have checked this also :). It seems that probably the compiler somehow optimizes these lines out (not sure though).

@@ -96,7 +96,7 @@ Base.@propagate_inbounds function SubDataFrame(parent::DataFrame, rows::Abstract
throw(ArgumentError("invalid length of `AbstractVector{Bool}` row index " *
"(got $(length(rows)), expected $(nrow(parent)))"))
end
return SubDataFrame(parent, findall(rows), cols)
return SubDataFrame(parent, _findall(rows), cols)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also test performance and corrected of view (it should be OK, but just to be sure as this is the point of the change we make so I prefer to double check). Thank you!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll check it with some more scenario than in the beginning of this PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fortunately it is faster :)

df = DataFrame(
    a = rand(1000000),
    b = allowmissing(rand(1000000)),
    c = [allowmissing(rand(1000000 - 1)); missing],
    d = repeat([1:9; missing], 100000),
    e = repeat([rand(50000); missings(50000)], 10),
    f = [missings(300000); ones(400000); missings(300000)],
    g = missings(1000000)
)

Main:

julia> @btime dropmissing(df, :a, view=true);
  1.872 ms (9 allocations: 7.75 MiB)

julia> @btime dropmissing(df, :b, view=true);
  2.075 ms (14 allocations: 7.75 MiB)

julia> @btime dropmissing(df, :c, view=true);
  2.111 ms (14 allocations: 7.75 MiB)

julia> @btime dropmissing(df, :d, view=true);
  1.913 ms (14 allocations: 6.99 MiB)

julia> @btime dropmissing(df, :e, view=true);
  1.117 ms (14 allocations: 3.94 MiB)

julia> @btime dropmissing(df, :f, view=true);
  909.713 μs (14 allocations: 3.18 MiB)
 
julia> @btime dropmissing(df, :g, view=true);
  152.216 μs (13 allocations: 126.67 KiB)

julia> @btime dropmissing(df, view=true);
  1.432 ms (38 allocations: 1.63 MiB)

Main + this PR:

julia> @btime dropmissing(df, :a, view=true);
  17.093 μs (7 allocations: 122.34 KiB)

julia> @btime dropmissing(df, :b, view=true);
  162.656 μs (12 allocations: 126.59 KiB)

julia> @btime dropmissing(df, :c, view=true);
  168.988 μs (12 allocations: 126.59 KiB)

julia> @btime dropmissing(df, :d, view=true);
  1.890 ms (14 allocations: 6.99 MiB)

julia> @btime dropmissing(df, :e, view=true);
  1.100 ms (14 allocations: 3.94 MiB)

julia> @btime dropmissing(df, :f, view=true);
  166.613 μs (12 allocations: 126.59 KiB)

julia> @btime dropmissing(df, :g, view=true);
  188.653 μs (13 allocations: 126.67 KiB)
  
julia> @btime dropmissing(df, view=true);
  1.410 ms (38 allocations: 1.63 MiB)

Co-authored-by: Bogumił Kamiński <bkamins@sgh.waw.pl>
@bkamins
Copy link
Member

bkamins commented May 10, 2021

@nalimilan - after all my comments are resolved and CI passes could you please have a look (I have checked the logic of the whole code in detail and we have good test coverage so it is rather safe, but you always have a keen eye for problems). I am OK to merge the PR.

@pstorozenko - a very good PR

pstorozenko and others added 2 commits May 10, 2021 22:20
@pstorozenko
Copy link
Contributor Author

Thanks!

@bkamins bkamins requested a review from nalimilan May 11, 2021 18:13
Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I haven't checked the logic as the body of the function is scary. Can you just fix the uncovered line?

Copy link
Contributor Author

@pstorozenko pstorozenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missclick

pstorozenko and others added 2 commits May 14, 2021 13:42
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@pstorozenko
Copy link
Contributor Author

pstorozenko commented May 14, 2021

@nalimilan What do you mean by fix the uncovered line?
Do you mean code coverage? I'm sure all lines (except for @assert false "should not be reached") are covered by tests (I tested with good old @show), but for some reason codecov is showing differently.

@bkamins
Copy link
Member

bkamins commented May 14, 2021

but for some reason codecov is showing differently.

I have also checked the logic that the important lines are covered.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@bkamins bkamins merged commit ed00241 into JuliaData:main May 16, 2021
@bkamins
Copy link
Member

bkamins commented May 16, 2021

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants