Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement permutedims #2447

Merged
merged 29 commits into from
Oct 16, 2020
Merged

Implement permutedims #2447

merged 29 commits into from
Oct 16, 2020

Conversation

kescobo
Copy link
Contributor

@kescobo kescobo commented Sep 21, 2020

Figured I'd get a jumpstart on Hacktoberfest, and this is something I've wanted in DataFrames for a longtime. This is an attempt to address #2420

todo

  • correct function signature
  • docs
  • tests

So far

I implemented 3 versions of transpose:

  • transpose1 copies @tbeason 's stack/unstack method from the issue above
  • transpose2 creates a Matrix from everything except the indexing column, and does permutedims on it
  • transpose3 builds column-by-column by iterating rows

Benchmarking

julia> builddf(n) = DataFrame(a = [randstring(6) for _ in 1:n], b = rand(n), c = rand(Int, n), d = repeat(["x", "y"], inner=n÷2))
builddf (generic function with 1 method)

julia> df1 = builddf(10);

julia> df2 = builddf(100);

julia> df3 = builddf(1000);

julia> @btime DataFrames.transpose1($df1);
  33.565 μs (398 allocations: 32.61 KiB)

julia> @btime DataFrames.transpose2($df1);
  12.637 μs (179 allocations: 17.81 KiB)

julia> @btime DataFrames.transpose3($df1);
  14.723 μs (172 allocations: 16.58 KiB)

julia> @btime DataFrames.transpose1($df2);
  144.708 μs (1083 allocations: 115.92 KiB)

julia> @btime DataFrames.transpose2($df2);
  50.667 μs (650 allocations: 69.75 KiB)

julia> @btime DataFrames.transpose3($df2);
  137.353 μs (1261 allocations: 119.12 KiB)

julia> @btime DataFrames.transpose1($df3);
  1.168 ms (7924 allocations: 976.30 KiB)

julia> @btime DataFrames.transpose2($df3);
  469.847 μs (5678 allocations: 643.34 KiB)

julia> @btime DataFrames.transpose3($df3);
  1.513 ms (20035 allocations: 1.27 MiB)

First impressions

  • transpose1 (stack/unstack) is slower on small DataFrames but scales well.
    • This also has the possible disadvantage that the columns end up sorted, rather than preserving the row order
  • transpose2 is the fastest, but might be harder to fix type issues (see below)
  • transpose3 is pretty fast for small DataFrames but doesn't scale well at all

Other thoughts

All 3 versions end up with Any typed columns, regardless of what's in the resulting column (unless all the columns are the same type). This seems like it might be more easily fixed with 1 and 3. The behavior is also not quite what I was expecting if all of the columns are numerical:

julia> df_types = DataFrame(a=[:x, :y, :z], b=rand(3), c=["foo", 2.2, 1])
3×3 DataFrame
│ Row │ a      │ b        │ c   │
│     │ Symbol │ Float64  │ Any │
├─────┼────────┼──────────┼─────┤
│ 1   │ x      │ 0.633028 │ foo │
│ 2   │ y      │ 0.388354 │ 2.2 │
│ 3   │ z      │ 0.134275 │ 1   │

julia> DataFrames.transpose1(df_types)
2×4 DataFrame
│ Row │ variable │ x        │ y        │ z        │
│     │ String   │ Any      │ Any      │ Any      │
├─────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ b        │ 0.633028 │ 0.388354 │ 0.134275 │
│ 2   │ c        │ foo      │ 2.2      │ 1        │

julia> DataFrames.transpose2(df_types)
2×4 DataFrame
│ Row │ variable │ x        │ y        │ z        │
│     │ String   │ Any      │ Any      │ Any      │
├─────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ b        │ 0.633028 │ 0.388354 │ 0.134275 │
│ 2   │ c        │ foo      │ 2.2      │ 1        │

julia> DataFrames.transpose3(df_types)
2×4 DataFrame
│ Row │ variable │ x        │ y        │ z        │
│     │ String   │ Any      │ Any      │ Any      │
├─────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ b        │ 0.633028 │ 0.388354 │ 0.134275 │
│ 2   │ c        │ foo      │ 2.2      │ 1        │

julia> df_floats = DataFrame(a=[:x, :y, :z], b=rand(3), c=rand(3))
3×3 DataFrame
│ Row │ a      │ b        │ c        │
│     │ Symbol │ Float64  │ Float64  │
├─────┼────────┼──────────┼──────────┤
│ 1   │ x      │ 0.941859 │ 0.35769  │
│ 2   │ y      │ 0.404513 │ 0.237959 │
│ 3   │ z      │ 0.649489 │ 0.255324 │

julia> DataFrames.transpose1(df_floats)
2×4 DataFrame
│ Row │ variable │ x        │ y        │ z        │
│     │ String   │ Float64? │ Float64? │ Float64? │
├─────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ b        │ 0.941859 │ 0.404513 │ 0.649489 │
│ 2   │ c        │ 0.35769  │ 0.237959 │ 0.255324 │

julia> DataFrames.transpose2(df_floats)
2×4 DataFrame
│ Row │ variable │ x        │ y        │ z        │
│     │ String   │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ b        │ 0.941859 │ 0.404513 │ 0.649489 │
│ 2   │ c        │ 0.35769  │ 0.237959 │ 0.255324 │

julia> DataFrames.transpose3(df_floats)
2×4 DataFrame
│ Row │ variable │ x        │ y        │ z        │
│     │ String   │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ b        │ 0.941859 │ 0.404513 │ 0.649489 │
│ 2   │ c        │ 0.35769  │ 0.237959 │ 0.255324 │

julia> df_number = DataFrame(a=[:x, :y, :z], b=rand(3), c=rand(Int, 3))
3×3 DataFrame
│ Row │ a      │ b         │ c                   │
│     │ Symbol │ Float64   │ Int64               │
├─────┼────────┼───────────┼─────────────────────┤
│ 1   │ x      │ 0.438549  │ 480681029893134526  │
│ 2   │ y      │ 0.489623  │ 172366260093707646  │
│ 3   │ z      │ 0.0132915 │ 3468722734556270181 │

julia> DataFrames.transpose1(df_number)
2×4 DataFrame
│ Row │ variable │ x          │ y          │ z          │
│     │ String   │ Float64?   │ Float64?   │ Float64?   │
├─────┼──────────┼────────────┼────────────┼────────────┤
│ 1   │ b        │ 0.438549   │ 0.489623   │ 0.0132915  │
│ 2   │ c        │ 4.80681e17 │ 1.72366e17 │ 3.46872e18 │

julia> DataFrames.transpose2(df_number)
2×4 DataFrame
│ Row │ variable │ x          │ y          │ z          │
│     │ String   │ Float64    │ Float64    │ Float64    │
├─────┼──────────┼────────────┼────────────┼────────────┤
│ 1   │ b        │ 0.438549   │ 0.489623   │ 0.0132915  │
│ 2   │ c        │ 4.80681e17 │ 1.72366e17 │ 3.46872e18 │

julia> DataFrames.transpose3(df_number)
2×4 DataFrame
│ Row │ variable │ x          │ y          │ z          │
│     │ String   │ Float64    │ Float64    │ Float64    │
├─────┼──────────┼────────────┼────────────┼────────────┤
│ 1   │ b        │ 0.438549   │ 0.489623   │ 0.0132915  │
│ 2   │ c        │ 4.80681e17 │ 1.72366e17 │ 3.46872e18 │

Still lots to do here, but I thought it would be worth it to get the ball rolling and see if people have any thoughts. I will probably get back to this in earnest next week.

@bkamins bkamins linked an issue Sep 21, 2020 that may be closed by this pull request
@bkamins
Copy link
Member

bkamins commented Sep 21, 2020

Thank you for working on it. To get a safe eltype of the resulting columns using typejoin should be a safe choice I think.

The issues you raise are exactly the things that I expected/feared (i.e. that it would be hard to find the optimal solution)

@bkamins bkamins added feature non-breaking The proposed change is not breaking labels Sep 21, 2020
@bkamins bkamins added this to the 1.x milestone Sep 21, 2020
@kescobo
Copy link
Contributor Author

kescobo commented Oct 2, 2020

Aright, doing a bit more on this.

  1. Changed function signature and included destination col argument
  2. Added some other methods to try out. I put some benchmark code in a gist here

Results are for Any type (a), numerical types including Bool (n), and for dataframes where rows have mixed type (some all numerical, some with strings) for different numbers of rows.

This is basically the same as before.

┌ Warning: a10
└ @ Main ~/repos/DataFramesDev/benchmarks.jl:17
  35.006 μs (401 allocations: 32.67 KiB)
  11.873 μs (160 allocations: 16.30 KiB)
  14.772 μs (166 allocations: 16.39 KiB)
┌ Warning: a100
└ @ Main ~/repos/DataFramesDev/benchmarks.jl:21
  136.785 μs (1086 allocations: 115.98 KiB)
  46.247 μs (541 allocations: 58.39 KiB)
  143.145 μs (1255 allocations: 118.94 KiB)
┌ Warning: a1k
└ @ Main ~/repos/DataFramesDev/benchmarks.jl:25
  1.157 ms (7927 allocations: 976.36 KiB)
  428.862 μs (4669 allocations: 533.55 KiB)
  1.529 ms (20029 allocations: 1.27 MiB)
┌ Warning: n10
└ @ Main ~/repos/DataFramesDev/benchmarks.jl:29
  34.351 μs (361 allocations: 32.05 KiB)
  12.244 μs (140 allocations: 15.98 KiB)
  15.950 μs (166 allocations: 16.39 KiB)
┌ Warning: n100
└ @ Main ~/repos/DataFramesDev/benchmarks.jl:33
  132.626 μs (686 allocations: 109.73 KiB)
  45.631 μs (341 allocations: 55.27 KiB)
  151.399 μs (1255 allocations: 118.94 KiB)
┌ Warning: n1k
└ @ Main ~/repos/DataFramesDev/benchmarks.jl:37
  1.076 ms (3927 allocations: 913.86 KiB)
  429.272 μs (2669 allocations: 502.30 KiB)
  1.633 ms (20029 allocations: 1.27 MiB)
┌ Warning: mr10
└ @ Main ~/repos/DataFramesDev/benchmarks.jl:41
  33.293 μs (407 allocations: 32.77 KiB)
  12.441 μs (160 allocations: 16.30 KiB)
  15.000 μs (166 allocations: 16.39 KiB)
┌ Warning: mr100
└ @ Main ~/repos/DataFramesDev/benchmarks.jl:45
  135.343 μs (1121 allocations: 116.53 KiB)
  46.999 μs (541 allocations: 58.39 KiB)
  143.275 μs (1255 allocations: 118.94 KiB)
┌ Warning: mr1k
└ @ Main ~/repos/DataFramesDev/benchmarks.jl:49
  1.191 ms (8261 allocations: 981.58 KiB)
  438.124 μs (4669 allocations: 533.55 KiB)
  1.546 ms (20029 allocations: 1.27 MiB)

A couple of new observations:

  1. with @nalimilan 's suggestion for copycols for the permuted matrix, there's a moderate speed-up. Though I had to remove the inner copycols argument (for the constructor with the matrix), as it was causing a method error
julia> @btime DataFrames.transpose2($mr1k, copycols=false);
  425.934 μs (4669 allocations: 533.55 KiB)

julia> @btime DataFrames.transpose2($mr1k, copycols=true);
  459.766 μs (5672 allocations: 643.16 KiB)

With both copycols:

ERROR: MethodError: no method matching DataFrame(::Matrix{Any}, ::Vector{String}; copycols=true)
Closest candidates are:
  DataFrame(::AbstractMatrix{T} where T, ::AbstractVector{var"#s25"} where var"#s25"<:AbstractString; makeunique) at /home/kevin/repos/DataFramesDev/dev/DataFrames.jl/src/dataframe/dataframe.jl:272 got unsupported keyword argument "copycols"
  1. I made a comprehension version of both transpose2 and transpose3 to help with types for mixed row columns. Not surprisingly, this slows down t2 (though it's still faster than t3), but I was surprised that it actually speeds up t3. That is, doing Vector(row) is slower than [x for x in row]
julia> @btime DataFrames.transpose2_comprehension($mr100);
  113.745 μs (1659 allocations: 116.34 KiB)

julia> @btime DataFrames.transpose2($mr100);
  48.217 μs (541 allocations: 58.39 KiB)

julia> @btime DataFrames.transpose3($mr100);
  144.665 μs (1255 allocations: 118.94 KiB)

julia> @btime DataFrames.transpose3_comprehension($mr100);
  125.924 μs (2356 allocations: 168.48 KiB)

Both give columns of Real when the row has integers and floats, and Any when there are Strings in the row.

I'm not totally clear on where to add the promote or typejoin calls for transpose2. I tried a couple of places, and it doesn't seem to alter the output type of the columns. The promotion already seems to happen with the Matrix conversion, and can't be fixed if there are Strings in the matrix:

julia> n10
10×4 DataFrame
│ Row │ a      │ b         │ c                    │ d    │
│     │ String │ Float64   │ Int64                │ Bool │
├─────┼────────┼───────────┼──────────────────────┼──────┤
│ 1   │ a17Wik │ 0.356758  │ -8986863187679119824 │ 0    │
│ 2   │ os6kQD │ 0.354969  │ -1901978611141674116 │ 0    │
│ 3   │ PoiIab │ 0.769343  │ 4759929156393762244  │ 1    │
│ 4   │ 2XDnn1 │ 0.0322987 │ 2546570589132666494  │ 0    │
│ 5   │ ZIx6QI │ 0.560122  │ 6388359009763183408  │ 1    │
│ 6   │ MO1iLD │ 0.194934  │ -1409080130674345575 │ 0    │
│ 7   │ zrhAak │ 0.981582  │ -2685183505188918570 │ 0    │
│ 8   │ RZtUU1 │ 0.61182   │ -3456256066328160472 │ 1    │
│ 9   │ eoMStL │ 0.024695  │ 8468153461503185050  │ 1    │
│ 10  │ rar8je │ 0.324927  │ -983327689695525853  │ 1    │

julia> eltype(Matrix(n10[!, Not(:a)]))
Float64

I had some ideas with promote, but it throws if there are any strings encountered, and checking for mismatched types seems like it would eat any performance benefits of the permuted matrix approach.

julia> promote([1, 1., "str"]...)
ERROR: promotion of types Int64, Float64 and String failed to change any arguments

Very possible I'm missing something.

@bkamins
Copy link
Member

bkamins commented Oct 2, 2020

The promotion already seems to happen with the Matrix conversion

Yes. To avoid promotion you would have to write something like Matrix{Any}(...).

So the conclusion is that Matrix is fastest, right? Then I would go for it and add an additional kwarg promote_type::Bool. If yes - then call Matrix (and then comprehension is not needed), if false then call Matrix{Any} (and use comprehension). Just please make sure to pass Tables.table wrapped matrix to the constructor (cf. #2464)

In general an input from @nalimilan would be welcome on naming kwargs in this function (he is far superior than me in such things).

@kescobo
Copy link
Contributor Author

kescobo commented Oct 2, 2020

So the conclusion is that Matrix is fastest, right?

Yes, by a fair amount.

Then I would go for it and add an additional kwarg promote_type::Bool. If yes - then call Matrix (and then comprehension is not needed), if false then call Matrix{Any} (and use comprehension).

This makes sense to me. Do you have a sense of whether the default for promotion should be true or false?

@bkamins
Copy link
Member

bkamins commented Oct 2, 2020

Do you have a sense of whether the default for promotion should be true or false?

Not sure - @nalimilan seemed to prefer true, but for me intuitively false would be safer.

Having said that I would go for true as I assume that most of the time when you transpose the data frame holds homogenous data in the transformed columns.

@kescobo
Copy link
Contributor Author

kescobo commented Oct 2, 2020

Open questions

  1. name and default for promote_type kwarg
  2. src_colname and dest_colname should be positional, right?
  3. I'm also wondering if dest_colname is necessary. I'd be in favor of having this be the name of the column that gets indexed on, and have users need to explicitly rename
  4. Should copycols always be false? The matrix gets allocated either way, and so there's no safety issue, right?

I wrote some docs and tests, but will wait to finalize until the above are answered. Any additional feedback appreciated

@kescobo
Copy link
Contributor Author

kescobo commented Oct 2, 2020

I'm getting some weird segfaults trying to test locally that don't appear to be happening on CI. Is it possible that Revise is causing issues?

@bkamins
Copy link
Member

bkamins commented Oct 3, 2020

I'd be in favor of having this be the name of the column that gets indexed on, and have users need to explicitly rename.

It is necessary as the name of src_colname might conflict with the column names automatically generated. Assume src_colname has name :a and contains a vector [:a, :b, :c] for instance.

Is it possible that Revise is causing issues?

Not sure - it should not be an issue.

@kescobo
Copy link
Contributor Author

kescobo commented Oct 3, 2020

Great points, thanks for review! I will try to address early next week 👍

@nalimilan
Copy link
Member

Note that in theory promote_type::Bool isn't enough to cover all cases, as several behaviors are possible:

  • choose the element type by calling promote_type on the column's eltypes
  • choose the element type by calling typejoin on the column's eltypes
  • choose the element type by calling collect or a comprehension on the columns to take into account the type of the values they actually contain (this calls Base.promote_typejoin under the hood; we don't have a way to do the same thing using promote_type at the moment)

I'm not sure how useful the second option could be in practice, but we could consider supporting it too, for example by allowing to pass any function to compute the type to an argument called eltype. Not sure how the last behavior would be specified, but many options are possible (symbol, nothing, missing, collect...).

@bkamins
Copy link
Member

bkamins commented Oct 3, 2020

Not sure how the last behavior would be specified, but many options are possible

So maybe let us make a promote::Symbol kwarg, now implement what we feel we know we want, and this will allow to add more options later?

@bkamins
Copy link
Member

bkamins commented Oct 3, 2020

I'd be more included to overload permutedims

This is what I thought you would prefer (that is why I have raised it). Let us then switch to permutedims (though it is a pity that transform cannot be used).

@kescobo
Copy link
Contributor Author

kescobo commented Oct 5, 2020

Let us then switch to permutedims (though it is a pity that transform cannot be used).

Agree it's a pity, but makes sense. Main downside from my perspective is discoverability - what do we think about defining something like:

Base.transpose(::AbstractDataFrame, args...; kwargs...) = MethodError("transpose not defined for AbstractDataFrames. Try permutedims instead")

@bkamins
Copy link
Member

bkamins commented Oct 5, 2020

I think it is a good idea.

m = permutedims(Matrix(df_notsrc))
df_tmp = rename!(DataFrame(Tables.table(m)), df[!, src_namescol], makeunique=makeunique)
end
return hcat!(df_permuted, df_tmp, copycols=false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds OK. Though people may be surprised if they do e.g. permutedims(DataFrame(x=[1, 2], y=[2, 3])) and get an error about src_colnames as they didn't specify it. Could be worth having a specific error message for that method.

@bkamins
Copy link
Member

bkamins commented Oct 13, 2020

src_colnames

On a second thought maybe requiring to pass it (like the proposal @nalimilan made for DataFrame constructor with matrix today) would be a better solution. Writing permutedims(df, 1) is not that longer and does not require much more thought than permutedims(df).

Especially that it now occurred to me that in the future we might want permutedims(df) to do something different - e.g. take all columns and generate dummy column names like x1, x2, etc.? What do you think?

@kescobo
Copy link
Contributor Author

kescobo commented Oct 13, 2020

Especially that it now occurred to me that in the future we might want permutedims(df) to do something different - e.g. take all columns and generate dummy column names like x1, x2, etc.? What do you think?

You mean like

permutedims(DataFrame(a=["x", "y, "z"], b=[1, 2, 3])) == DataFrame(x1=["x", 1], x2=["y", 2], x3=["z", 3])

Or should the column names also become a new column?

@bkamins
Copy link
Member

bkamins commented Oct 13, 2020

Or should the column names also become a new column?

I am not sure - this is exactly what I was thinking of and both options seemed plausible. The point is that we might change our minds what is good in the future so it is better to be conservative now (have a look at #2464 where we started discussing things that worked for years and no one complained about them).

@kescobo
Copy link
Contributor Author

kescobo commented Oct 13, 2020

The point is that we might change our minds what is good in the future so it is better to be conservative now (have a look at #2464 where we started discussing things that worked for years and no one complained about them).

TBH I don't 100% follow the relevance of that PR. Are you just saying that in the PR you're working on "fixing" things that have been around for years that people didn't complain about?

Regardless, I definitely agree that being conservative is the right approach. But I'm not sure what the conservative version is in this case. Should we include a keepcolnames::Bool keyword to control this behavior?

Just so I'm clear, the new behavior being proposed is:

  • permutedims(df, [colnames]): permute all columns to rows, give dummy col names unless vector of names is provided. Open question about whether names(df) becomes a row or is lost.
  • permutedims(df, indexcol::ColumnIndex, [dest_colname]): current behavior (that is, pivot on indexcol, option to give a different name to resulting first column)

@bkamins
Copy link
Member

bkamins commented Oct 13, 2020

Are you just saying that in the PR you're working on "fixing" things that have been around for years that people didn't complain about?

No, I am saying that it is hard to get consensus what should be the behaviour of some functions.

@bkamins
Copy link
Member

bkamins commented Oct 13, 2020

Just so I'm clear, the new behavior being proposed is:

What I am considering if we should allow permutedims(df) at all. Maybe at this stage it should be disallowed. We can decide what it means later.

@kescobo
Copy link
Contributor Author

kescobo commented Oct 14, 2020

Ahh, got it. That makes sense. Ok, will take another look tomorrow

@kescobo kescobo changed the title [WIP] Implement permutedims Implement permutedims Oct 15, 2020
Copy link
Member

@bkamins bkamins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good except for three places in the documentation where I left comments. Thank you!

@kescobo kescobo mentioned this pull request Oct 15, 2020
@bkamins
Copy link
Member

bkamins commented Oct 15, 2020

@nalimilan - I am OK to merge this.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks good apart from a few small suggestions.

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins
Copy link
Member

bkamins commented Oct 15, 2020

I will merge the PR once the CI passes. Thank you!

@kescobo
Copy link
Contributor Author

kescobo commented Oct 16, 2020

Woo! Thank you both for thoughtful and detailed review! Really appreciate it :-)

@bkamins bkamins merged commit e07b08d into JuliaData:master Oct 16, 2020
@bkamins
Copy link
Member

bkamins commented Oct 16, 2020

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature hacktoberfest-accepted non-breaking The proposed change is not breaking
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add transpose(df, src_namescol, dst_namescol)
3 participants