-
Notifications
You must be signed in to change notification settings - Fork 371
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BREAKING] deprecate DataFrame constructors #2464
Conversation
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
3a45426
to
acff8d0
Compare
Yes, my view is that it’s best to reduce the number of methods before 1.0.
It can always be added back later.
…On Sat, Oct 3, 2020 at 12:43 PM Bogumił Kamiński ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In src/deprecated.jl
<#2464 (comment)>
:
> + makeunique=makeunique, copycols=copycols)
***@***.*** DataFrame(columns::NTuple{N, AbstractVector}, cnames::NTuple{N, AbstractString}; makeunique::Bool=false,
+ copycols::Bool=true) where {N} DataFrame(collect(AbstractVector, columns), [Symbol(c) for c in cnames];
+ makeunique=makeunique, copycols=copycols)
***@***.*** DataFrame(columns::NTuple{N, AbstractVector};
+ copycols::Bool=true) where {N} DataFrame(collect(AbstractVector, columns),
+ gennames(length(columns)), copycols=copycols)
***@***.*** DataFrame(columns::AbstractMatrix, cnames::AbstractVector{Symbol} = gennames(size(columns, 2));
+ makeunique::Bool=false) DataFrame(AbstractVector[columns[:, i] for i in 1:size(columns, 2)],
+ cnames; makeunique=makeunique, copycols=false)
+
***@***.*** DataFrame(columns::AbstractMatrix, cnames::AbstractVector{<:AbstractString};
+ makeunique::Bool=false) DataFrame(AbstractVector[columns[:, i] for i in 1:size(columns, 2)],
+ Symbol.(cnames); makeunique=makeunique, copycols=false)
+
+function DataFrame(column_eltypes::AbstractVector{T}, cnames::AbstractVector{Symbol},
@matthieugomez <https://github.com/matthieugomez> - is 👍 for dropping
it? (just to be sure as I tried to be neutral with the question)
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#2464 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABPPPXLHHIREW3HPOSPINKTSI55HFANCNFSM4SAJGAXQ>
.
|
OK - so the decision is (unless I grossly misunderstood something 😄): we keep the current proposal (including deprecating I will wait for 1-2 days in case more comments come in and then update the whole package not to use deprecated functionality (mainly tests). |
It should be good for a review. |
@rofinn - can you please check if with this PR your use case works as expected? |
Yep, looks good to me. I only really use the tables or pairs constructors anyways. 👍 |
Last commit follows the discussion in JuliaData/Tables.jl#204 and fixes embarrassing cases we had:
|
This should be good for a final check before merging. |
src/other/tables.jl
Outdated
if x isa AbstractVector && all(col -> isa(col, AbstractVector), x) | ||
if !Tables.istable(x) && x isa AbstractVector && !isempty(x) | ||
# here we handle the cases that are accepted by standard DataFrame constructors | ||
# but eltype(x) is more flexible than assumed in these methods |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really understand this comment. Do you mean that here we handle cases where eltype(x)
doesn't match what other (more standard) constructors accept?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly. What we handle is essentially Any[...]
case (or some weird Union{...}[...]
) as it will not be handled by standard constructors.
It was not very clear earlier and this is essentially what this code path is for only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I think what confused me is "that are accepted by standard DataFrame constructors", since these are not really accepted, or they wouldn't have to be dealt with in this method. :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you - as usual - propose a better wording (or what we have now is OK?). Thank you!
test/broadcasting.jl
Outdated
@test (df .+ df) ./ 2 !== df | ||
@test df .+ Matrix(df) == 2 .* df | ||
@test Matrix(df) .+ df == 2 .* df | ||
@test (Matrix(df) .+ df .== 2 .* df) == DataFrame(trues(size(df)), names(df)) | ||
@test (Matrix(df) .+ df .== 2 .* df) == DataFrame(Tables.table(trues(size(df)), header=names(df))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks at these examples, it feels unfortunate to lose the symmetry between Matrix(df)
and DataFrame(mat)
, especially given the fact that broadcasting between data frames and matrices is allowed.
I think it makes sense to have some asymmetry, since Matrix(df)
can just drop names, but DataFrame(mat)
needs to generate names: so requiring DataFrame(mat, colnames)
isn't illegitimate. But DataFrame(Tables.table(df, header=colnames))
really breaks that symmetry (and it's really verbose as we see here). Maybe we could keep DataFrame(mat, colnames)
, since in that case there's no ambiguity (mat
is treated as a matrix, not as a Table.jl object).
(Sorry for suggesting this now. I can adjust the tests if you like.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am OK with this (I can fix the tests - it should be quick this way as we have Tables.table
in all places we need to fix 😄).
@oxinabox + @pdeffebach : can you comment here please?
The proposal is to have DataFrame(::AbstractMatrix, ::Union{AbstractVector{Symbol}, AbstractVector{<:AbstractString})
signature (with no default for the second positional argument, so it would be required).
Then there would be no ambiguity with DataFrame(table)
constructor that can handle matrices as tables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once again, I stand by my claim that every extra constructor adds confusion.
For this case i think nicer than using Tables.jl is:
DataFrame(names .=> eachcol(mat))
that is a nice clear explict statement about what is going on, and is I think clearer than
DataFrame(mat, names)
since we are explictly stating that we are going to use the columns one per name.
Also naturally extends to:
DataFrame(names .=> vec.(eachrow(mat))
for constructing dataframes based on the rows of the matrix.
Particularly relevent as sometimes we do observations as columns, e.g. Distances.jl, the Models.jl interface, some others.
Constructing DataFrames from a matrix is in general a weird thing to do outside of toy examples
because a typical dataframe has a mixture of different types of columns,
and a typical matrix does not.
I don't think we should go out of our way to encourage people writing unrealistic toy examples, by making it easier
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nalimilan - @oxinabox has convinced me. Writing DataFrame(names .=> eachcol(mat))
is indeed clean. We have to use Tables.table
for now because eachcol
is not supported in Julia 1.0, but once we move the support to new Julia LTS we can clean it up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I find problematic is that by supporting Matrix(df)
and df .+ mat
, we consider that data frames and matrices are objects with compatible shapes. This alone means that we exclude the view of observations as columns (which BTW is the exception in the ecosystem, even in Distances you now have to explicitly give dims
). Yet DataFrame(names .=> eachcol(mat))
conceptually requires you to decompose the matrix into columns to create a data frame. The notion that shapes are compatible is completely lost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* `DataFrame` constructor is about being a table (which in particular implies having column names)
I don't think this rule holds in general, does it? It's only true for the one-argument constructor (and even in that case, not e.g. for the ::GroupedDataFrame
one). For the two-argument constructor, we don't actually allow tables as the first argument. So it would be OK to accept matrices (just like we accept vectors of vectors).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this rule holds in general, does it?
We have to have some minimal set of constructors. For instance I hate DataFrame(a=[1,2], b=[2,3])
constructor as it is inconsistent with the general design we have, but I think we should keep it as it is convenient. How I understand @oxinabox is that he wants to limit the number of special cases as much as possible, and it seems that for matrices DataFrame(names .=> eachcol(matrix))
is an easy enough pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For instance I hate DataFrame(a=[1,2], b=[2,3]) constructor as it is inconsistent with the general design we have
I was going to suggest deprecating that as well; for DataFrame(:a=>[1,2], ...)
😂
because getting rid of it would mean we are free to add what ever keywords we wanted to the constructor later without clashing with column names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have this clash already with copycols
and makeunique
kwargs which are handled inconsistently:
julia> DataFrame(makeunique=2, copycols=false)
1×1 DataFrame
│ Row │ makeunique │
│ │ Int64 │
├─────┼────────────┤
│ 1 │ 2 │
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was going to suggest deprecating that as well; for DataFrame(:a=>[1,2], ...)
Oh, is that an option here? I'm 100% on board with that change 😸
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
…recate_constructors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Looks mostly good I'd say.
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
I have applied all the suggestions. It should be good for another look at. |
CI passes cleanly (I am testing against current master with all the changes made since this PR was started) |
test/constructors.jl
Outdated
@@ -337,4 +344,11 @@ end | |||
end | |||
end | |||
|
|||
@testset "Dict constructor corner case" begin | |||
@test DataFrame(Dict('a' => 1, true => 2)) == DataFrame("a" => 1, "true" => 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we really allow this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was the original design. Actually I thought that maybe it is best to allow only string or Symbol
as keys. It is last time to change it :). Do we go this way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds safer -- as usual. :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK - I will change this then. Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done - now we are as strict as with Pair
constructors (except for sorting when Dict
is passed).
CI passes, only coverage fails |
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Thank you! I will merge this once the CI passes. |
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Thank you! |
Hi all, I am late to this discussion.
Reading through the docs and the source code, I was not able to find a suitable replacement: So, my question is if there is a suitable replacement I haven't seen, or if there is a recommended workflow I could implement instead. Or if you could revive at least one constructor that accepts a Type declaration for columns. |
How about this? I think it's decently intuitive
|
or just e.g.:
|
Thanks, that is really helpful. |
Yes only |
broke our code, probably major refactoring needed JuliaData/DataFrames.jl#2464
broke our code, probably major refactoring needed JuliaData/DataFrames.jl#2464
Fixes #2433. Before I change tests and documentation please review if we agree with what is proposed to get deprecated. Thank you.