-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] Base.vcat AbstractDataFrame should rely on Base.vcat for columns #1118
Conversation
…iaData#1042) Add compatibility with pre-contrasts ModelFrame constructor
…ise for speed improvement (JuliaData#1070)
Completely remove support for DataArrays.
This depends on PRs moving these into NullableArrays.jl. Also use isequal() instead of ==, as the latter is in Base and unlikely to change its semantics.
groupby() did not follow the order of levels, and wasn't robust to reordering levels. Add tests for corner cases.
Use the fallbacks for now, should be added back after JuliaData/CategoricalArrays.jl#12 is fixed.
Not sure what I meant by this. If it was really serious, we'll discover it sooner or later.
This is a much more general issue (JuliaStats/NullableArrays.jl#85) which can be tackled later.
For now, preserve the current semantics: conversion to NullableArray does not happen via insert!().
Again a broader issue which doesn't particularly affect DataFrames. Cf. JuliaStats/NullableArrays.jl#143
Better handle that separately.
Shorter written that way for now. Filed as JuliaStats/NullableArrays.jl#144.
This depends on a CategoricalArrays change by which levels are sorted when creating the array.
There's no inconsistency here: when the input is a Matrix, there's no point in returning a NullableArray. Anyway, these are test methods.
We don't have to handle this right now.
Keep this in DataFrames for now, renaming it to the more explicit sharepools(). Also relax signatures to accept non-Nullable categorical arrays.
These were not exercized by the tests, and the use case for them isn't obvious. (They were formerly methods of DataArrays.PooledDataArray().)
For NullableArrays, even current git master is not enough at this time.
Tests pass, but the Nullable{Any} results could be annoying for users.
New type merging NominalArray and OrdinalArray in 0.0.5.
These shouldn't live in DataFrames.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, it would be good to remove special-casing and rely only on vcat
.
if haskey(df, colnam) | ||
copy!(col, i, df[colnam]) | ||
nrows = sum(nrow, dfs) | ||
for colnam in unique([(names(e) for e in dfs)...;]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid splatting on an arbitrary number of arguments when possible. Here, better use unique(Base.flatten(names.(df)))
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I didn't know about Base.flatten
. And I get to use the new cool f.(x)
-notation!
nrows = sum(nrow, dfs) | ||
for colnam in unique([(names(e) for e in dfs)...;]) | ||
k = Bool[haskey(e, colnam) for e in dfs] | ||
c = vcat((e[colnam] for e in view(dfs, k))...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
view
is probably not worth it as dfs
is just a short vector of references. Could use a if
guard instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't follow, could you show me?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like vcat((dfs[i][colnam] for i in 1:length(dfs) if k[i])...)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I didn't know you could use if
in a for
-generator
if all(k) | ||
col = c | ||
else | ||
col = if _isnullable(c) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better assign to col
separately within each branch. Also, wrong indentation below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean that I should assign to res[colnam]
separately within each branch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was just a stylistic comment to avoid col = if
.
test/grouping.jl
Outdated
a[:x] = compact(a[:x]) | ||
b[:x] = compact(b[:x]) | ||
r = vcat(a,b) | ||
@test isequal(r, DataFrame(x=[categorical(1:200);categorical(100:300)])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add spaces after commas and semicolons here and below.
copy!(col, i, df[colnam]) | ||
nrows = sum(nrow, dfs) | ||
for colnam in unique([(names(e) for e in dfs)...;]) | ||
k = Bool[haskey(e, colnam) for e in dfs] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could find a more explicit/logical names than k
, e
and c
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still could use more explicit names.
i = 1 | ||
j = 1 | ||
for df in dfs | ||
if haskey(df, colnam) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could reuse k
here.
j = 1 | ||
for df in dfs | ||
if haskey(df, colnam) | ||
copy!(col, i, view(c, j:j+nrow(df)-1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need view
as copy!
can take offsets and number of elements to copy directly.
col = c | ||
else | ||
col = if _isnullable(c) | ||
similar(c, nrows) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Choosing the column type that way isn't correct. Until JuliaLang/julia#18472 is fixed, I think we should be able to find out what's the most appropriate return type by calling Base.return_types
on vcat
and the types of the input columns. If inference fails, we could fall back to allocating an empty Array
of the required length when the column is missing, and calling vcat
on all columns; it would be wasteful, but that would only happen with broken types (and until we have a better mechanism).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't follow. vcat
decides the column type, I'm just allocating a larger array. If vcat settled on something that couldn't cope with nulls I add that. Maybe I'm missing the point and you're actually disagreeing with the whole wasteful approach of allocating columns "twice"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I misread the code. So indeed that's correct, just suboptimal. You could use Base.return_types
as I described to avoid doing this when inference works, and only fall back to it when it fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this what you're suggesting?
cols = (dfs[i][colnam] for i in 1:length(dfs) if k[i])
T = Base.return_types(vcat, Base.typesof(c...)
if T <: Union
# implementation so far in this PR
r = vcat(cols)
...
else
T = makeTsupportnull(T)
r = T(N)
# don't use vcat and just do `copy!` to fill r?
...
end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, more or less, though in the second branch, I think I'd call vcat
on existing columns as you do now for consistency. Also, you'll have to take the first result from return_types
, and T<:Union
should be isleaftype(T)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you do r = vcat(cols)
in both branches of if isleafttype(T)
they become identical so you don't need the branch in the first place. No?
I thought the problem was that you wanted some other way to allocate an array with a nullable eltype? Is that related? What benefit does return_types
have over similar
?
I don't know what you want to achieve so my questions become rather random.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry if I wasn't clear. The goal is to avoid an unneeded allocation. My idea was that if isleaftype(T)
, then you can create the (empty) final array, and fill it directly with copy!
. To be sure you can have missing values, add NullableArray{promote_type(eltypes.(c)...)}
to the types of existing columns.
If !isleaftype(T)
, then the only way to find out the type of that array is to create it by actually calling vcat
. Since you need to do that, you can as well create empty NullableArrays
of the needed length and add them to the vcat
call so that it gives the final array directly. But honestly, I don't expect this branch to happen often in practice; we could almost raise an error.
Does that make sense?
Implemented your comments, thanks. You're right that the return type is mostly stable. Meaning we'll only use I only found one example in the unit tests where this fails though and
This is an example where the return type of |
This PR depends on JuliaStats/NullableArrays.jl#152 |
Thanks, will give a longer look later.
Sounds like we should change this to be type-stable. I guess the problem is in the method defined in CategoricalArrays.jl? |
@nalimilan the example where the return type of vcat depend on the order of arguments comes from |
Related; CategoricalArrays might need a special |
end | ||
res | ||
end | ||
c = ((typeof(dfs[i][colnam]) for i in 1:length(dfs) if k[i])...,) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trailing comma isn't needed, right?
end | ||
if length(C)==1 && isleaftype(C[1]) | ||
if _isnullable(C[1]) | ||
NC = C[1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Four-space indent.
end | ||
|
||
_isnullable{T}(::AbstractArray{T}) = T <: Nullable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better define these above since they're used there.
OK, I guess that's good enough for now. Looks like it's going to be changed pretty soon by https://github.com/JuliaLang/julia/pull/16740/files#diff-2264bb51acec4e7e2219a3cb1c733651R1186. Though we'll still need a mechanism to promote the input types (JuliaLang/julia#18472). Since you're now familiar with
It seems that indeed CategoricalArray also likely need specific methods so that concatenation with |
Since we don't handle |
end | ||
i += size(df, 1) | ||
nrows = sum(nrow, dfs) | ||
for colnam in unique(Base.flatten(names.(dfs))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
flatten
lives under Base.Iterators
on Julia 0.6. You can just test this using isdefined
and do using Base: flatten
or using Base.Iterators: flatten
at the top of the file.
end | ||
end | ||
else | ||
# warn("Unstable return types: ", C, " from vcat of ", [typeof(dfs[i][colnam]) for i in 1:length(dfs) if k[i]]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you remove this once it's ready?
OK. I'd rather implement There's still a failure which appears to be real (besides the |
Is there anything blocking here? |
@nalimilan If you mean blocking a merge, your review still says "requested changes" and all checks failed on all platforms... |
I just wondered whether there were hard issues to tackle to get this to a mergeable state. |
@gustafsson Can you revise this so that we merge it? |
Actually, this cannot be merged until Base provides a mechanism for |
@cjprybol Do you think we still need this, or did you implement it in DataTables? (See the last four commits.) |
I think everything here should be covered by what we merged in JuliaData/DataTables.jl#45. And until something like JuliaLang/julia#20815 is implemented in Base, we can't rely on Base.vcat anyway, so this looks like it follows similar logic to what we have implemented currently. |
Thanks for your efforts here @gustafsson! Sorry it got left behind while we were experimenting in the DataTables package |
This creates unnecessary copies of columns that doesn't exist in all concatenated dataframes. But uses Base.vcat to let each array type decide what vcat means.
Related #990