Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Base.vcat AbstractDataFrame should rely on Base.vcat for columns #1118

Closed
wants to merge 79 commits into from

Conversation

gustafsson
Copy link
Contributor

This creates unnecessary copies of columns that doesn't exist in all concatenated dataframes. But uses Base.vcat to let each array type decide what vcat means.

Related #990

Gord Stephen and others added 30 commits September 14, 2016 10:13
…iaData#1042)

Add compatibility with pre-contrasts ModelFrame constructor
Completely remove support for DataArrays.
This depends on PRs moving these into NullableArrays.jl.
Also use isequal() instead of ==, as the latter is in Base and
unlikely to change its semantics.
groupby() did not follow the order of levels, and wasn't robust to reordering
levels. Add tests for corner cases.
Use the fallbacks for now, should be added back after
JuliaData/CategoricalArrays.jl#12 is fixed.
Not sure what I meant by this. If it was really serious, we'll discover
it sooner or later.
This is a much more general issue (JuliaStats/NullableArrays.jl#85) which
can be tackled later.
For now, preserve the current semantics: conversion to NullableArray
does not happen via insert!().
Again a broader issue which doesn't particularly affect DataFrames. Cf.
JuliaStats/NullableArrays.jl#143
Better handle that separately.
Shorter written that way for now. Filed as JuliaStats/NullableArrays.jl#144.
This depends on a CategoricalArrays change by which levels are sorted when
creating the array.
There's no inconsistency here: when the input is a Matrix, there's no
point in returning a NullableArray. Anyway, these are test methods.
We don't have to handle this right now.
Keep this in DataFrames for now, renaming it to the more explicit
sharepools(). Also relax signatures to accept non-Nullable categorical arrays.
These were not exercized by the tests, and the use case for them isn't obvious.
(They were formerly methods of DataArrays.PooledDataArray().)
For NullableArrays, even current git master is not enough at this time.
Tests pass, but the Nullable{Any} results could be annoying for users.
New type merging NominalArray and OrdinalArray in 0.0.5.
These shouldn't live in DataFrames.
@gustafsson gustafsson changed the title Base.vcat AbstractDataFrame should rely on Base.vcat for columns [RFC] Base.vcat AbstractDataFrame should rely on Base.vcat for columns Nov 1, 2016
Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, it would be good to remove special-casing and rely only on vcat.

if haskey(df, colnam)
copy!(col, i, df[colnam])
nrows = sum(nrow, dfs)
for colnam in unique([(names(e) for e in dfs)...;])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid splatting on an arbitrary number of arguments when possible. Here, better use unique(Base.flatten(names.(df))).

Copy link
Contributor Author

@gustafsson gustafsson Nov 3, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I didn't know about Base.flatten. And I get to use the new cool f.(x)-notation!

nrows = sum(nrow, dfs)
for colnam in unique([(names(e) for e in dfs)...;])
k = Bool[haskey(e, colnam) for e in dfs]
c = vcat((e[colnam] for e in view(dfs, k))...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

view is probably not worth it as dfs is just a short vector of references. Could use a if guard instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow, could you show me?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like vcat((dfs[i][colnam] for i in 1:length(dfs) if k[i])...).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I didn't know you could use if in a for-generator

if all(k)
col = c
else
col = if _isnullable(c)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better assign to col separately within each branch. Also, wrong indentation below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean that I should assign to res[colnam] separately within each branch?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was just a stylistic comment to avoid col = if.

test/grouping.jl Outdated
a[:x] = compact(a[:x])
b[:x] = compact(b[:x])
r = vcat(a,b)
@test isequal(r, DataFrame(x=[categorical(1:200);categorical(100:300)]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add spaces after commas and semicolons here and below.

copy!(col, i, df[colnam])
nrows = sum(nrow, dfs)
for colnam in unique([(names(e) for e in dfs)...;])
k = Bool[haskey(e, colnam) for e in dfs]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could find a more explicit/logical names than k, e and c.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still could use more explicit names.

i = 1
j = 1
for df in dfs
if haskey(df, colnam)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could reuse k here.

j = 1
for df in dfs
if haskey(df, colnam)
copy!(col, i, view(c, j:j+nrow(df)-1))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need view as copy! can take offsets and number of elements to copy directly.

col = c
else
col = if _isnullable(c)
similar(c, nrows)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Choosing the column type that way isn't correct. Until JuliaLang/julia#18472 is fixed, I think we should be able to find out what's the most appropriate return type by calling Base.return_types on vcat and the types of the input columns. If inference fails, we could fall back to allocating an empty Array of the required length when the column is missing, and calling vcat on all columns; it would be wasteful, but that would only happen with broken types (and until we have a better mechanism).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow. vcat decides the column type, I'm just allocating a larger array. If vcat settled on something that couldn't cope with nulls I add that. Maybe I'm missing the point and you're actually disagreeing with the whole wasteful approach of allocating columns "twice"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I misread the code. So indeed that's correct, just suboptimal. You could use Base.return_types as I described to avoid doing this when inference works, and only fall back to it when it fails.

Copy link
Contributor Author

@gustafsson gustafsson Nov 3, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this what you're suggesting?

cols = (dfs[i][colnam] for i in 1:length(dfs) if k[i])
T = Base.return_types(vcat, Base.typesof(c...)
if T <: Union
   # implementation so far in this PR
   r = vcat(cols)
   ...
else
   T = makeTsupportnull(T)
   r = T(N)
   # don't use vcat and just do `copy!` to fill r?
   ...
end

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, more or less, though in the second branch, I think I'd call vcat on existing columns as you do now for consistency. Also, you'll have to take the first result from return_types, and T<:Union should be isleaftype(T).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you do r = vcat(cols) in both branches of if isleafttype(T) they become identical so you don't need the branch in the first place. No?

I thought the problem was that you wanted some other way to allocate an array with a nullable eltype? Is that related? What benefit does return_types have over similar?

I don't know what you want to achieve so my questions become rather random.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry if I wasn't clear. The goal is to avoid an unneeded allocation. My idea was that if isleaftype(T), then you can create the (empty) final array, and fill it directly with copy!. To be sure you can have missing values, add NullableArray{promote_type(eltypes.(c)...)} to the types of existing columns.

If !isleaftype(T), then the only way to find out the type of that array is to create it by actually calling vcat. Since you need to do that, you can as well create empty NullableArrays of the needed length and add them to the vcat call so that it gives the final array directly. But honestly, I don't expect this branch to happen often in practice; we could almost raise an error.

Does that make sense?

@gustafsson
Copy link
Contributor Author

Implemented your comments, thanks. You're right that the return type is mostly stable. Meaning we'll only use vcat to find the array type, but not for concatenating data.

I only found one example in the unit tests where this fails though and return_types(vcat, c) doesn't give a leaf type. From cat.jl:121 Base.return_types get's called like this:

Base.return_types(vcat, (NullableCategoricalArray{Int64,1,UInt32}, NullableArray{Int64,1}))
1-element Array{Any,1}:
 CategoricalArrays.NullableCategoricalArray{T,N,R<:Integer}

This is an example where the return type of vcat depend on the order of arguments. In this case it returns a NullableCategoricalArray{Int64,1,UInt32}.

@gustafsson
Copy link
Contributor Author

This PR depends on JuliaStats/NullableArrays.jl#152

@nalimilan
Copy link
Member

Thanks, will give a longer look later.

This is an example where the return type of vcat depend on the order of arguments. In this case it returns a NullableCategoricalArray{Int64,1,UInt32}.

Sounds like we should change this to be type-stable. I guess the problem is in the method defined in CategoricalArrays.jl?

@gustafsson
Copy link
Contributor Author

@nalimilan the example where the return type of vcat depend on the order of arguments comes from Base.typed_vcat which uses similar on its first argument.

@gustafsson
Copy link
Contributor Author

Related; CategoricalArrays might need a special vcat(V1::CategoricalArray,V::AbstractVector...) that either creates a regular Array or arrange so that a[pos:p1] = Vk in typed_vcat uses the equivalent of CategoricalArrays.copy! somehow. But's that's really an issue for CategoricalArrays not DataFrames.

end
res
end
c = ((typeof(dfs[i][colnam]) for i in 1:length(dfs) if k[i])...,)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trailing comma isn't needed, right?

end
if length(C)==1 && isleaftype(C[1])
if _isnullable(C[1])
NC = C[1]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Four-space indent.

end

_isnullable{T}(::AbstractArray{T}) = T <: Nullable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better define these above since they're used there.

@nalimilan
Copy link
Member

@nalimilan the example where the return type of vcat depend on the order of arguments comes from Base.typed_vcat which uses similar on its first argument.

OK, I guess that's good enough for now. Looks like it's going to be changed pretty soon by https://github.com/JuliaLang/julia/pull/16740/files#diff-2264bb51acec4e7e2219a3cb1c733651R1186.

Though we'll still need a mechanism to promote the input types (JuliaLang/julia#18472). Since you're now familiar with vcat, would you experiment with a PR against Base? One approach would be to use promote for that; you could add a promote_rule method for AbstractArray (similar to what I recently did for Pair at JuliaLang/julia#19171).

Related; CategoricalArrays might need a special vcat(V1::CategoricalArray,V::AbstractVector...) that either creates a regular Array or arrange so that a[pos:p1] = Vk in typed_vcat uses the equivalent of CategoricalArrays.copy! somehow. But's that's really an issue for CategoricalArrays not DataFrames.

It seems that indeed a[pos:p1] = Vk should be equivalent to copy!, shouldn't it? But could you elaborate on how it's different? As regards handling the levels?

CategoricalArray also likely need specific methods so that concatenation with AbstractArray{<:Nullable} gives a NullableCategoricalArray. Ideally it would be done via the promote mechanism, but currently it seems we need a vcat method.

@gustafsson
Copy link
Contributor Author

gustafsson commented Nov 15, 2016

Since we don't handle setindex!(::CategoricalArray, ::AbstractArray, ::AbstractArray) the call to a[pos:p1] = Vk will call the default implementation in base which in turn will call setindex! for each element. This doesn't use the efficient concatenation implemented in copy! and vcat for CategoricalArray. Might just be a simple fix of implementing a setindex! that just calls copy!. I'm not sure if there are any side-effects to consider.

end
i += size(df, 1)
nrows = sum(nrow, dfs)
for colnam in unique(Base.flatten(names.(dfs)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flatten lives under Base.Iterators on Julia 0.6. You can just test this using isdefined and do using Base: flatten or using Base.Iterators: flatten at the top of the file.

end
end
else
# warn("Unstable return types: ", C, " from vcat of ", [typeof(dfs[i][colnam]) for i in 1:length(dfs) if k[i]])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remove this once it's ready?

@nalimilan
Copy link
Member

OK. I'd rather implement copy! in terms of setindex!, though it doesn't really matter.

There's still a failure which appears to be real (besides the flatten problem on 0.6).

@nalimilan
Copy link
Member

Is there anything blocking here?

@ararslan
Copy link
Member

@nalimilan If you mean blocking a merge, your review still says "requested changes" and all checks failed on all platforms...

@nalimilan
Copy link
Member

I just wondered whether there were hard issues to tackle to get this to a mergeable state.

@nalimilan
Copy link
Member

@gustafsson Can you revise this so that we merge it?

@nalimilan
Copy link
Member

Actually, this cannot be merged until Base provides a mechanism for cat promotion (JuliaLang/julia#20815). See also discussion at https://github.com/JuliaData/DataTables.jl/pull/30/files#r105499372.

@nalimilan
Copy link
Member

@cjprybol Do you think we still need this, or did you implement it in DataTables? (See the last four commits.)

@cjprybol
Copy link
Contributor

I think everything here should be covered by what we merged in JuliaData/DataTables.jl#45. And until something like JuliaLang/julia#20815 is implemented in Base, we can't rely on Base.vcat anyway, so this looks like it follows similar logic to what we have implemented currently.

@cjprybol
Copy link
Contributor

Thanks for your efforts here @gustafsson! Sorry it got left behind while we were experimenting in the DataTables package

@cjprybol cjprybol closed this Sep 12, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.