Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unify push!, append! and vcat implementation. #2032

Closed
rapus95 opened this issue Dec 2, 2019 · 6 comments · Fixed by #2152
Closed

Unify push!, append! and vcat implementation. #2032

rapus95 opened this issue Dec 2, 2019 · 6 comments · Fixed by #2152

Comments

@rapus95
Copy link

rapus95 commented Dec 2, 2019

I intend to rewrite the internal code that does the merging of a DataFrame with another DataFrame, DataFrameRow, NamedTuple or AbstractDict.
Idea is as follows:

  • add the capabillity to specify some sort of maximum column-type widening -> promotetype
  • add a few cols column merging algorithms
  • push!, append! and vcat are just front end redirectors which call into the following functions
  • work consists of two functions:
    • infersignature(signatureA, signatureB; cols, promotetype)
    • __merge!(df, others..., newsignature)

Details:

infersignature(signatureA::Signature, signatureB::Signature; cols::Symbol, promotetype::Type)

this infers the signature of the resulting DataFrame from the provided signatures, the cols keyword
and the promotetype according to the following rules:

  1. combine the column names according to cols keyword, including:
    • :orderequal: names and order must be the same, error otherwise
    • :setequal: names can appear in any order but must be present in both signatures, error otherwise
    • :subset: the names of A must be a strict subset of (or equal to) the names of B, error otherwise
    • :superset: the names of A must be a strict superset of (or equal to) the names of B, error otherwise
    • :left: take the names of A
    • :right: take the names of B
    • :intersect: take the names that are present in both signatures
    • :union: take the names that are present in at least one signature
  2. for all column names of the new signature, promote the column types together with the promotetype keyword. It'll roughly works like this:
function promote(tA::Type, tB::Type, promotetype::Type)
  prom = promote_type(tA,tB)
  prom <: Union{promotetype,tA} && return prom
  return tA
end
  1. return the resulting signature

__merge!(df::DataFrame, others..., newsignature::Signature)

this merges all arguments into the first argument, resulting in the signature provided.
Rules:

  1. 0-column dataframes act as a true neutral element
  2. 0-row dataframes act as a type signature provider (implemented in infersignature by manipulating the signature inference step)
  3. Each column of df whose type fits the corresponding requested type is reused and resized (hence in-place). Otherwise a new column is allocated and filled.
  4. In case of any conversion errors, we reverse resizing (thus removing newly added entries) for all columns.
  5. Once we have all columns that will be needed, we replace and reorder the columns in the provided dataframe to mirror the requested signature.

Ref: #1982, #1991

@bkamins
Copy link
Member

bkamins commented Dec 2, 2019

OK - so I am closing all other issues related to this. Can you please summarize your proposal here, so that it is easy to keep track of it? Thank you!

As noted earlier - the changes will be non breaking, nevertheless, if we can make it - I think it would be great to have them in 1.0 release.

@rapus95
Copy link
Author

rapus95 commented Dec 2, 2019

OK - so I am closing all other issues related to this. Can you please summarize your proposal here, so that it is easy to keep track of it? Thank you!

As noted earlier - the changes will be non breaking, nevertheless, if we can make it - I think it would be great to have them in 1.0 release.

Summary done :D
Edit: Feedback wanted 😁 e.g. are there major design flaws which I didn't see?

@nalimilan
Copy link
Member

  1. for all column names of the new signature, promote the column types together with the promotetype keyword. It'll roughly works like this:
function promote(tA::Type, tB::Type, promotetype::Type)
  prom = promote_type(tA,tB)
  prom <: Union{promotetype,tA} && return prom
  return tA
end

I think you need to make a distinction between vcat and other concatenation functions: in vcat, all input data frames have the same role when it comes to determining the column types, so prom should always be returned. You can achieve that by passing promotetype=Any

Also :subset and :superset don't really make sense for vcat for the same reason. Note that you assume there are only two input arguments, which isn't the case for vcat and vcat!: rules have to be rephrased a bit to explain what happens when there are more than two arguments.

@bkamins
Copy link
Member

bkamins commented Dec 2, 2019

And as noted on Slack: we need to make push! and append! fast in most common cases (this implies disallowing type promotion to avoid column reallocation). With vcat it is exactly opposite (i.e. we may allow type promotion as in general vcat should be symmetric as columns have to be reallocated anyway).

@rapus95
Copy link
Author

rapus95 commented Dec 2, 2019

  1. for all column names of the new signature, promote the column types together with the promotetype keyword. It'll roughly works like this:
function promote(tA::Type, tB::Type, promotetype::Type)
  prom = promote_type(tA,tB)
  prom <: Union{promotetype,tA} && return prom
  return tA
end

I think you need to make a distinction between vcat and other concatenation functions: in vcat, all input data frames have the same role when it comes to determining the column types, so prom should always be returned. You can achieve that by passing promotetype=Any

Also :subset and :superset don't really make sense for vcat for the same reason. Note that you assume there are only two input arguments, which isn't the case for vcat and vcat!: rules have to be rephrased a bit to explain what happens when there are more than two arguments.

For more than two arguments case currenr implementation just foldls the signatures.
As __merge is an internal function, checking for allowed/useful cols arguments happens in those redirection API functions and promotetype won't be selectable for vcat but instead be fixed to Any.
Regarding fast path, specifying promotetype=Union{} will trigger that somewhat automatically I guess/hope in the current implementation since all paths collapse (except the convert which happens in copyto! and needed)

In the current type promotion style there is only reallocation if the promoted type is different from the allocated vectors. But that is necessary anyway to hold the invariant of being capable of storing tB (if prom<:Union{tA, promotetype})

@bkamins
Copy link
Member

bkamins commented Apr 9, 2020

Merging #2152 will cover this issue in the range expected for 1.0 release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants