Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allow :union as cols kwarg in push! and append! #2152

Merged
merged 15 commits into from
Apr 16, 2020

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Mar 13, 2020

Fixes #2150

@kleinschmidt and @nalimilan - I have implemented what I think is a useful minimal functionality. As usual things were more complex than I have initially expected.

(also comments helping to refine a docstring are welcome, as I usually do not get it right)

@bkamins bkamins added the non-breaking The proposed change is not breaking label Mar 13, 2020
@bkamins bkamins added this to the 1.x milestone Mar 13, 2020
Copy link
Contributor

@kleinschmidt kleinschmidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Only issue is that the widening seems aggressive, but you know the tradeoffs here better than I do. I'd be happy to just have it widen in the case that missings are introduced.

Edit: also needs tests but you probably knew that already ;)

@bkamins
Copy link
Member Author

bkamins commented Mar 13, 2020

Only issue is that the widening seems aggressive, but you know the tradeoffs here better than I do. I'd be happy to just have it widen in the case that missings are introduced.

Such considerations were exactly the reason why last time we have decided not to add this functionality, because there were too many options.

My thinking was that if we want something simple then it should be flexible, i.e. if the user passes :union then it is safe to assume that the operation will not fail due to promotion problems (it might fail due to aliasing, but this is a corner case that probably normal users do not have to consider). Therefore you can safely write something like:

df = DataFrame()
for row in some_source_of_rows
    push!(df, row, cols=:union)
end

and expect it will not fail.
So essentially this is like reading a CSV or JSON file row by row and doing automatic detection of column type on the go.

Of course we can be more defensive - as you propose. The question is in what cases you would benefit from push! throwing an error when type promotion would be required (a typical problem is that you have a column of Int (as e.g. source data were rouned to integers) and later there is some Float64 in your data and without automatic promotion push!-ing would fail). Another example is that you initially have a missing in some column that you push!, so you create Vector{Missing} and later - if you have actual data in the column without allowing promotion you will have an error.

A midway solution would be to allow promotion only if promote_type produces a concrete type or Union{concrete_type, Missing} or Union{previous_eltype_of_column,Missing}, and fail if it produces an abstract type.

So in short - I am open to change what I have proposed, and all options have pros and cons unfortunately.

Edit: also needs tests but you probably knew that already ;)

What do you mean by this (I thought I have provided an adequate test coverage). The tests I have written pass, the ❌ is only due to unrelated CI failures.

@kleinschmidt
Copy link
Contributor

I don't see any changes to test files here, just the source...

@pdeffebach
Copy link
Contributor

I would have expected something like

push!(df, x; cols = :union, promote_type = true)

Where cols = :union automatically promotes to missing but errors otherwise, except when promote_type is used.

@bkamins
Copy link
Member Author

bkamins commented Mar 13, 2020

@kleinschmidt - thank you (I have forgotten to stage the test file - pushed now 😄)

@bkamins
Copy link
Member Author

bkamins commented Mar 13, 2020

@pdeffebach - intuitively I would prefer to keep a single kwarg not to complicate the design and pick the behavior of :union that is most useful.

Given this consideration I preferred to make :union do promotion as this seemed to me "more useful".

In general I think you shoulud use :union if you are unsure what you will get. If you know what you expect to get I think one should construct a DataFrame with all needed columns empty and of correct types and then use cols=:subset in push!.

@pdeffebach
Copy link
Contributor

In general I think you shoulud use :union if you are unsure what you will get.

Sounds good! :union is an appropriate term for this, with its double meaning and all.

@bkamins
Copy link
Member Author

bkamins commented Mar 13, 2020

So this was my thinking. But just to make sure we are on the right track. Let us try to think of a case when this kind of promotion in :union would be a bad thing. I.e. in what scenario one would prefer to get an error rather than auto-promotion. I could not come up with such a convincing scenario, but maybe there is some that would change our minds. Thank you for the time you spend on it!

@pdeffebach
Copy link
Contributor

I agree that no one who is too "lazy" to make sure all their named-tuples have the same number of fields is going to care that much about unexpected autopromotion.

The only scenario I can think of is someone who wants auto-promotion of types but doesn't realize their named tuples have different elements. Then they unexpectedly create new columns.

But this seems very unlikely.

  1. The vast majority of use-cases is someone applying a function that returns a named tuple and then pushes to the data frame in a loop. Why would the function return an inconsistent set of fields?

  2. Perhaps they are parsing a string. Then they should use CSV.jl or JSONTables.jl, if they aren't doing that they are probably an advanced user who can write their own function which normalizes fields.

The only other concern would be discoverability. This is solved by good documentation and perhaps a few discourse answers.

@kleinschmidt
Copy link
Contributor

I think the scenario I had in mind was more that someone would use cols=:union once and then expect that auto-promotion is the default for the other options as well.

@kleinschmidt
Copy link
Contributor

But I agree that's unlikely given that CSV.jl/JSON3.jl handle the promotion for you....

@bkamins
Copy link
Member Author

bkamins commented Mar 14, 2020

I think the scenario I had in mind was more that someone would use cols=:union once

Yeah, I was afraid of it too. I even considered adding push!! instead of cols=:union to avoid this duality. But I came to the conclusion that:

  • if someone uses push! one has to read the documentation in order to use cols=:union (as otherwise how can you learn about it)
  • if you do not do cols=:union in push! and get an error - well - then you read the documentation to learn that cols=:union is a special case

Because of this in the initial message in this PR I have made the comment

(also comments helping to refine a docstring are welcome, as I usually do not get it right)

as proper documentation is important here.

The key concern was that push! in Base clearly does not auto-promote and will never do (because it is not able to mutate the type of the container as opposed to the case of a DataFrame). And again - exactly for these reasons we did not implement :union for push! in the past.

So in summary we have the situation:

  1. have a small inconsistency in push! as proposed in current PR which I simply think is a convenient functionality for its use case (reading data from an unknown source) at the cost of a small inconsistency
  2. not have this functionality at all (and here the point of @kleinschmidt initial comment was that he wanted it 😄), or have a separate function for it (like push!! which has a problem that I feared that would be less discoverable by users than an additional option to push!), or further complicate the API of push! with another kwarg (I wanted to avoid this as `push! is already complex as witnessed by its docstring)

I understand that we converge to accepting what is currently implemented in the PR - right?

Thank you for discussing this.

@bkamins bkamins linked an issue Mar 21, 2020 that may be closed by this pull request
@nalimilan
Copy link
Member

Sorry for not commenting earlier. I think I'd rather have a separate promote keyword argument. That makes things simpler and more powerful, as various combinations with cols could make sense. Apart from the ones with cols=:union, I think it can be not too uncommon with the default cols=:setequal too. As @bkamins mentioned, if you start with one row (e.g. when running computations in a loop and aggregating results) and later get missing values (or the reverse), you need promotion, yet it would be weird (and hard to find) to use cols=:union since you don't want to add new columns at all. The same thing can happen if you want to append a new dataset loaded from CSV (for append!).

If we want a flexible mode, we could have another argument for convenience which will set cols=:union, promote=true.

@bkamins
Copy link
Member Author

bkamins commented Mar 26, 2020

What you propose is essentially the initial design that was planned.

It is more complex, but we can go for it. I would simply set promote column by default to different values depending on cols column.

I think it is natural that for :union and :subset it is set to true (to allow population with missing if needed as a common case) and for all other it can be set to false.

If we go for it in a separate PR I can update append! to be synced with push! as it would go along the same logic as you note.

@nalimilan - please confirm that you are OK with this plan and then I can go forward with implementing the functionality.

@bkamins
Copy link
Member Author

bkamins commented Mar 28, 2020

@nalimilan - I have implemented what you proposed. In the end it seems it is not that problematic as I feared. The only issue is that if we have column aliases then with promote=true we will not be able to revert the DataFrame to its original state, but I think it is acceptable.

@bkamins bkamins changed the title allow :union as cols kwarg in push! allow :union as cols kwarg in push! and append! Apr 9, 2020
@bkamins
Copy link
Member Author

bkamins commented Apr 9, 2020

I have added the same arguments to append! to support #2173. It is left to update the tests now (I will do it soon).

@bkamins
Copy link
Member Author

bkamins commented Apr 9, 2020

@nalimilan - all tests added.

bkamins and others added 5 commits April 12, 2020 23:13
Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins
Copy link
Member Author

bkamins commented Apr 12, 2020

Thank you for a huge effort to review this. I hope I have fixed everything.

@bkamins
Copy link
Member Author

bkamins commented Apr 14, 2020

CI passes here.

@bkamins bkamins merged commit 580c421 into JuliaData:master Apr 16, 2020
@bkamins bkamins deleted the union_in_push branch April 16, 2020 12:37
@bkamins
Copy link
Member Author

bkamins commented Apr 16, 2020

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
non-breaking The proposed change is not breaking
Projects
None yet
4 participants