allow :union as cols kwarg in push! and append! #2152

bkamins · 2020-03-13T09:58:53Z

@kleinschmidt and @nalimilan - I have implemented what I think is a useful minimal functionality. As usual things were more complex than I have initially expected.

(also comments helping to refine a docstring are welcome, as I usually do not get it right)

kleinschmidt

Looks good. Only issue is that the widening seems aggressive, but you know the tradeoffs here better than I do. I'd be happy to just have it widen in the case that missings are introduced.

Edit: also needs tests but you probably knew that already ;)

src/dataframe/dataframe.jl

bkamins · 2020-03-13T18:44:57Z

Only issue is that the widening seems aggressive, but you know the tradeoffs here better than I do. I'd be happy to just have it widen in the case that missings are introduced.

Such considerations were exactly the reason why last time we have decided not to add this functionality, because there were too many options.

My thinking was that if we want something simple then it should be flexible, i.e. if the user passes :union then it is safe to assume that the operation will not fail due to promotion problems (it might fail due to aliasing, but this is a corner case that probably normal users do not have to consider). Therefore you can safely write something like:

df = DataFrame()
for row in some_source_of_rows
    push!(df, row, cols=:union)
end

and expect it will not fail.
So essentially this is like reading a CSV or JSON file row by row and doing automatic detection of column type on the go.

Of course we can be more defensive - as you propose. The question is in what cases you would benefit from push! throwing an error when type promotion would be required (a typical problem is that you have a column of Int (as e.g. source data were rouned to integers) and later there is some Float64 in your data and without automatic promotion push!-ing would fail). Another example is that you initially have a missing in some column that you push!, so you create Vector{Missing} and later - if you have actual data in the column without allowing promotion you will have an error.

A midway solution would be to allow promotion only if promote_type produces a concrete type or Union{concrete_type, Missing} or Union{previous_eltype_of_column,Missing}, and fail if it produces an abstract type.

So in short - I am open to change what I have proposed, and all options have pros and cons unfortunately.

Edit: also needs tests but you probably knew that already ;)

What do you mean by this (I thought I have provided an adequate test coverage). The tests I have written pass, the ❌ is only due to unrelated CI failures.

kleinschmidt · 2020-03-13T18:56:40Z

I don't see any changes to test files here, just the source...

pdeffebach · 2020-03-13T19:26:00Z

I would have expected something like

push!(df, x; cols = :union, promote_type = true)

Where cols = :union automatically promotes to missing but errors otherwise, except when promote_type is used.

bkamins · 2020-03-13T19:54:25Z

@kleinschmidt - thank you (I have forgotten to stage the test file - pushed now 😄)

bkamins · 2020-03-13T20:00:45Z

@pdeffebach - intuitively I would prefer to keep a single kwarg not to complicate the design and pick the behavior of :union that is most useful.

Given this consideration I preferred to make :union do promotion as this seemed to me "more useful".

In general I think you shoulud use :union if you are unsure what you will get. If you know what you expect to get I think one should construct a DataFrame with all needed columns empty and of correct types and then use cols=:subset in push!.

pdeffebach · 2020-03-13T22:35:05Z

In general I think you shoulud use :union if you are unsure what you will get.

Sounds good! :union is an appropriate term for this, with its double meaning and all.

bkamins · 2020-03-13T22:48:15Z

So this was my thinking. But just to make sure we are on the right track. Let us try to think of a case when this kind of promotion in :union would be a bad thing. I.e. in what scenario one would prefer to get an error rather than auto-promotion. I could not come up with such a convincing scenario, but maybe there is some that would change our minds. Thank you for the time you spend on it!

pdeffebach · 2020-03-13T23:49:20Z

I agree that no one who is too "lazy" to make sure all their named-tuples have the same number of fields is going to care that much about unexpected autopromotion.

The only scenario I can think of is someone who wants auto-promotion of types but doesn't realize their named tuples have different elements. Then they unexpectedly create new columns.

But this seems very unlikely.

The vast majority of use-cases is someone applying a function that returns a named tuple and then pushes to the data frame in a loop. Why would the function return an inconsistent set of fields?
Perhaps they are parsing a string. Then they should use CSV.jl or JSONTables.jl, if they aren't doing that they are probably an advanced user who can write their own function which normalizes fields.

The only other concern would be discoverability. This is solved by good documentation and perhaps a few discourse answers.

kleinschmidt · 2020-03-14T01:14:48Z

I think the scenario I had in mind was more that someone would use cols=:union once and then expect that auto-promotion is the default for the other options as well.

kleinschmidt · 2020-03-14T01:15:15Z

But I agree that's unlikely given that CSV.jl/JSON3.jl handle the promotion for you....

bkamins · 2020-03-14T09:51:39Z

I think the scenario I had in mind was more that someone would use cols=:union once

Yeah, I was afraid of it too. I even considered adding push!! instead of cols=:union to avoid this duality. But I came to the conclusion that:

if someone uses push! one has to read the documentation in order to use cols=:union (as otherwise how can you learn about it)
if you do not do cols=:union in push! and get an error - well - then you read the documentation to learn that cols=:union is a special case

Because of this in the initial message in this PR I have made the comment

(also comments helping to refine a docstring are welcome, as I usually do not get it right)

as proper documentation is important here.

The key concern was that push! in Base clearly does not auto-promote and will never do (because it is not able to mutate the type of the container as opposed to the case of a DataFrame). And again - exactly for these reasons we did not implement :union for push! in the past.

So in summary we have the situation:

have a small inconsistency in push! as proposed in current PR which I simply think is a convenient functionality for its use case (reading data from an unknown source) at the cost of a small inconsistency
not have this functionality at all (and here the point of @kleinschmidt initial comment was that he wanted it 😄), or have a separate function for it (like push!! which has a problem that I feared that would be less discoverable by users than an additional option to push!), or further complicate the API of push! with another kwarg (I wanted to avoid this as `push! is already complex as witnessed by its docstring)

I understand that we converge to accepting what is currently implemented in the PR - right?

Thank you for discussing this.

nalimilan · 2020-03-26T10:53:54Z

Sorry for not commenting earlier. I think I'd rather have a separate promote keyword argument. That makes things simpler and more powerful, as various combinations with cols could make sense. Apart from the ones with cols=:union, I think it can be not too uncommon with the default cols=:setequal too. As @bkamins mentioned, if you start with one row (e.g. when running computations in a loop and aggregating results) and later get missing values (or the reverse), you need promotion, yet it would be weird (and hard to find) to use cols=:union since you don't want to add new columns at all. The same thing can happen if you want to append a new dataset loaded from CSV (for append!).

If we want a flexible mode, we could have another argument for convenience which will set cols=:union, promote=true.

bkamins · 2020-03-26T12:14:47Z

What you propose is essentially the initial design that was planned.

It is more complex, but we can go for it. I would simply set promote column by default to different values depending on cols column.

I think it is natural that for :union and :subset it is set to true (to allow population with missing if needed as a common case) and for all other it can be set to false.

If we go for it in a separate PR I can update append! to be synced with push! as it would go along the same logic as you note.

@nalimilan - please confirm that you are OK with this plan and then I can go forward with implementing the functionality.

bkamins · 2020-03-28T12:42:24Z

@nalimilan - I have implemented what you proposed. In the end it seems it is not that problematic as I feared. The only issue is that if we have column aliases then with promote=true we will not be able to revert the DataFrame to its original state, but I think it is acceptable.

bkamins · 2020-04-09T16:39:45Z

I have added the same arguments to append! to support #2173. It is left to update the tests now (I will do it soon).

bkamins · 2020-04-09T18:05:25Z

@nalimilan - all tests added.

src/dataframe/dataframe.jl

src/dataframerow/dataframerow.jl

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

bkamins · 2020-04-12T22:09:03Z

Thank you for a huge effort to review this. I hope I have fixed everything.

bkamins · 2020-04-14T15:39:51Z

CI passes here.

src/dataframe/dataframe.jl

bkamins · 2020-04-16T12:37:44Z

Thank you!

allow :union as cols kwarg in push!

c696d85

bkamins added the non-breaking The proposed change is not breaking label Mar 13, 2020

bkamins added this to the 1.x milestone Mar 13, 2020

kleinschmidt reviewed Mar 13, 2020

View reviewed changes

src/dataframe/dataframe.jl Outdated Show resolved Hide resolved

add tests

6227e9c

bkamins linked an issue Mar 21, 2020 that may be closed by this pull request

push! which promotes type #1716

Closed

Merge branch 'master' into union_in_push

ceb0547

add promote kwarg

cf7906f

This was referenced Apr 3, 2020

add missing columns when push! ing? #2150

Closed

Sync API of append! with push! #2173

Closed

bkamins modified the milestones: 1.x, 1.0 Apr 6, 2020

bkamins mentioned this pull request Apr 8, 2020

push! with cols=:subset is not allowed if there is no missing in previous data #2179

Closed

union in append!

71dc718

bkamins changed the title ~~allow :union as cols kwarg in push!~~ allow :union as cols kwarg in push! and append! Apr 9, 2020

This was linked to issues Apr 9, 2020

Sync API of append! with push! #2173

Closed

Unify push!, append! and vcat implementation. #2032

Closed

This was referenced Apr 9, 2020

Unify push!, append! and vcat implementation. #2032

Closed

Unify push!, append! and vcat implementation. #2034

Closed

bkamins added 3 commits April 9, 2020 19:53

add tests

d1f6b35

Merge branch 'master' into union_in_push

336a0fb

small test cleanup (not worth a new PR)

e9fe49d

nalimilan reviewed Apr 12, 2020

View reviewed changes

bkamins and others added 5 commits April 12, 2020 23:13

Apply suggestions from code review

a85fcfd

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

Apply suggestions from code review

400bb92

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

Apply suggestions from code review

82e6f12

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

Apply suggestions from code review

5fef7c2

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

corrections after code review

85ea86b

nalimilan approved these changes Apr 16, 2020

View reviewed changes

added code comment and a minor layout update

fef36b3

nalimilan reviewed Apr 16, 2020

View reviewed changes

src/dataframe/dataframe.jl Show resolved Hide resolved

one more comment

2e3b697

nalimilan approved these changes Apr 16, 2020

View reviewed changes

bkamins merged commit 580c421 into JuliaData:master Apr 16, 2020

bkamins deleted the union_in_push branch April 16, 2020 12:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

allow :union as cols kwarg in push! and append! #2152

allow :union as cols kwarg in push! and append! #2152

bkamins commented Mar 13, 2020

kleinschmidt left a comment •

edited

Loading

bkamins commented Mar 13, 2020

kleinschmidt commented Mar 13, 2020

pdeffebach commented Mar 13, 2020

bkamins commented Mar 13, 2020

bkamins commented Mar 13, 2020

pdeffebach commented Mar 13, 2020

bkamins commented Mar 13, 2020

pdeffebach commented Mar 13, 2020

kleinschmidt commented Mar 14, 2020

kleinschmidt commented Mar 14, 2020

bkamins commented Mar 14, 2020

nalimilan commented Mar 26, 2020

bkamins commented Mar 26, 2020

bkamins commented Mar 28, 2020

bkamins commented Apr 9, 2020

bkamins commented Apr 9, 2020

bkamins commented Apr 12, 2020

bkamins commented Apr 14, 2020

bkamins commented Apr 16, 2020

allow :union as cols kwarg in push! and append! #2152

allow :union as cols kwarg in push! and append! #2152

Conversation

bkamins commented Mar 13, 2020

kleinschmidt left a comment • edited Loading

Choose a reason for hiding this comment

bkamins commented Mar 13, 2020

kleinschmidt commented Mar 13, 2020

pdeffebach commented Mar 13, 2020

bkamins commented Mar 13, 2020

bkamins commented Mar 13, 2020

pdeffebach commented Mar 13, 2020

bkamins commented Mar 13, 2020

pdeffebach commented Mar 13, 2020

kleinschmidt commented Mar 14, 2020

kleinschmidt commented Mar 14, 2020

bkamins commented Mar 14, 2020

nalimilan commented Mar 26, 2020

bkamins commented Mar 26, 2020

bkamins commented Mar 28, 2020

bkamins commented Apr 9, 2020

bkamins commented Apr 9, 2020

bkamins commented Apr 12, 2020

bkamins commented Apr 14, 2020

bkamins commented Apr 16, 2020

kleinschmidt left a comment •

edited

Loading