rbind giving errors with filter #606

Closed
jrvianna opened this issue Sep 18, 2014 · 12 comments
@jrvianna

Joining two data frames with rbind sometimes produces an object that appears to be correct but has an inconsistent structure, and that gives incorrect results when used with the filter function. More details and comments at: http://stackoverflow.com/questions/25919927/rbind-tbl-and-df-gives-errors-with-filter?noredirect=1#comment40576529_25919927

library(dplyr)

df1 <- data.frame(
   group = factor(rep(c("C", "G"), 5)),
   value = 1:10)
df1 <- df1 %>% group_by(group)  # df1 is now a grouped tbl
df2 <- data.frame(
   group = factor(rep("G", 10)),
   value = 11:20)
df3 <- rbind(df1, df2)          # df2 is a plain data.frame
df3 %>% filter(group == "C")    # returns the filtered rows of df1 plus all rows of df2
Source: local data frame [15 x 2]
Groups: group

  group value
1      C     1
2      C     3
3      C     5
4      C     7
5      C     9
6      G    11
7      G    12
8      G    13
9      G    14
10     G    15
11     G    16
12     G    17
13     G    18
14     G    19
15     G    20
@romainfrancois romainfrancois self-assigned this Sep 21, 2014
@romainfrancois romainfrancois added the bug an unexpected problem or unintended behavior label Sep 21, 2014
@romainfrancois romainfrancois added this to the 0.3 milestone Sep 21, 2014
@romainfrancois
Member

rbind.data.frame keeps attributes from the first data frame for some reason:

> str(df3)
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 20 obs. of  2 variables:
 $ group: Factor w/ 2 levels "C","G": 1 2 1 2 1 2 1 2 1 2 ...
 $ value: int  1 2 3 4 5 6 7 8 9 10 ...
 - attr(*, "vars")=List of 1
  ..$ : symbol group
 - attr(*, "drop")= logi TRUE
 - attr(*, "indices")=List of 2
  ..$ : int  0 2 4 6 8
  ..$ : int  1 3 5 7 9
 - attr(*, "group_sizes")= int  5 5
 - attr(*, "biggest_group_size")= int 5
 - attr(*, "labels")='data.frame':  2 obs. of  1 variable:
  ..$ group: Factor w/ 2 levels "C","G": 1 2
  ..- attr(*, "vars")=List of 1
  .. ..$ : symbol group
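
The same retention can be reproduced without dplyr. This is a minimal base R sketch, not taken from the thread: the attribute name `note` is made up for illustration, and the behaviour is what the str() output above suggests rbind.data.frame does, not a documented guarantee.

```r
# Two plain data frames; only the first carries an extra attribute.
a <- data.frame(value = 1:2)
b <- data.frame(value = 3:4)
attr(a, "note") <- "describes 2 rows"  # hypothetical metadata, playing the role of group_sizes

ab <- rbind(a, b)

nrow(ab)          # row count of the combined data frame
attr(ab, "note")  # if retained, this metadata still describes only `a`
```

If the attribute survives, it describes only the first argument, which is the same kind of staleness as the grouping attributes on df3.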

I also tried to add an rbind.tbl_df method using rbind_list, but I must be dumb or something, because this is what I get:

> rbind.tbl_df <- function(...) rbind_list(...)
> rbind(df1, df2)
    group     value
df1 factor,10 Integer,10
df2 factor,10 Integer,10

Relatedly: https://twitter.com/romain_francois/status/513634697600331776

@romainfrancois
Member

Anyway, back to the original problem, one thing I could do is somehow check consistency between the attributes of a grouped data frame and the data frame itself, i.e.

> sum( attr(df3, "group_sizes") )
[1] 10
> nrow(df3)
[1] 20

Those are different, so the object is not a valid grouped_df. However, such a check might get in the way of other forms of grouping, because it assumes that every row of a grouped_df belongs to one and only one group. @hadley ?
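
The comparison above could be wrapped in a small helper. This is only a sketch in base R: `groups_consistent` is a hypothetical name, and the `group_sizes` attribute is the one shown in the str() output in this thread, which other dplyr versions may not use.

```r
# Hypothetical helper: compare the cached group sizes with the row count.
groups_consistent <- function(x) {
  sizes <- attr(x, "group_sizes")
  if (is.null(sizes)) return(TRUE)  # no cached sizes, so nothing to contradict
  sum(sizes) == nrow(x)
}

# Base R stand-in for the corrupt object: 20 rows, metadata describing only 10.
fake <- data.frame(
  group = factor(c(rep(c("C", "G"), 5), rep("G", 10))),
  value = 1:20)
attr(fake, "group_sizes") <- c(5L, 5L)

groups_consistent(fake)  # FALSE: 10 rows in groups vs 20 rows in the data
```

This only compares totals, so it can detect the rbind case but not every inconsistent grouping.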

@romainfrancois
Member

This also has undesired effects for other verbs, e.g.:

> df3 %>% mutate( value2 = value + 1 )
Source: local data frame [20 x 3]
Groups: group

   group value         value2
1      C     1   2.000000e+00
2      G     2   3.000000e+00
3      C     3   4.000000e+00
4      G     4   5.000000e+00
5      C     5   6.000000e+00
6      G     6   7.000000e+00
7      C     7   8.000000e+00
8      G     8   9.000000e+00
9      C     9   1.000000e+01
10     G    10   1.100000e+01
11     G    11 -1.218652e-280
12     G    12 -5.964388e-181
13     G    13   1.397168e-78
14     G    14 -3.526465e+102
15     G    15   7.976847e-01
16     G    16   2.036815e-71
17     G    17  8.527626e-249
18     G    18 -3.508129e+277
19     G    19   8.292691e-88
20     G    20  6.934274e-310

@romainfrancois
Member

I think the problem is that I'm implicitly assuming what I described above: for a grouped_df, every row is in one and only one group.

So I could either:

  • assert that early, i.e. whenever I create a GroupedDataFrame
  • fix it. This would change the implementation of mutate, which could no longer rely on shallow copies of the columns unless we were sure that the assumption holds.

The next problem is that asserting the assumption might be expensive. Checking group_sizes against the number of rows is cheap, but that would not be enough.
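
To illustrate why the size check is not enough, here is a sketch of the stronger check in base R. `rows_partitioned` is a hypothetical helper, and `indices` mimics the 0-based "indices" attribute from the str() output earlier in this thread; this is not how dplyr does it internally.

```r
# Hypothetical full check: every row index appears in exactly one group.
# `indices` is a list of 0-based row indices, one vector per group.
rows_partitioned <- function(indices, n_rows) {
  all_idx <- as.integer(unlist(indices, use.names = FALSE))
  length(all_idx) == n_rows &&
    identical(sort(all_idx), seq_len(n_rows) - 1L)
}

good <- list(c(0L, 2L, 4L), c(1L, 3L, 5L))
bad  <- list(c(0L, 2L, 4L), c(0L, 1L, 3L))  # row 0 twice, row 5 missing

rows_partitioned(good, 6L)  # TRUE
rows_partitioned(bad, 6L)   # FALSE, although the group sizes still sum to 6
```

The sort makes this more costly than comparing sums on every construction of a grouped object, which is the trade-off described above.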

So far, we've sort of worked under the assumption that we create the grouped_df object and therefore we know how to do that. The problem is that rbind creates an invalid grouped_df object.

Perhaps that is a documentation issue and we should not use rbind in the first place; we have rbind_list anyway.

Anyway, I'll hold off on this one until I get some guidance on what to do.

romainfrancois added a commit that referenced this issue Sep 22, 2014
…which

  prevent some issues related to corrupt `grouped_df` objects as the one
  made by rbind (#606).
@romainfrancois
Member

I've implemented the test in GroupedDataFrame. https://github.com/hadley/dplyr/blob/f716b022d8cd0a023b83c6696c8e30fdcaaa6c32/inst/include/dplyr/GroupedDataFrame.h#L40

            if( !is_lazy ){
                // check consistency of the groups
                int rows_in_groups = sum(group_sizes) ;
                if( data_.nrows() != rows_in_groups ){
                    std::stringstream s ; 
                    s << "corrupt 'grouped_df', contains "
                      << data_.nrows()
                      << " rows, and "
                      << rows_in_groups
                      << " rows in groups" ;
                    stop(s.str()) ;
                }
            }

As said above, this does not guarantee complete coherence of the grouped_df, but at least it filters out corrupt data such as that produced by rbind in this case.

@hadley
Member

hadley commented Sep 22, 2014

I'll see if I can figure out how to make rbind() behave properly in this scenario.

@hadley hadley assigned hadley and unassigned romainfrancois Sep 22, 2014
@romainfrancois
Member

Good luck with that; I'd be curious how this is done.

@hadley
Member

hadley commented Sep 22, 2014

Hmmm, ok, I guess it's unfixable due to the crazy dispatch that rbind() uses.

@arunsrinivasan do you do anything to fix this for data.table?

@hadley hadley closed this as completed Sep 22, 2014
@arunsrinivasan
Contributor

@hadley FAQ 2.23 pretty much explains the issue and the current workaround data.table has...

@hadley
Member

hadley commented Sep 23, 2014

@arunsrinivasan and R CMD check lets you get away with that?

@arunsrinivasan
Contributor

@hadley, that's what Matt, after exhausting all other options (as explained in the FAQ), has managed to do to get around this issue. It'd be great if someone comes up with a better fix...

@romainfrancois
Member

R CMD check does not have to know ;)

@lock lock bot locked as resolved and limited conversation to collaborators Jun 10, 2018