rbind giving errors with filter #606

jrvianna · 2014-09-18T20:15:59Z

Joining two data frames with rbind sometimes produces an object that appears to be correct but has an incorrect structure, which gives errors when using the filter function. More details and comments at: http://stackoverflow.com/questions/25919927/rbind-tbl-and-df-gives-errors-with-filter?noredirect=1#comment40576529_25919927

library(dplyr)

df1 <- data.frame(
   group = factor(rep(c("C", "G"), 5)),
   value = 1:10)
df1 <- df1 %>% group_by(group) # df1 is now a grouped tbl
df2 <- data.frame(
   group = factor(rep("G", 10)),
   value = 11:20)
df3 <- rbind(df1, df2) # df2 is a plain data.frame
df3 %>% filter(group == "C") # returns the filtered rows of df1 and all rows of df2
Source: local data frame [15 x 2]
Groups: group

  group value
1      C     1
2      C     3
3      C     5
4      C     7
5      C     9
6      G    11
7      G    12
8      G    13
9      G    14
10     G    15
11     G    16
12     G    17
13     G    18
14     G    19
15     G    20

romainfrancois · 2014-09-21T10:31:35Z

rbind.data.frame keeps attributes from the first data frame for some reason:

> str(df3)
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 20 obs. of  2 variables:
 $ group: Factor w/ 2 levels "C","G": 1 2 1 2 1 2 1 2 1 2 ...
 $ value: int  1 2 3 4 5 6 7 8 9 10 ...
 - attr(*, "vars")=List of 1
  ..$ : symbol group
 - attr(*, "drop")= logi TRUE
 - attr(*, "indices")=List of 2
  ..$ : int  0 2 4 6 8
  ..$ : int  1 3 5 7 9
 - attr(*, "group_sizes")= int  5 5
 - attr(*, "biggest_group_size")= int 5
 - attr(*, "labels")='data.frame':  2 obs. of  1 variable:
  ..$ group: Factor w/ 2 levels "C","G": 1 2
  ..- attr(*, "vars")=List of 1
  .. ..$ : symbol group

I also tried to add an rbind.tbl_df method using rbind_list, but I must be dumb or something because this is what I get:

> rbind.tbl_df <- function(...) rbind_list(...)
> rbind(df1, df2)
    group     value
df1 factor,10 Integer,10
df2 factor,10 Integer,10

Relatedly: https://twitter.com/romain_francois/status/513634697600331776
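
The odd result above is consistent with the dispatch rule documented in `?rbind`: a method is only used if every argument resolves to the same one, otherwise R falls through to its internal default. A small base R sketch of that rule, using a hypothetical class `mytbl` that does not exist in any package:

```r
# Hypothetical class "mytbl" with its own rbind method.
rbind.mytbl <- function(...) "rbind.mytbl was called"

a <- structure(data.frame(x = 1:2), class = c("mytbl", "data.frame"))
b <- data.frame(x = 3:4)

# Both arguments resolve to rbind.mytbl, so that method is used.
same <- rbind(a, a)

# a resolves to rbind.mytbl and b to rbind.data.frame. The methods differ,
# so R falls back to the internal default, which treats each data frame as
# a list and stacks them into a matrix of list cells.
mixed <- rbind(a, b)
is.matrix(mixed)
dim(mixed)
```

This is why defining a method for only one of the two classes does not help when a tbl is combined with a plain data frame.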

romainfrancois · 2014-09-21T11:06:12Z

Anyway, back to the original problem, one thing I could do is somehow check consistency between the attributes of a grouped data frame and the data frame itself, i.e.

> sum( attr(df3, "group_sizes") )
[1] 10
> nrow(df3)
[1] 20

Those are different, so the object is not a valid grouped_df. However, enforcing this might get in the way of other forms of grouping, because it assumes that in a grouped_df every row belongs to one and only one group. @hadley ?
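
The comparison above could be wrapped in a small helper. This is only a sketch: `has_consistent_group_sizes` is a hypothetical name, not a dplyr function, and it assumes the `group_sizes` attribute shown in the str() output earlier in this thread. The test objects are hand-built stand-ins rather than real grouped_df objects.

```r
# Hypothetical helper: do the stored group sizes add up to the row count?
has_consistent_group_sizes <- function(x) {
  sizes <- attr(x, "group_sizes")
  !is.null(sizes) && sum(sizes) == nrow(x)
}

# Stand-in for a valid object: 10 rows, two groups of 5.
ok <- data.frame(value = 1:10)
attr(ok, "group_sizes") <- c(5L, 5L)

# Stand-in for the rbind result: 20 rows, but sizes still describe 10.
bad <- data.frame(value = 1:20)
attr(bad, "group_sizes") <- c(5L, 5L)

has_consistent_group_sizes(ok)
has_consistent_group_sizes(bad)
```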

romainfrancois · 2014-09-21T11:09:36Z

This also has undesired effects for other verbs, e.g.:

> df3 %>% mutate( value2 = value + 1 )
Source: local data frame [20 x 3]
Groups: group

   group value         value2
1      C     1   2.000000e+00
2      G     2   3.000000e+00
3      C     3   4.000000e+00
4      G     4   5.000000e+00
5      C     5   6.000000e+00
6      G     6   7.000000e+00
7      C     7   8.000000e+00
8      G     8   9.000000e+00
9      C     9   1.000000e+01
10     G    10   1.100000e+01
11     G    11 -1.218652e-280
12     G    12 -5.964388e-181
13     G    13   1.397168e-78
14     G    14 -3.526465e+102
15     G    15   7.976847e-01
16     G    16   2.036815e-71
17     G    17  8.527626e-249
18     G    18 -3.508129e+277
19     G    19   8.292691e-88
20     G    20  6.934274e-310

romainfrancois · 2014-09-21T11:24:28Z

I think the problem is that I'm implicitly assuming what I described above: for a grouped_df, every row is in one and only one group.

So I could either:

- assert that early, i.e. whenever I create a GroupedDataFrame
- fix it. This would change the implementation of mutate, where we could no longer rely on a shallow copy of the columns unless we were sure that the assumption holds.

The next problem is that asserting the assumption might be expensive. We can check group_sizes cheaply enough, but that alone would not be enough.
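
To illustrate why summing the group sizes is not sufficient, here is a sketch of the stricter test that every row appears in exactly one group. It assumes the 0-based `indices` attribute layout shown in the str() output earlier in this thread; `rows_partitioned` is a hypothetical helper, and the objects are hand-built stand-ins.

```r
# Hypothetical helper: every row index 0..(n-1) must occur exactly once
# across all groups. Needs a sort over all rows, so it costs more than
# comparing a sum with nrow().
rows_partitioned <- function(x) {
  idx <- as.integer(unlist(attr(x, "indices")))
  length(idx) == nrow(x) && identical(sort(idx), seq_len(nrow(x)) - 1L)
}

# Valid stand-in: rows 0 and 2 in one group, rows 1 and 3 in the other.
valid <- data.frame(value = 1:4)
attr(valid, "indices") <- list(c(0L, 2L), c(1L, 3L))

# Corrupt stand-in: group sizes sum to 4, which equals nrow(), yet row 1
# is listed twice and row 3 is in no group.
overlap <- data.frame(value = 1:4)
attr(overlap, "indices") <- list(c(0L, 1L), c(1L, 2L))

rows_partitioned(valid)
rows_partitioned(overlap)
```

The second object would pass a check based on sizes alone, which is the gap described above.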

So far, we've sort of worked under the assumption that we create the grouped_df object and therefore we know how to do that. The problem is that rbind creates an invalid grouped_df object.

Perhaps that is a documentation issue and we should not use rbind in the first place; we have rbind_list anyway.
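
A sketch of what avoiding rbind could look like for the example from the original report. It assumes a later dplyr release in which `bind_rows()` is available as the successor to `rbind_list()`. Ungrouping first and regrouping afterwards is a cautious choice made for this illustration, not something prescribed in this thread.

```r
library(dplyr)

df1 <- data.frame(group = factor(rep(c("C", "G"), 5)), value = 1:10) %>%
  group_by(group)
df2 <- data.frame(group = factor(rep("G", 10)), value = 11:20)

# Drop the grouping, combine with dplyr's own binder, then regroup so the
# grouping metadata is rebuilt for all 20 rows.
df3 <- bind_rows(ungroup(df1), df2) %>% group_by(group)

# filter() now sees consistent metadata and keeps only the "C" rows.
res <- df3 %>% filter(group == "C")
nrow(res)
```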

Anyway, I'll hold off on this one until I get some guidance on what to do.

romainfrancois · 2014-09-22T06:21:24Z

I've implemented the test in GroupedDataFrame. https://github.com/hadley/dplyr/blob/f716b022d8cd0a023b83c6696c8e30fdcaaa6c32/inst/include/dplyr/GroupedDataFrame.h#L40

            if( !is_lazy ){
                // check consistency of the groups
                int rows_in_groups = sum(group_sizes) ;
                if( data_.nrows() != rows_in_groups ){
                    std::stringstream s ; 
                    s << "corrupt 'grouped_df', contains "
                      << data_.nrows()
                      << " rows, and "
                      << rows_in_groups
                      << " rows in groups" ;
                    stop(s.str()) ;
                }
            }

As said above, this does not guarantee complete coherence of the grouped_df, but at least it catches corrupt objects such as the one made by rbind in this case.

hadley · 2014-09-22T12:13:36Z

I'll see if I can figure out how to make rbind() behave properly in this scenario.

romainfrancois · 2014-09-22T12:23:14Z

Good luck with that; I'd be curious how this is done.

hadley · 2014-09-22T19:28:45Z

Hmmm, ok, I guess it's unfixable due to the crazy dispatch that rbind() uses.

@arunsrinivasan do you do anything to fix this for data.table?

arunsrinivasan · 2014-09-23T14:36:56Z

@hadley FAQ 2.23 pretty much explains the issue and the current workaround data.table has...

hadley · 2014-09-23T14:43:19Z

@arunsrinivasan and R CMD check lets you get away with that?

arunsrinivasan · 2014-09-23T15:08:30Z

@hadley, that's what Matt, after exhausting all other options (as explained in the FAQ), has managed to do to get around this issue. It'd be great if someone comes up with a better fix...

romainfrancois · 2014-09-23T16:22:59Z

R CMD check does not have to know ;)

romainfrancois self-assigned this Sep 21, 2014

romainfrancois added the bug an unexpected problem or unintended behavior label Sep 21, 2014

romainfrancois added this to the 0.3 milestone Sep 21, 2014

romainfrancois added a commit that referenced this issue Sep 22, 2014

GroupedDataFrame performs some checks on the grouped_df objects, …

f716b02

…which prevent some issues related to corrupt `grouped_df` objects as the one made by rbind (#606).

hadley assigned hadley and unassigned romainfrancois Sep 22, 2014

hadley closed this as completed Sep 22, 2014

krlmlr mentioned this issue Jan 26, 2017

Provide rbind method tidyverse/tibble#34

Open

krlmlr mentioned this issue Feb 10, 2017

rbind on grouped data produces a "nested data frame" #2138

Closed

lock bot locked as resolved and limited conversation to collaborators Jun 10, 2018
