
Ignore nulls for n_distinct() #1052

Closed
saurabhRTR opened this issue Mar 31, 2015 · 5 comments
@saurabhRTR

Consider the following -

       uid term_number order_id
68 1001190           0  1985608
69 1001190           0  2052320
70 1001190           0  2089064
71 1001190           1  2125056
72 1001190           2  2275108
73 1001190           2  2296768
74 1001190           2  2343148
75 1001190           3  2474898
76 1001190           4  2676880
77 1001190           5  2718370
78 1001190           6       NA
79 1001190           7  3109466
80 1001190           7  3132486


mydf %.% 
group_by(uid, term_number) %.% 
summarize(n_distinct(order_id))


Source: local data frame [8 x 3]
Groups: uid

      uid term_number n_distinct(order_id)
1 1001190           0                    3
2 1001190           1                    1
3 1001190           2                    3
4 1001190           3                    1
5 1001190           4                    1
6 1001190           5                    1
7 1001190           6                    1
8 1001190           7                    2

This reports 1 distinct order for term 6 where it should be 0, because n_distinct() does not ignore NAs. That behaviour makes sense in some cases, so ideally a flag to ignore NAs would be useful.

PS: most databases will ignore NULLs by default; R's distinct will not.

Thanks,

@joranE
Contributor

joranE commented Apr 1, 2015

What makes this particularly problematic is that while this works fine:

x <- c(1,2,3,NA)
> n_distinct(x)
[1] 4
> n_distinct(x[!is.na(x)])
[1] 3

...this does not:

> mtcars %>% group_by(cyl) %>% summarise(val = n_distinct(carb[!is.na(carb)]))
Error: Input to n_distinct() must be a single variable name from the data set

which makes this rather impossible to "work around" ourselves.

@romainfrancois
Member

For a workaround, you can do e.g.

n_distinct_no_na <- function(x) n_distinct(x[!is.na(x)])
mtcars %>% group_by(cyl) %>% summarise(val = n_distinct_no_na(carb))

@saurabhRTR
Author

Thanks for the suggestion, Romain. I tried that on my dataset above, but got:

mydf %.% 
group_by(uid, term_number) %.% 
summarize(n_distinct_no_na(order_id))

 Error: upper value must be greater than lower value 
10 stop(structure(list(message = "upper value must be greater than lower value", 
    call = NULL, cppstack = NULL), .Names = c("message", "call", 
"cppstack"), class = c("std::range_error", "C++Error", "error", 
"condition"))) 
9 n_distinct(x[!is.na(x)]) 
8 n_distinct_no_na(NA_integer_) 
7 summarise_impl(.data, named_dots(...), environment()) 
6 summarise.tbl_df(`__prev`, n_distinct_no_na(order_id)) 
5 summarize(`__prev`, n_distinct_no_na(order_id)) 
4 eval(expr, envir, enclos) 
3 eval(new_call, e) 
2 chain_q(list(substitute(lhs), substitute(rhs)), env = parent.frame()) 
1

@hadley hadley added this to the 0.5 milestone May 19, 2015
@hadley
Member

hadley commented May 19, 2015

I think n_distinct() should gain an argument na.rm = FALSE. If you want to drop NAs, you'd call n_distinct(x, na.rm = TRUE).
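
A minimal sketch of the proposed semantics in base R (my_n_distinct is a hypothetical stand-in for illustration, not dplyr's actual implementation):

```r
# Hypothetical sketch: count unique values, optionally dropping NAs first.
my_n_distinct <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  length(unique(x))
}

x <- c(1985608, NA, 1985608, 2052320)
my_n_distinct(x)                # 3 (NA counts as a distinct value)
my_n_distinct(x, na.rm = TRUE)  # 2
```

With na.rm = TRUE, a group containing only NA would correctly yield 0, which is the behaviour the original report asks for.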

@saurabhRTR
Author

Thanks! Now I'm able to port more code over from sqldf to dplyr.

@lock lock bot locked as resolved and limited conversation to collaborators Jun 9, 2018