
Ignore nulls for n_distinct() #1052

Closed
saurabhRTR opened this issue Mar 31, 2015 · 5 comments
@saurabhRTR

Consider the following -

       uid term_number order_id
68 1001190           0  1985608
69 1001190           0  2052320
70 1001190           0  2089064
71 1001190           1  2125056
72 1001190           2  2275108
73 1001190           2  2296768
74 1001190           2  2343148
75 1001190           3  2474898
76 1001190           4  2676880
77 1001190           5  2718370
78 1001190           6       NA
79 1001190           7  3109466
80 1001190           7  3132486


mydf %.% 
group_by(uid, term_number) %.% 
summarize(n_distinct(order_id))


Source: local data frame [8 x 3]
Groups: uid

      uid term_number n_distinct(order_id)
1 1001190           0                    3
2 1001190           1                    1
3 1001190           2                    3
4 1001190           3                    1
5 1001190           4                    1
6 1001190           5                    1
7 1001190           6                    1
8 1001190           7                    2

This reports 1 distinct order for term 6 where it should be 0, because n_distinct() does not ignore NAs. That behaviour makes sense in some cases, so ideally a flag to ignore NAs would be useful.

PS: most databases will ignore NULLs by default; R's distinct will not.

Thanks,

@joranE
Contributor

joranE commented Apr 1, 2015

What makes this particularly problematic is that while this works fine:

x <- c(1,2,3,NA)
> n_distinct(x)
[1] 4
> n_distinct(x[!is.na(x)])
[1] 3

...this does not:

> mtcars %>% group_by(cyl) %>% summarise(val = n_distinct(carb[!is.na(carb)]))
Error: Input to n_distinct() must be a single variable name from the data set

which makes this rather impossible to "work around" ourselves.

@romainfrancois
Member

For a workaround, you can do e.g.

n_distinct_no_na <- function(x) n_distinct(x[!is.na(x)])
mtcars %>% group_by(cyl) %>% summarise(val = n_distinct_no_na(carb))

@saurabhRTR
Author

Thanks for the suggestion, Romain. I tried that on my dataset above, but got:

mydf %.% 
group_by(uid, term_number) %.% 
summarize(n_distinct_no_na(order_id))

 Error: upper value must be greater than lower value 
10 stop(structure(list(message = "upper value must be greater than lower value", 
    call = NULL, cppstack = NULL), .Names = c("message", "call", 
"cppstack"), class = c("std::range_error", "C++Error", "error", 
"condition"))) 
9 n_distinct(x[!is.na(x)]) 
8 n_distinct_no_na(NA_integer_) 
7 summarise_impl(.data, named_dots(...), environment()) 
6 summarise.tbl_df(`__prev`, n_distinct_no_na(order_id)) 
5 summarize(`__prev`, n_distinct_no_na(order_id)) 
4 eval(expr, envir, enclos) 
3 eval(new_call, e) 
2 chain_q(list(substitute(lhs), substitute(rhs)), env = parent.frame()) 
1

@hadley hadley added this to the 0.5 milestone May 19, 2015
@hadley
Member

hadley commented May 19, 2015

I think n_distinct() should gain an argument na.rm = FALSE. If you want to drop NAs, you'd call n_distinct(x, na.rm = TRUE).
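
A minimal sketch of the proposed semantics in base R (my_n_distinct is a hypothetical stand-in for illustration, not dplyr's actual implementation):

```r
# Hypothetical sketch: count unique values, optionally dropping NAs first.
my_n_distinct <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  length(unique(x))
}

x <- c(1985608, NA, 1985608, 2052320)
my_n_distinct(x)                # 3 (NA counts as a distinct value)
my_n_distinct(x, na.rm = TRUE)  # 2
```

With na.rm = TRUE, a group containing only NA would correctly yield 0, which is the behaviour the original report asks for.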

@saurabhRTR
Author

Thanks! Now I'm able to port more code over from sqldf to dplyr.

@lock lock bot locked as resolved and limited conversation to collaborators Jun 9, 2018