Windowed rank functions need NA handling #774

gvfarns · 2014-11-14T03:03:23Z

Each of the windowed rank functions could really use an NA handling parameter, similar to what is done in base R's rank() function with the parameter na.last.

At present it appears that NA is always last (highest rank), which is equivalent to na.last=TRUE and there is no obvious way to change this.

Of particular importance is the na.last=NA case, where any missing data is not included in the ranking (it gets a rank of NA). For me, at least, exclusion of missing values from rankings is the most common desired case.
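For reference, a small sketch of how base R's rank() treats missing values under each na.last setting. The vector x is made up for illustration. Note that in base R the setting that leaves an NA rank in place is spelled "keep", while na.last = NA drops the missing values from the result; the request here is for the former behaviour.

```r
# Illustrative input (not from the issue): one missing value in position 2.
x <- c(10, NA, 30, 20)

r_last  <- rank(x, na.last = TRUE)    # NA ranked after everything: 1 4 3 2
r_first <- rank(x, na.last = FALSE)   # NA ranked before everything: 2 1 4 3
r_keep  <- rank(x, na.last = "keep")  # NA stays NA, length preserved: 1 NA 3 2
r_drop  <- rank(x, na.last = NA)      # NA removed, length shrinks: 1 3 2

print(r_keep)
```

Only the "keep" variant both excludes missing values from the ranking and preserves length(output) == length(input).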

hadley · 2014-11-14T13:01:24Z

I can't see how na.last = NA could work for ranking functions without breaking the interface that length(output) == length(input).

Could you provide a use case for why you'd want to rank NA's higher than every other value?

gvfarns · 2014-11-14T15:19:29Z

For na.last=NA we don't remove observations, we just give them a ranking of NA. For example, when I learned that the dplyr function ntile() would not handle NA the way I needed, my workaround was the following function:

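ntile_na <- function(x, n) {
  # Rank only the non-missing values; missing inputs keep a rank of NA,
  # so the output has the same length as the input.
  notna <- !is.na(x)
  out <- rep(NA_real_, length(x))
  out[notna] <- ntile(x[notna], n)
  out
}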

The other rank functions are amenable to a similar workaround.
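A minimal sketch of how that workaround might generalize. The helper name with_na_kept and the stand-in min-rank function built on base rank() are hypothetical, chosen so the example runs without dplyr; they are not part of any package.

```r
# Hypothetical helper: wraps any ranking function so that it only sees the
# non-missing values and missing inputs come back as NA.
with_na_kept <- function(rank_fun) {
  function(x, ...) {
    keep <- !is.na(x)
    out <- rep(NA_real_, length(x))
    out[keep] <- rank_fun(x[keep], ...)
    out
  }
}

# Stand-in for a min-rank style function, using base R only.
min_rank_na <- with_na_kept(function(x) rank(x, ties.method = "min"))

res <- min_rank_na(c(1, 2, NA, 1, 0, NA))
print(res)  # 2 4 NA 2 1 NA
```

The same wrapper could in principle be applied to any function that maps a vector of values to a vector of ranks of the same length.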

With respect to use cases, it's a question of how to interpret NA values, which the researcher knows but as developers, we don't. Here are some examples:

1. If X is a set of times to completion, then NA could mean it never completed. It makes sense to group NA with the highest decile.
2. If X is the score a student got on a test, then NA could mean they did not take it. It might make sense to group those students with those who got the lowest score.
3. If X is a characteristic and some individuals did not report it, it makes sense to give those individuals a ranking of NA, meaning "unknown". My current work involves an analysis of hedge fund returns, flows, and other characteristics. Hedge funds occasionally do not report their flows, so we can't rank them relative to their peers in those months. They may report one characteristic but not another, so we don't necessarily want to drop the whole observation from the ranking process.

We can't reasonably impose our view that the data should be in the form of case 1 since as developers we don't know what the data means nor the correct way to interpret NA for that data.

hadley · 2014-11-14T15:24:09Z

Oh, I see - I think that na.last = NA should be the standard behaviour for all ranking functions. You can get the other behaviours by trivially modifying the data (e.g. to use -Inf or Inf).
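A small sketch of that data-modification approach, using base rank() with ties.method = "min" as a stand-in for a min-rank function (an assumption made so the example runs without dplyr). The input is the same vector used later in this thread.

```r
x <- c(1, 2, NA, 1, 0, NA)

# Treat missing values as larger than everything: they share the top rank.
na_high <- rank(replace(x, is.na(x), Inf), ties.method = "min")
print(na_high)  # 2 4 5 2 1 5

# Treat missing values as smaller than everything: they share the bottom rank.
na_low <- rank(replace(x, is.na(x), -Inf), ties.method = "min")
print(na_low)   # 4 6 1 4 3 1
```

This only applies directly to numeric data; other types would need a different sentinel value.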

gvfarns · 2014-11-14T15:26:28Z

I agree completely. However, you might consider keeping the default as it is now to be compatible with the rank() function (and therefore many people's expectations), which has default behavior that matches that of the dplyr ranking functions. I'll leave it up to you to make that call.

hadley · 2014-11-18T21:47:38Z

@romainfrancois does the hybrid evaluator also have implementations of these functions that need to be updated?

romainfrancois · 2014-12-10T16:52:35Z

Yes.

        handlers[ Rf_install( "min_rank" )       ] = rank_impl_prototype<dplyr::internal::min_rank_increment> ;
        handlers[ Rf_install( "percent_rank" )   ] = rank_impl_prototype<dplyr::internal::percent_rank_increment> ;
        handlers[ Rf_install( "dense_rank" )     ] = rank_impl_prototype<dplyr::internal::dense_rank_increment> ;
        handlers[ Rf_install( "cume_dist" )      ] = rank_impl_prototype<dplyr::internal::cume_dist_increment> ;

        handlers[ Rf_install( "ntile" )          ] = ntile_prototype ;

The nice thing is that the first four all share the same implementation, rank_impl_prototype, so we would only need to handle it once there.

romainfrancois · 2014-12-10T18:27:04Z

However, for all of the functions handled by the rank_impl_prototype template, hybrid evaluation only occurs if the number of arguments is one, so for "correct" handling of any extra argument, it can happen at the R layer.

Anyhow, what should the interface be for min_rank etc ... ?

min_rank and dense_rank could be fairly easy to modify. But perhaps cume_dist and percent_rank are more of a problem, as we need to know how many NAs there are.

It looks as if the R versions of the functions did propagate the NA:

> x <- c(1,2,NA,1,0, NA)
> percent_rank(x)
[1] 0.2 0.6  NA 0.2 0.0  NA
> dense_rank(x)
[1]  2  3 NA  2  1 NA
> min_rank(x)
[1]  2  4 NA  2  1 NA
> cume_dist(x)
[1] 0.5000000 0.6666667        NA 0.5000000 0.1666667        NA

> ( data.frame( x = x ) %>% mutate( out = percent_rank(x) ) ) $out
[1] 0.2 0.6 0.8 0.2 0.0 0.8
> ( data.frame( x = x ) %>% mutate( out = dense_rank(x) ) ) $out
[1] 2 3 4 2 1 4
> ( data.frame( x = x ) %>% mutate( out = min_rank(x) ) ) $out
[1] 2 4 5 2 1 5
> ( data.frame( x = x ) %>% mutate( out = cume_dist(x) ) ) $out
[1] 0.5000000 0.6666667 1.0000000 0.5000000 0.1666667 1.0000000

So should I mimic those results, or do we need some ways to control NA, ...

For cume_dist and percent_rank, the R functions use the total number of elements in the vector, not the number of non-NA values as was hinted by @gvfarns above. What the R functions are doing would be easier to implement internally, but are these the right results, @hadley?

hadley · 2014-12-10T18:29:47Z

Oh oops, I'd say that's a bug in my R implementation. The denominator should be the number of non-NAs, not the length.

We don't need options to control the behaviour, just ensure that NAs in the input are NA in the output.
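A minimal sketch of what a non-NA denominator could look like, written with base rank() only. The function names percent_rank_sketch and cume_dist_sketch are hypothetical and this is not claimed to be the code that was eventually committed.

```r
# Hypothetical sketches: the denominator counts only non-missing values,
# and na.last = "keep" makes missing inputs come back as NA.
percent_rank_sketch <- function(x) {
  n <- sum(!is.na(x))
  (rank(x, ties.method = "min", na.last = "keep") - 1) / (n - 1)
}

cume_dist_sketch <- function(x) {
  n <- sum(!is.na(x))
  rank(x, ties.method = "max", na.last = "keep") / n
}

x <- c(1, 2, NA, 1, 0, NA)
pr <- percent_rank_sketch(x)
cd <- cume_dist_sketch(x)
print(pr)  # 1/3, 1, NA, 1/3, 0, NA
print(cd)  # 0.75, 1, NA, 0.75, 0.25, NA
```

With four non-missing values, the largest one now reaches 1 in both functions, instead of 0.6 and 0.667 as in the length-based output shown earlier. A vector with a single non-missing value would divide by zero in percent_rank_sketch, which this sketch does not guard against.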

romainfrancois · 2014-12-12T17:45:38Z

Did the first set of functions: min_rank, percent_rank, dense_rank and cume_dist. Also I think I fixed the R implementation of them.

I guess next is ntile

hadley added a commit that referenced this issue Nov 18, 2014

Ranking functions should preserve missing values #774

bc02fe0

hadley added the feature a feature request or enhancement label Nov 20, 2014

hadley added this to the 0.3.1 milestone Nov 20, 2014

hadley assigned romainfrancois Nov 20, 2014

romainfrancois added a commit that referenced this issue Dec 12, 2014

correct handling of NA for cume_dist, percent_rank, min_rank and dens…

f5e2752

…e_rank. #774

romainfrancois added a commit that referenced this issue Dec 13, 2014

internal ntile correctly deals with NA. #774

a3f4912

romainfrancois closed this as completed in a26b1d7 Dec 14, 2014

lock bot locked as resolved and limited conversation to collaborators Jun 10, 2018
