Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Windowed rank functions need NA handling #774

Closed
gvfarns opened this issue Nov 14, 2014 · 9 comments
Closed

Windowed rank functions need NA handling #774

gvfarns opened this issue Nov 14, 2014 · 9 comments
Assignees
Labels
feature a feature request or enhancement
Milestone

Comments

@gvfarns
Copy link

gvfarns commented Nov 14, 2014

Each of the windowed rank functions could really use an NA handling parameter, similar to what is done in base R's rank() function with the parameter na.last.

At present it appears that NA is always last (highest rank), which is equivalent to na.last=TRUE and there is no obvious way to change this.

Of particular importance is the na.last=NA case, where any missing data is not included in the ranking (it gets a rank of NA). For me, at least, exclusion of missing values from rankings is the most common desired case.

@hadley
Copy link
Member

hadley commented Nov 14, 2014

I can't see how na.last = NA could work for ranking functions without breaking the interface that length(output) == length(input).

Could you provide a use case for why you'd want to rank NA's higher than every other value?

@gvfarns
Copy link
Author

gvfarns commented Nov 14, 2014

For na.last=NA we don't remove observations, we just give them a ranking of NA. For example, my workaround when I learned the dplyr function ntile() would not handle NA as I needed it to was the following function

ntile_na <- function(x,n)
{
  notna <- !is.na(x)
  out <- rep(NA_real_,length(x))
  out[notna] <- ntile(x[notna],n)
  return(out)
}

The other rank functions are amenable to a similar workaround.

With respect to use cases, it's a question of how to interpret NA values, which the researcher knows but as developers, we don't. Here are some examples:

  1. If X is a set of times to completion, then NA could mean it never completed. Makes sense to group NA with the highest decile.
  2. if X is the score a student got on a test, then NA could mean they did not take it. Might make sense to group those students with those who got the lowest score.
  3. if X is a characteristic and some individuals did not report it, it makes sense not give those individuals a ranking of NA, meaning "unknown". My current work involves an analysis of hedge fund returns and flows and other characteristics. Hedge funds occasionally do not report their flows, so we can't rank them relative to their peers in those months. They may report one characteristic but not another so we don't necessarily want to drop the whole observation from the ranking process.

We can't reasonably impose our view that the data should be in the form of case 1 since as developers we don't know what the data means nor the correct way to interpret NA for that data.

@hadley
Copy link
Member

hadley commented Nov 14, 2014

Oh, I see - I think that na.last = NA should be the standard behaviour for all ranking functions. You can get the other behaviours by trivially modifying the data (e.g. to use -Inf or Inf).

@gvfarns
Copy link
Author

gvfarns commented Nov 14, 2014

I agree completely. However, you might consider keeping the default as it is now to be compatible with the rank() function (and therefore many people's expectations), which has default behavior that matches that of the dplyr ranking functions. I'll leave it up to you to make that call.

@hadley
Copy link
Member

hadley commented Nov 18, 2014

@romainfrancois does the hybrid evaluator also have implementations of these functions that need to be updated?

@hadley hadley added the feature a feature request or enhancement label Nov 20, 2014
@hadley hadley added this to the 0.3.1 milestone Nov 20, 2014
@romainfrancois
Copy link
Member

Yes.

        handlers[ Rf_install( "min_rank" )       ] = rank_impl_prototype<dplyr::internal::min_rank_increment> ;
        handlers[ Rf_install( "percent_rank" )   ] = rank_impl_prototype<dplyr::internal::percent_rank_increment> ;
        handlers[ Rf_install( "dense_rank" )     ] = rank_impl_prototype<dplyr::internal::dense_rank_increment> ;
        handlers[ Rf_install( "cume_dist" )      ] = rank_impl_prototype<dplyr::internal::cume_dist_increment> ;

        handlers[ Rf_install( "ntile" )          ] = ntile_prototype ;

The nice thing is that they all share the same implementation, rank_impl_prototype so we would only need to handle it once.

@romainfrancois
Copy link
Member

However, for all of the functions handled by the rank_impl_prototype template, hybrid evaluation only occurs if the number of arguments is one, so for "correct" handling of any extra argument, it can happen at the R layer.

Anyhow, what should the interface be for min_rank etc ... ?

min_rank and dense_rank could be sort of easy to modify. Bu perhaps cume_dist and percent_rank are more of a problem as we sort of need to know how many NA there are.

It looks as if the R versions of the functions did propagate the NA:

> x <- c(1,2,NA,1,0, NA)
> percent_rank(x)
[1] 0.2 0.6  NA 0.2 0.0  NA
> dense_rank(x)
[1]  2  3 NA  2  1 NA
> min_rank(x)
[1]  2  4 NA  2  1 NA
> cume_dist(x)
[1] 0.5000000 0.6666667        NA 0.5000000 0.1666667        NA

> ( data.frame( x = x ) %>% mutate( out = percent_rank(x) ) ) $out
[1] 0.2 0.6 0.8 0.2 0.0 0.8
> ( data.frame( x = x ) %>% mutate( out = dense_rank(x) ) ) $out
[1] 2 3 4 2 1 4
> ( data.frame( x = x ) %>% mutate( out = min_rank(x) ) ) $out
[1] 2 4 5 2 1 5
> ( data.frame( x = x ) %>% mutate( out = cume_dist(x) ) ) $out
[1] 0.5000000 0.6666667 1.0000000 0.5000000 0.1666667 1.0000000

So should I mimic those results, or do we need some ways to control NA, ...

For cume_dist and percent_rank the R functions use the total number of elements from the vector and not the number of non NA as was hinted by @gvfarns above. What the R functions are doing would be easier to implement internally, but is this the right results @hadley ?

@hadley
Copy link
Member

hadley commented Dec 10, 2014

Oh oops, I'd say that's a bug in my R implementation. The denominator should be the number of non-NAs, not the length.

We don't need options to control behave, just ensure that NAs in input are NA in output

@romainfrancois
Copy link
Member

Did the first set of functions: min_rank, percent_rank, dense_rank and cume_dist. Also I think I fixed the R implementation of them.

I guess next is ntile

@lock lock bot locked as resolved and limited conversation to collaborators Jun 10, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
feature a feature request or enhancement
Projects
None yet
Development

No branches or pull requests

3 participants