Required data.table arguments for the filtering verbs #17

Closed
leungi opened this issue Jul 6, 2019 · 9 comments

leungi commented Jul 6, 2019

Reprex below.

Goal: keep filter_on() consistent with data.table's on-based filtering, both with and without a key/index set on DT (see ex1 and ex2 below).

Proposal: add a mult argument to filter_sd(), or alter filter_on() to honour the key/index set on DT; leave filter() as is, since it's imported from dplyr.

# install latest version
# devtools::install_github('asardaes/table.express')

library(dplyr)
library(table.express)
library(data.table)

## Create a data table
DT <- data.table(V1 = rep(c(1L, 2L), 5)[-10],
                 V2 = 1:9,
                 V3 = c(0.5, 1.0, 1.5),
                 V4 = rep(LETTERS[1:3], 3))

## without index
DT[c("B", "C"), on = "V4", mult = "first"]
#>    V1 V2  V3 V4
#> 1:  2  2 1.0  B
#> 2:  1  3 1.5  C

# ex1 : works
DT %>%
  start_expr() %>% 
  filter_on(V4 = c("B", "C"), mult = "first") %>%
  end_expr() %>% {
    invisible(print(.))
  }
#>    V1 V2  V3 V4
#> 1:  2  2 1.0  B
#> 2:  1  3 1.5  C

## with index
setkey(DT, V4)
setindex(DT, V4)

DT[c("B", "C"), mult = "first"]
#>    V1 V2  V3 V4
#> 1:  2  2 1.0  B
#> 2:  1  3 1.5  C

# ex2: fails
DT %>%
  start_expr() %>% 
  filter_on(c("B", "C"), mult = "first") %>%
  end_expr() %>% {
    invisible(print(.))
  }
#> All arguments in '...' must be named.

DT %>%
  start_expr() %>% 
  filter(c("B", "C")) %>%
  end_expr() %>% {
    invisible(print(.))
  }
#>    V1 V2  V3 V4
#> 1:  2  2 1.0  B
#> 2:  1  5 1.0  B
#> 3:  2  8 1.0  B
#> 4:  1  3 1.5  C
#> 5:  2  6 1.5  C
#> 6:  1  9 1.5  C

DT %>%
  start_expr() %>% 
  filter_sd(.SDcols = 'V4', c("B", "C")) %>%
  end_expr() %>% {
    invisible(print(.))
  }
#>    V1 V2  V3 V4
#> 1:  2  2 1.0  B
#> 2:  1  5 1.0  B
#> 3:  2  8 1.0  B
#> 4:  1  3 1.5  C
#> 5:  2  6 1.5  C
#> 6:  1  9 1.5  C

Created on 2019-07-06 by the reprex package (v0.2.1)

Owner

asardaes commented Jul 6, 2019

It would be possible to add parameters to filter because the generic has ..., so R allows it. However, I would prefer changing filter_on to allow empty names. It would be consistent with data.table in that if you specify on, you either provide all names, or none.
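
For reference, a minimal sketch of the plain data.table behaviour that filter_on would be mirroring, assuming only the data.table package: when on is given, the values in i can be left unnamed (they are then matched to the on columns by position) or named after the column. The table is a cut-down copy of the reprex data, and the variable names are purely illustrative.

```r
library(data.table)

# Cut-down version of the reprex table; no key or index is set
DT <- data.table(V2 = 1:9, V4 = rep(LETTERS[1:3], 3))

# Unnamed value in i: matched by position to the column listed in `on`
unnamed <- DT[.(c("B", "C")), on = "V4", mult = "first"]

# Named value in i: matched by name to the column listed in `on`
named <- DT[.(V4 = c("B", "C")), on = "V4", mult = "first"]

print(unnamed)
print(named)
```

Both calls should select the same rows, namely the first "B" row and the first "C" row.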

Author

leungi commented Jul 6, 2019

Noted.

Altering filter_on is ideal, so that learners can apply the same logic back to native data.table if needed.

Author

leungi commented Jul 6, 2019

Related to the above, regarding nomatch in data.table filtering:

filter_sd seems to set nomatch = NA by default (see scenario 1 below). Could nomatch be added as an argument, so that there's a table.express equivalent of nomatch = 0?

This may be redundant if the intent is to use filter_on to mimic data.table.

# install latest version
# devtools::install_github('asardaes/table.express')

library(dplyr)
library(table.express)
library(data.table)

## Create a data table
DT <- data.table(V1 = rep(c(1L, 2L), 5)[-10],
                 V2 = 1:9,
                 V3 = c(0.5, 1.0, 1.5),
                 V4 = rep(LETTERS[1:3], 3))

setkey(DT, V4)

# scenario 1
DT[c("A", "D"), on = "V4", nomatch = NA]
#>    V1 V2  V3 V4
#> 1:  1  1 0.5  A
#> 2:  2  4 0.5  A
#> 3:  1  7 0.5  A
#> 4: NA NA  NA  D

DT %>%
  start_expr() %>% 
  filter_on(V4 = c("A", "D"), nomatch = NA) %>%
  end_expr() %>% {
    invisible(print(.))
  }
#>    V1 V2  V3 V4
#> 1:  1  1 0.5  A
#> 2:  2  4 0.5  A
#> 3:  1  7 0.5  A
#> 4: NA NA  NA  D

DT %>%
  start_expr() %>% 
  filter_sd(.SDcols = 'V4', c("A", "D")) %>%
  end_expr() %>% {
    invisible(print(.))
  }
#>    V1 V2  V3 V4
#> 1:  1  1 0.5  A
#> 2:  2  4 0.5  A
#> 3:  1  7 0.5  A
#> 4: NA NA  NA  D

# scenario 2
DT[c("A", "D"), on = "V4", nomatch = 0]
#>    V1 V2  V3 V4
#> 1:  1  1 0.5  A
#> 2:  2  4 0.5  A
#> 3:  1  7 0.5  A

DT %>%
  start_expr() %>% 
  filter_on(V4 = c("A", "D"), nomatch = 0) %>%
  end_expr() %>% {
    invisible(print(.))
  }
#>    V1 V2  V3 V4
#> 1:  1  1 0.5  A
#> 2:  2  4 0.5  A
#> 3:  1  7 0.5  A

Created on 2019-07-06 by the reprex package (v0.2.1)

asardaes added a commit that referenced this issue Jul 6, 2019
Owner

asardaes commented Jul 6, 2019

Well, yes and no. Both filter and filter_sd simply add expressions to i without caring about nomatch. The fact that a single expression ends up being recognized by data.table wasn't entirely planned. I can add the nomatch parameter to both, though it might be preferable to support it only in filter_on, so that whenever you read your code you can find the filters where keys were used, maybe implicitly, by searching for filter_on.

You could still write something like this if you really wanted to use one of those verbs:

DT %>%
    start_expr %>%
    filter(c("A", "D")) %>%
    frame_append(nomatch = NULL) %>%
    end_expr

As a side note: try to stay away from nomatch = 0, see Rdatatable/data.table#857.
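
To make the nomatch difference concrete outside of table.express, here is a small sketch in plain data.table. It assumes a data.table version recent enough to accept nomatch = NULL; the table is a cut-down copy of the reprex data and the variable names are illustrative.

```r
library(data.table)

# Cut-down version of the reprex table
DT <- data.table(V2 = 1:9, V4 = rep(LETTERS[1:3], 3))

# nomatch = NA (the default) keeps a row of NAs for the unmatched "D"
with_na <- DT[.(c("A", "D")), on = "V4", nomatch = NA]

# nomatch = NULL drops values of i that have no match in DT
matched_only <- DT[.(c("A", "D")), on = "V4", nomatch = NULL]

print(with_na)
print(matched_only)
```

The first result should have four rows (three for "A" plus one NA row for "D"), the second only the three "A" rows.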

Author

leungi commented Jul 6, 2019

Noted; thanks for reference.

Looks like data.table's nomatch will end up looking like tidyr::replace_na.

I agree that making filter_on the go-to verb for matching data.table's filtering is ideal.

Another use case for which the current filter_on seems to have no equivalent: handling multiple keys/indices.

setkey(DT, V4, V1)

DT[.(c("B", "C"), 1)]

# works, by using data.table syntax
DT %>%
  start_expr() %>% 
  filter(.("C", 1)) %>%
  end_expr() %>% {
    invisible(print(.))
  }

Owner

asardaes commented Jul 6, 2019

That could work. With the new filter_on you'd simply write filter_on(c("B", "C"), 1); otherwise, filter(.(c("B", "C"), 1)).
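
As a point of comparison in plain data.table, the sketch below shows that the same two-column lookup can be done either through a key or through on without any key. It assumes only the data.table package; the names keyed, by_key and by_on are illustrative, and 1L is used instead of 1 simply to match the integer type of V1.

```r
library(data.table)

# Cut-down version of the reprex table
DT <- data.table(V1 = rep(c(1L, 2L), 5)[-10],
                 V2 = 1:9,
                 V4 = rep(LETTERS[1:3], 3))

# Keyed lookup: values in .() are matched to the key columns V4, V1 in order
keyed <- copy(DT)
setkey(keyed, V4, V1)
by_key <- keyed[.(c("B", "C"), 1L)]

# Same lookup on the unkeyed table, naming the columns through `on`
by_on <- DT[.(c("B", "C"), 1L), on = c("V4", "V1")]

print(by_key)
print(by_on)
```

In both cases the single value for V1 is recycled, so the lookup asks for the pairs ("B", 1) and ("C", 1).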

Author

leungi commented Jul 6, 2019

Aye 👍

Apologies for continuing to expand the issue; another one to consider: the which argument.

Closest workaround so far: create a row-index column first.

# with dual indices
# returns the matching rows indices of original DT
DT[.(c("B", "C"), 1), which = TRUE]
#> [1] 4 7 8

DT %>%
  start_expr() %>% 
  mutate(n = .I) %>% 
  chain %>% 
  filter(.(c("B", "C"), 1)) %>% 
  end_expr() %>% {
    invisible(print(.))
  }
#>    V1 V2  V3 V4 n
#> 1:  1  5 1.0  B 4
#> 2:  1  3 1.5  C 7
#> 3:  1  9 1.5  C 8

Owner

asardaes commented Jul 6, 2019

Oh, that one did slip my mind. Yeah, I'll add that.
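
For context, a short sketch of what which = TRUE does in plain data.table, assuming only the data.table package and a cut-down copy of the reprex data: it returns the row numbers of the matches in the keyed table, and subsetting by those numbers gives the same rows as the lookup itself. The variable names are illustrative.

```r
library(data.table)

# Cut-down version of the reprex table, keyed as in the thread
DT <- data.table(V1 = rep(c(1L, 2L), 5)[-10],
                 V2 = 1:9,
                 V4 = rep(LETTERS[1:3], 3))
setkey(DT, V4, V1)

# Row numbers (in the keyed, i.e. sorted, table) of the matching rows
idx <- DT[.(c("B", "C"), 1L), which = TRUE]

# Subsetting by those row numbers gives the same rows as the lookup
rows_via_idx  <- DT[idx]
rows_via_join <- DT[.(c("B", "C"), 1L)]

print(idx)
```

Note that the row numbers refer to the table after setkey has reordered it, not to the original insertion order.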

@asardaes asardaes changed the title Feature request: mult argument in filter_sd() Required data.table arguments for the filtering verbs Jul 6, 2019
asardaes added a commit that referenced this issue Jul 6, 2019
Author

leungi commented Jul 7, 2019

👏

# install latest version
# devtools::install_github('asardaes/table.express')

suppressMessages(library(dplyr))
suppressMessages(library(table.express))
suppressMessages(library(data.table))
#> Warning: package 'data.table' was built under R version 3.5.3

## Create a data table
DT <- data.table(V1 = rep(c(1L, 2L), 5)[-10],
                 V2 = 1:9,
                 V3 = c(0.5, 1.0, 1.5),
                 V4 = rep(LETTERS[1:3], 3))

setkey(DT, V4, V1)

# multi-key
DT[.(c("B", "C"), 1)]
#>    V1 V2  V3 V4
#> 1:  1  5 1.0  B
#> 2:  1  3 1.5  C
#> 3:  1  9 1.5  C

# using filter()
DT %>%
  start_expr() %>% 
  filter(.(c("B", "C"), 1)) %>% 
  end_expr() %>% {
    invisible(print(.))
  }
#>    V1 V2  V3 V4
#> 1:  1  5 1.0  B
#> 2:  1  3 1.5  C
#> 3:  1  9 1.5  C

# using filter_on()
DT %>%
  start_expr() %>% 
  filter_on(c("B", "C"), 1) %>%
  end_expr() %>% {
    invisible(print(.))
  }
#>    V1 V2  V3 V4
#> 1:  1  5 1.0  B
#> 2:  1  3 1.5  C
#> 3:  1  9 1.5  C

# which = TRUE returns only the indices of the matching rows
DT[.(c("B", "C"), 1), on = .(V4, V1), which = TRUE]
#> [1] 4 7 8

# which argument in data.table filter
DT %>%
  start_expr() %>% 
  filter_on(c("B", "C"), 1, which = TRUE) %>%
  end_expr() %>% {
    invisible(print(.))
  }
#> [1] 4 7 8

Created on 2019-07-07 by the reprex package (v0.2.1)

@leungi leungi closed this as completed Jul 7, 2019