`slice_sample()` errors if `n` bigger than number of rows #6185

swannyy14 · 2022-02-11T22:07:45Z

In the man page for slice functions, the description for the argument n states:

If n is greater than the number of rows in the group (or prop > 1), the result will be silently truncated to the group size. If the proportion of a group size does not yield an integer number of rows, the absolute value of prop*nrow(.data) is rounded down.

The output of slice_sample used to be the same data.frame (with different ordering) if n is higher than the number of rows, but it is now returning an error.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

tmp <- data.frame(x = c(rep('a',6), rep('b',4)), y = 1:10)
print(tmp)
#>    x  y
#> 1  a  1
#> 2  a  2
#> 3  a  3
#> 4  a  4
#> 5  a  5
#> 6  a  6
#> 7  b  7
#> 8  b  8
#> 9  b  9
#> 10 b 10

tmp %>% slice_sample(n = 15)
#> Error in `slice_sample()`:
#> ! Problem while computing indices.
#> Caused by error in `sample.int()`:
#> ! cannot take a sample larger than the population when 'replace = FALSE'

tmp %>% group_by(x) %>% slice_sample(n = 15)
#> Error in `slice_sample()`:
#> ! Problem while computing indices.
#> ℹ The error occurred in group 1: x = "a".
#> Caused by error in `sample.int()`:
#> ! cannot take a sample larger than the population when 'replace = FALSE'

^{Created on 2022-02-11 by the reprex package (v2.0.1)}

The text was updated successfully, but these errors were encountered:

romanzenka · 2022-02-15T19:23:28Z

I had to write my own wrapper around slice_sample. I use it to show 5 example values to user (or less), and this update broke the functionality. I had to add extra logic to handle sampling of fewer samples than present, and also handling sampling 0 samples from data set of 0 rows...

One note: if you go from 1.0.7 -> 1.0.8, I would not expect breaking changes in the API. This is a breaking change, as my code that previously worked okay now ceased working.

swannyy14 · 2022-02-16T15:24:05Z

I have the same problem. I have a package that randomly samples n rows from each grouping of a data.frame, and right now the package is failing to build do to this change. I used this function due to its handling of n being greater than the number of rows from each group in the data.frame. I hope they fix this soon.

fegue · 2022-02-17T07:50:36Z

This seems to be introduced on purpose because there is a test specifically for this error. The documentation however was not updated.
So my question would be: should we wait for this regression to be fixed or is it time to re-write the code?

swannyy14 · 2022-02-17T15:23:43Z

That's interesting. I also see the test change... In that case, it seems like the change in behavior is on purpose and I'm strongly considering re-writing my own code. I hope that the description for the argument is updated so that there is no ambiguity.

huftis · 2022-03-22T14:36:18Z

slice_sample() now also results in an error when n = Inf`:

> slice_sample(iris, n = Inf)
Error in `slice_sample()`:
! Problem while computing indices.
Caused by error in `sample.int()`:
! vector size cannot be infinite
Run `rlang::last_error()` to see where the error occurred.

This used to work, and would permute the rows, which was very useful. But we can use prop = 1 instead.

BTW, values of prop > 1 now also results in an error message (this is not consistent with the documentation).

willschulz · 2022-04-01T20:43:37Z

Definitely need to retain the option to sample "n, or however many rows exist, whichever is smaller". Potentially good to change it from silently doing this behavior without warning the user, but a lot of us have code that depends on the function not throwing an error in this situation

hadley · 2022-04-15T23:22:13Z

Regression introduced in #6172 — the PR doesn't explicitly mention this change and there's no NEWS bullet so I suspect it's an omission, not a deliberate change.

dtatarak · 2022-07-06T15:38:51Z

Has this been addressed? I did not have this issue until recently, but now it is happening for me as well.

hadley · 2022-07-21T16:21:30Z

Reading this thread through more carefully, I think this is mainly a documentation issue — if you want to sample more rows than exist, you do need to set replace = TRUE as the error suggests.

Minimal reprex:

library(dplyr, warn.conflicts = FALSE)

df <- data.frame(x = 1:3)

df %>% slice_sample(n = 4)
#> Error in `slice_sample()`:
#> ! Problem while computing indices.
#> Caused by error:
#> ! Can't sample without replacement using a size that is larger than the
#>   number of rows in the data.
#> ℹ 4 rows were requested in the sample.
#> ℹ 3 rows are present in the data.
#> ℹ Set `replace = TRUE` to sample with replacement.
df %>% slice_sample(n = 4, replace = TRUE)
#>   x
#> 1 1
#> 2 1
#> 3 3
#> 4 1

^{Created on 2022-07-21 by the reprex package (v2.0.1)}

dtatarak · 2022-07-21T16:34:25Z

The issue is: I don't want to sample with replacement in the samples where size > n. If I take a random sample of 100 from samples of: 94 125 200 80 I want to get back: 94 100 100 80 With no replacement in the larger samples. This used to be how slice_sample() behaved, but it seems to have changed for some reason. Best, David Tatarakis

…

On Thu, Jul 21, 2022 at 12:21 PM Hadley Wickham ***@***.***> wrote: Reading this thread through more carefully, I think this is mainly a documentation issue — if you want to sample more rows than exist, you do need to set replace = TRUE as the error suggests. Minimal reprex: library(dplyr, warn.conflicts = FALSE) df <- data.frame(x = 1:3) df %>% slice_sample(n = 4) #> Error in `slice_sample()`: #> ! Problem while computing indices. #> Caused by error: #> ! Can't sample without replacement using a size that is larger than the #> number of rows in the data. #> ℹ 4 rows were requested in the sample. #> ℹ 3 rows are present in the data. #> ℹ Set `replace = TRUE` to sample with replacement. df %>% slice_sample(n = 4, replace = TRUE) #> x #> 1 1 #> 2 1 #> 3 3 #> 4 1 Created on 2022-07-21 by the reprex package <https://reprex.tidyverse.org> (v2.0.1) — Reply to this email directly, view it on GitHub <#6185 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AKOM2PGMCFLND7VEVAVXQ3LVVF2JXANCNFSM5OFO6JNA> . You are receiving this because you commented.Message ID: ***@***.***>

hadley · 2022-07-21T16:42:32Z

@dtatarak that sometimes sampling with replacement and sometimes sampling without replacement seems ill-founded to me.

eutwt · 2022-07-21T16:50:52Z

The replacement is still controlled by the replace argument in the behavior being proposed by this issue. The desired behavior is that the slice size is automatically capped at n() like other slice() functions (and like the previous behavior), not that it auto-switches to replace = TRUE to produce an output with more than n() rows.

I think it would make sense for all the slice() functions to return a data frame (or group) of the same size when you supply an argument indicating a size larger than the number of rows. slice_sample() is the odd one out here.

library(dplyr, warn.conflicts = FALSE)
#> Warning: package 'dplyr' was built under R version 4.1.2
df <- data.frame(a = 1)

df %>% 
  slice(1:2)
#>   a
#> 1 1

df %>% 
  slice_head(n = 2)
#>   a
#> 1 1

df %>% 
  slice_tail(n = 2)
#>   a
#> 1 1

df %>% 
  slice_min(n = 2, order_by = a)
#>   a
#> 1 1

df %>% 
  slice_max(n = 2, order_by = a)
#>   a
#> 1 1

df %>% 
  slice_sample(n = 2)
#> Error in `slice_sample()`:
#> ! Problem while computing indices.
#> Caused by error in `sample.int()`:
#> ! cannot take a sample larger than the population when 'replace = FALSE'

^{Created on 2022-07-21 by the reprex package (v2.0.1)}

dtatarak · 2022-07-21T16:59:12Z

I agree. Definitely not something I do generally, but I have a specific use-case for it in my work. It's useful to have a function that says "Give me a random sample that's this size without replacement. If there aren't enough, just give me all of them." I can certainly do that manually, but slice_sample() was a very convenient way of doing that within a larger dplyr statement. Best, David Tatarakis

…

On Thu, Jul 21, 2022 at 12:42 PM Hadley Wickham ***@***.***> wrote: @dtatarak <https://github.com/dtatarak> that sometimes sampling with replacement and sometimes sampling without replacement seems ill-founded to me. — Reply to this email directly, view it on GitHub <#6185 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AKOM2PGD7S2M2LE6YRC5YL3VVF4YJANCNFSM5OFO6JNA> . You are receiving this because you were mentioned.Message ID: ***@***.***>

lawalter · 2022-07-21T17:29:11Z

Similar to @dtatarak, I also had (and still have) a use case for sampling more rows than exist in a table. I wrote a function for reducing partial duplicates down to one record, and if the records were "tied" I used slice_sample(n = 1) to pick one from each group randomly. If there were no duplicates, the function still operated fine. But slice_sample() changes broke the function and I now must use an if else wrapper to only slice if nrow() > 0.

leejs-abv · 2022-07-21T17:44:19Z

The way n is documented in slice functions right now, this behavior in slice_sample looks like a bug.

If n is greater than the number of rows in the group (or prop > 1), the result will be silently truncated to the group size.

Maybe clarifying this behavior for slice_sample could minimize the confusion?
Personally, I would like slice_sample to return the way dtatarak has described - to return the original data frame with different ordering. It would also be consistent with other slice functions as eutwt has described.

hadley · 2022-07-21T19:28:32Z

@eutwt ah that makes sense. Do you want to have a go at a PR?

…SE` (#6351) Fixes #6185

kraf2149 · 2022-10-13T18:26:42Z

Has this bug been resolved in 1.0.10? I am still receiving the aforementioned error when using slice_sample() for resampling to a fixed count in instances where some groups have less observations than specified by n.

DavisVaughan · 2022-10-13T19:00:43Z

No, but it is fixed in the development version.

erydit · 2022-11-29T12:04:20Z

This bug is almost a year old. When the fix would reach CRAN ? Is there any release schedule or something?

DavisVaughan · 2022-11-29T14:32:07Z

@erydit https://www.tidyverse.org/blog/2022/11/dplyr-1-1-0-is-coming-soon/

eutwt · 2022-11-29T14:49:33Z

@erydit FWIW if you need this right away and can't install the dev package, this is pretty easy to "patch" by redefining dplyr:::sample_int with the code below. This is of course a bad idea in general, but may be useful as a temporary fix for development while waiting for the new release.

I should probably mention I don't work at Rstudio and this hacky patch is not recommended by them I'd guess :)

sample_int <- function(n, size, replace = FALSE, wt = NULL, call = caller_env()) {
  if (!replace && n < size) {
   size <- n
  }

  if (size == 0L) {
    integer(0)
  } else {
    sample.int(n, size, prob = wt, replace = replace)
  }
}

assignInNamespace('sample_int', sample_int, 'dplyr')

`slice_sample(..., n = Inf)` fungerer ikkje lenger, til trass for at dokumentasjonen seier at det skal det. Må derfor heller bruka argumentet `prop = 1` (som strengt tatt òg er meir logisk). Sjå relatert feilrapport: tidyverse/dplyr#6185

swannyy14 changed the title ~~slice_sample() now returns error when "n" is bigger than the number of existing rows.~~ slice_sample() now returns error when "n" is bigger than the number of existing rows. Feb 16, 2022

hadley changed the title ~~slice_sample() now returns error when "n" is bigger than the number of existing rows.~~ slice_sample() errors if "n" bigger than number of rows Apr 15, 2022

hadley changed the title ~~slice_sample() errors if "n" bigger than number of rows~~ slice_sample() errors if n bigger than number of rows Apr 15, 2022

hadley added bug an unexpected problem or unintended behavior rows ↕️ Operations on rows: filter(), slice(), arrange() labels Apr 15, 2022

hadley added documentation and removed bug an unexpected problem or unintended behavior labels Jul 21, 2022

hadley added bug an unexpected problem or unintended behavior and removed documentation labels Jul 21, 2022

eutwt mentioned this issue Jul 21, 2022

Silently cap slice_sample() slice size at n() when replace = FALSE #6351

Merged

hadley closed this as completed in #6351 Jul 22, 2022

hadley pushed a commit that referenced this issue Jul 22, 2022

Silently cap slice_sample() slice size at n() when `replace = FAL…

e47787f

…SE` (#6351) Fixes #6185

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`slice_sample()` errors if `n` bigger than number of rows #6185

`slice_sample()` errors if `n` bigger than number of rows #6185

swannyy14 commented Feb 11, 2022

romanzenka commented Feb 15, 2022 •

edited

Loading

swannyy14 commented Feb 16, 2022

fegue commented Feb 17, 2022

swannyy14 commented Feb 17, 2022 •

edited

Loading

huftis commented Mar 22, 2022

willschulz commented Apr 1, 2022

hadley commented Apr 15, 2022 •

edited

Loading

dtatarak commented Jul 6, 2022

hadley commented Jul 21, 2022

dtatarak commented Jul 21, 2022 via email

hadley commented Jul 21, 2022

eutwt commented Jul 21, 2022 •

edited

Loading

dtatarak commented Jul 21, 2022 via email

lawalter commented Jul 21, 2022

leejs-abv commented Jul 21, 2022

hadley commented Jul 21, 2022

kraf2149 commented Oct 13, 2022

DavisVaughan commented Oct 13, 2022

erydit commented Nov 29, 2022

DavisVaughan commented Nov 29, 2022

eutwt commented Nov 29, 2022 •

edited

Loading

slice_sample() errors if n bigger than number of rows #6185

slice_sample() errors if n bigger than number of rows #6185

Comments

swannyy14 commented Feb 11, 2022

romanzenka commented Feb 15, 2022 • edited Loading

swannyy14 commented Feb 16, 2022

fegue commented Feb 17, 2022

swannyy14 commented Feb 17, 2022 • edited Loading

huftis commented Mar 22, 2022

willschulz commented Apr 1, 2022

hadley commented Apr 15, 2022 • edited Loading

dtatarak commented Jul 6, 2022

hadley commented Jul 21, 2022

dtatarak commented Jul 21, 2022 via email

hadley commented Jul 21, 2022

eutwt commented Jul 21, 2022 • edited Loading

dtatarak commented Jul 21, 2022 via email

lawalter commented Jul 21, 2022

leejs-abv commented Jul 21, 2022

hadley commented Jul 21, 2022

kraf2149 commented Oct 13, 2022

DavisVaughan commented Oct 13, 2022

erydit commented Nov 29, 2022

DavisVaughan commented Nov 29, 2022

eutwt commented Nov 29, 2022 • edited Loading

`slice_sample()` errors if `n` bigger than number of rows #6185

`slice_sample()` errors if `n` bigger than number of rows #6185

romanzenka commented Feb 15, 2022 •

edited

Loading

swannyy14 commented Feb 17, 2022 •

edited

Loading

hadley commented Apr 15, 2022 •

edited

Loading

eutwt commented Jul 21, 2022 •

edited

Loading

eutwt commented Nov 29, 2022 •

edited

Loading