Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

slice_sample() errors if n bigger than number of rows #6185

Closed
swannyy14 opened this issue Feb 11, 2022 · 21 comments · Fixed by #6351
Closed

slice_sample() errors if n bigger than number of rows #6185

swannyy14 opened this issue Feb 11, 2022 · 21 comments · Fixed by #6351
Labels
bug an unexpected problem or unintended behavior rows ↕️ Operations on rows: filter(), slice(), arrange()

Comments

@swannyy14
Copy link

In the man page for slice functions, the description for the argument n states:

If n is greater than the number of rows in the group (or prop > 1), the result will be silently truncated to the group size. If the proportion of a group size does not yield an integer number of rows, the absolute value of prop*nrow(.data) is rounded down.

The output of slice_sample used to be the same data.frame (with different ordering) if n is higher than the number of rows, but it is now returning an error.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

tmp <- data.frame(x = c(rep('a',6), rep('b',4)), y = 1:10)
print(tmp)
#>    x  y
#> 1  a  1
#> 2  a  2
#> 3  a  3
#> 4  a  4
#> 5  a  5
#> 6  a  6
#> 7  b  7
#> 8  b  8
#> 9  b  9
#> 10 b 10

tmp %>% slice_sample(n = 15)
#> Error in `slice_sample()`:
#> ! Problem while computing indices.
#> Caused by error in `sample.int()`:
#> ! cannot take a sample larger than the population when 'replace = FALSE'

tmp %>% group_by(x) %>% slice_sample(n = 15)
#> Error in `slice_sample()`:
#> ! Problem while computing indices.
#> ℹ The error occurred in group 1: x = "a".
#> Caused by error in `sample.int()`:
#> ! cannot take a sample larger than the population when 'replace = FALSE'

Created on 2022-02-11 by the reprex package (v2.0.1)

@romanzenka
Copy link

romanzenka commented Feb 15, 2022

I had to write my own wrapper around slice_sample. I use it to show 5 example values to user (or less), and this update broke the functionality. I had to add extra logic to handle sampling of fewer samples than present, and also handling sampling 0 samples from data set of 0 rows...

One note: if you go from 1.0.7 -> 1.0.8, I would not expect breaking changes in the API. This is a breaking change, as my code that previously worked okay now ceased working.

@swannyy14
Copy link
Author

I have the same problem. I have a package that randomly samples n rows from each grouping of a data.frame, and right now the package is failing to build do to this change. I used this function due to its handling of n being greater than the number of rows from each group in the data.frame. I hope they fix this soon.

@swannyy14 swannyy14 changed the title slice_sample() now returns error when "n" is bigger than the number of existing rows. slice_sample() now returns error when "n" is bigger than the number of existing rows. Feb 16, 2022
@fegue
Copy link

fegue commented Feb 17, 2022

This seems to be introduced on purpose because there is a test specifically for this error. The documentation however was not updated.
So my question would be: should we wait for this regression to be fixed or is it time to re-write the code?

@swannyy14
Copy link
Author

swannyy14 commented Feb 17, 2022

That's interesting. I also see the test change... In that case, it seems like the change in behavior is on purpose and I'm strongly considering re-writing my own code. I hope that the description for the argument is updated so that there is no ambiguity.

@huftis
Copy link

huftis commented Mar 22, 2022

slice_sample() now also results in an error when n = Inf`:

> slice_sample(iris, n = Inf)
Error in `slice_sample()`:
! Problem while computing indices.
Caused by error in `sample.int()`:
! vector size cannot be infinite
Run `rlang::last_error()` to see where the error occurred.

This used to work, and would permute the rows, which was very useful. But we can use prop = 1 instead.

BTW, values of prop > 1 now also results in an error message (this is not consistent with the documentation).

@willschulz
Copy link

Definitely need to retain the option to sample "n, or however many rows exist, whichever is smaller". Potentially good to change it from silently doing this behavior without warning the user, but a lot of us have code that depends on the function not throwing an error in this situation

@hadley
Copy link
Member

hadley commented Apr 15, 2022

Regression introduced in #6172 — the PR doesn't explicitly mention this change and there's no NEWS bullet so I suspect it's an omission, not a deliberate change.

@hadley hadley changed the title slice_sample() now returns error when "n" is bigger than the number of existing rows. slice_sample() errors if "n" bigger than number of rows Apr 15, 2022
@hadley hadley changed the title slice_sample() errors if "n" bigger than number of rows slice_sample() errors if n bigger than number of rows Apr 15, 2022
@hadley hadley added bug an unexpected problem or unintended behavior rows ↕️ Operations on rows: filter(), slice(), arrange() labels Apr 15, 2022
@dtatarak
Copy link

dtatarak commented Jul 6, 2022

Has this been addressed? I did not have this issue until recently, but now it is happening for me as well.

@hadley
Copy link
Member

hadley commented Jul 21, 2022

Reading this thread through more carefully, I think this is mainly a documentation issue — if you want to sample more rows than exist, you do need to set replace = TRUE as the error suggests.

Minimal reprex:

library(dplyr, warn.conflicts = FALSE)

df <- data.frame(x = 1:3)

df %>% slice_sample(n = 4)
#> Error in `slice_sample()`:
#> ! Problem while computing indices.
#> Caused by error:
#> ! Can't sample without replacement using a size that is larger than the
#>   number of rows in the data.
#> ℹ 4 rows were requested in the sample.
#> ℹ 3 rows are present in the data.
#> ℹ Set `replace = TRUE` to sample with replacement.
df %>% slice_sample(n = 4, replace = TRUE)
#>   x
#> 1 1
#> 2 1
#> 3 3
#> 4 1

Created on 2022-07-21 by the reprex package (v2.0.1)

@hadley hadley added documentation and removed bug an unexpected problem or unintended behavior labels Jul 21, 2022
@dtatarak
Copy link

dtatarak commented Jul 21, 2022 via email

@hadley
Copy link
Member

hadley commented Jul 21, 2022

@dtatarak that sometimes sampling with replacement and sometimes sampling without replacement seems ill-founded to me.

@eutwt
Copy link
Contributor

eutwt commented Jul 21, 2022

The replacement is still controlled by the replace argument in the behavior being proposed by this issue. The desired behavior is that the slice size is automatically capped at n() like other slice() functions (and like the previous behavior), not that it auto-switches to replace = TRUE to produce an output with more than n() rows.

I think it would make sense for all the slice() functions to return a data frame (or group) of the same size when you supply an argument indicating a size larger than the number of rows. slice_sample() is the odd one out here.

library(dplyr, warn.conflicts = FALSE)
#> Warning: package 'dplyr' was built under R version 4.1.2
df <- data.frame(a = 1)

df %>% 
  slice(1:2)
#>   a
#> 1 1

df %>% 
  slice_head(n = 2)
#>   a
#> 1 1

df %>% 
  slice_tail(n = 2)
#>   a
#> 1 1

df %>% 
  slice_min(n = 2, order_by = a)
#>   a
#> 1 1

df %>% 
  slice_max(n = 2, order_by = a)
#>   a
#> 1 1

df %>% 
  slice_sample(n = 2)
#> Error in `slice_sample()`:
#> ! Problem while computing indices.
#> Caused by error in `sample.int()`:
#> ! cannot take a sample larger than the population when 'replace = FALSE'

Created on 2022-07-21 by the reprex package (v2.0.1)

@dtatarak
Copy link

dtatarak commented Jul 21, 2022 via email

@lawalter
Copy link

Similar to @dtatarak, I also had (and still have) a use case for sampling more rows than exist in a table. I wrote a function for reducing partial duplicates down to one record, and if the records were "tied" I used slice_sample(n = 1) to pick one from each group randomly. If there were no duplicates, the function still operated fine. But slice_sample() changes broke the function and I now must use an if else wrapper to only slice if nrow() > 0.

@leejs-abv
Copy link

The way n is documented in slice functions right now, this behavior in slice_sample looks like a bug.

If n is greater than the number of rows in the group (or prop > 1), the result will be silently truncated to the group size.

Maybe clarifying this behavior for slice_sample could minimize the confusion?
Personally, I would like slice_sample to return the way dtatarak has described - to return the original data frame with different ordering. It would also be consistent with other slice functions as eutwt has described.

@hadley
Copy link
Member

hadley commented Jul 21, 2022

@eutwt ah that makes sense. Do you want to have a go at a PR?

@hadley hadley added bug an unexpected problem or unintended behavior and removed documentation labels Jul 21, 2022
@kraf2149
Copy link

Has this bug been resolved in 1.0.10? I am still receiving the aforementioned error when using slice_sample() for resampling to a fixed count in instances where some groups have less observations than specified by n.

@DavisVaughan
Copy link
Member

No, but it is fixed in the development version.

@erydit
Copy link

erydit commented Nov 29, 2022

This bug is almost a year old. When the fix would reach CRAN ? Is there any release schedule or something?

@DavisVaughan
Copy link
Member

@erydit https://www.tidyverse.org/blog/2022/11/dplyr-1-1-0-is-coming-soon/

@eutwt
Copy link
Contributor

eutwt commented Nov 29, 2022

@erydit FWIW if you need this right away and can't install the dev package, this is pretty easy to "patch" by redefining dplyr:::sample_int with the code below. This is of course a bad idea in general, but may be useful as a temporary fix for development while waiting for the new release.

I should probably mention I don't work at Rstudio and this hacky patch is not recommended by them I'd guess :)

sample_int <- function(n, size, replace = FALSE, wt = NULL, call = caller_env()) {
  if (!replace && n < size) {
   size <- n
  }

  if (size == 0L) {
    integer(0)
  } else {
    sample.int(n, size, prob = wt, replace = replace)
  }
}

assignInNamespace('sample_int', sample_int, 'dplyr')

haugedal pushed a commit to Rapporteket/rapwhale that referenced this issue Feb 16, 2023
`slice_sample(..., n = Inf)` fungerer ikkje lenger, til trass for at
dokumentasjonen seier at det skal det. Må derfor heller bruka argumentet
`prop = 1` (som strengt tatt òg er meir logisk).

Sjå relatert feilrapport:
tidyverse/dplyr#6185
haugedal pushed a commit to Rapporteket/rapwhale that referenced this issue Feb 22, 2023
`slice_sample(..., n = Inf)` fungerer ikkje lenger, til trass for at
dokumentasjonen seier at det skal det. Må derfor heller bruka argumentet
`prop = 1` (som strengt tatt òg er meir logisk).

Sjå relatert feilrapport:
tidyverse/dplyr#6185
haugedal pushed a commit to Rapporteket/rapwhale that referenced this issue Feb 27, 2023
`slice_sample(..., n = Inf)` fungerer ikkje lenger, til trass for at
dokumentasjonen seier at det skal det. Må derfor heller bruka argumentet
`prop = 1` (som strengt tatt òg er meir logisk).

Sjå relatert feilrapport:
tidyverse/dplyr#6185
haugedal pushed a commit to Rapporteket/rapwhale that referenced this issue Feb 27, 2023
`slice_sample(..., n = Inf)` fungerer ikkje lenger, til trass for at
dokumentasjonen seier at det skal det. Må derfor heller bruka argumentet
`prop = 1` (som strengt tatt òg er meir logisk).

Sjå relatert feilrapport:
tidyverse/dplyr#6185
haugedal pushed a commit to Rapporteket/rapwhale that referenced this issue Feb 27, 2023
`slice_sample(..., n = Inf)` fungerer ikkje lenger, til trass for at
dokumentasjonen seier at det skal det. Må derfor heller bruka argumentet
`prop = 1` (som strengt tatt òg er meir logisk).

Sjå relatert feilrapport:
tidyverse/dplyr#6185
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug an unexpected problem or unintended behavior rows ↕️ Operations on rows: filter(), slice(), arrange()
Projects
None yet
Development

Successfully merging a pull request may close this issue.