Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some operations are much slower than base #295

Closed
wch opened this issue Sep 18, 2020 · 3 comments
Closed

Some operations are much slower than base #295

wch opened this issue Sep 18, 2020 · 3 comments

Comments

@wch
Copy link
Member

wch commented Sep 18, 2020

For example:

library(fs)
system.time({
  for (i in 1:1000) file_exists('DESCRIPTION')
})
#>    user  system elapsed 
#>   0.523   0.016   0.539

system.time({
  for (i in 1:1000) file.exists('DESCRIPTION')
})
#>    user  system elapsed 
#>   0.002   0.001   0.003

Similarly, replacing an instance of dir_copy with equivalent base code results in a significant speedup. For this test, the time spent in the copying code went from 370ms to 10ms. This is a real-world case where the speed impact is noticeable. rstudio/sass#53

It is harder to get the code right with base functions, so it would be nice to use fs.

Sorry this isn't more specific about exactly which operations are slow. However, when I profiled it, it looked like a lot of time for file_exists was spent dealing with tibbles. (Note that it had to be profiled on R 3.6, since the profiler in R 4.0 has a bug and doesn't generate useful data.)

@jimhester
Copy link
Member

Performance was not a specific goal of fs.

In general .Call() has non-negligible overhead compared to .Primitive() or .Internal(), so in many cases it may be impossible to fully reach the same performance.

In this particular case the performance difference is exacerbated because the implementation of fs::file_exists() internally uses fs::file_info() which queries much more information than just the file's existence, whereas file.exists() just checks for file existence.

tibble construction does seem to cause non-negligible slowdown, perhaps it would be worth adding an option so it could be disabled for performance critical code.

jimhester added a commit that referenced this issue Sep 18, 2020
Construction of tibbles can add non-negligible overhead to some
operations, so we now support an option to disable them for performance
critical code.

Part of #295
@jimhester
Copy link
Member

In the file_exists() case it was pretty straightforward to have a comparable implementation to file.exists(), so I have done that now.

library(fs)
system.time({
  for (i in 1:1000) file_exists('DESCRIPTION')
})
#>    user  system elapsed 
#>   0.018   0.002   0.020
system.time({
  for (i in 1:1000) file.exists('DESCRIPTION')
})
#>    user  system elapsed 
#>   0.002   0.001   0.003

Created on 2020-09-18 by the reprex package (v0.3.0)

I don't think it is really possible to get much faster, as I said we start running into .Call() overhead.

@jimhester
Copy link
Member

dir_copy() is more complicated and there is no direct equivalent to base functions. However in the benchmark you mention you run it for 30 times and it takes 800ms total, so each run is still only ~25ms total. Basically doing any non-trivial operation with the R interpreter usually takes on the order of 1ms, so we would have to implement all the dir_copy logic in C to get much speed benefit I think.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants