Some operations are much slower than base #295

wch · 2020-09-18T02:16:04Z

For example:

library(fs)
system.time({
  for (i in 1:1000) file_exists('DESCRIPTION')
})
#>    user  system elapsed 
#>   0.523   0.016   0.539

system.time({
  for (i in 1:1000) file.exists('DESCRIPTION')
})
#>    user  system elapsed 
#>   0.002   0.001   0.003

Similarly, replacing an instance of dir_copy with equivalent base code results in a significant speedup. For this test, the time spent in the copying code went from 370ms to 10ms. This is a real-world case where the speed impact is noticeable. rstudio/sass#53

It is harder to get the code right with base functions, so it would be nice to use fs.

Sorry this isn't more specific about exactly which operations are slow. However, when I profiled it, it looked like a lot of time for file_exists was spent dealing with tibbles. (Note that it had to be profiled on R 3.6, since the profiler in R 4.0 has a bug and doesn't generate useful data.)

jimhester · 2020-09-18T13:10:23Z

Performance was not a specific goal of fs.

In general, .Call() has non-negligible overhead compared to .Primitive() or .Internal(), so in many cases it may be impossible to fully match base performance.

In this particular case, the performance difference is exacerbated because fs::file_exists() internally uses fs::file_info(), which queries much more information than just the file's existence, whereas file.exists() only checks whether the file exists.
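To illustrate the distinction with base R alone (so it does not depend on fs internals), here is a minimal sketch contrasting a full metadata query with a bare existence check. The temporary file and the loop count are arbitrary choices for illustration, and timings vary by machine, so none are shown.

```r
# Create a throwaway file so the example is self-contained.
path <- tempfile(fileext = ".txt")
writeLines("hello", path)

# A metadata query returns a data frame (size, mode, timestamps, ...),
# from which existence can be derived as a by-product.
info <- file.info(path)
exists_via_info <- !is.na(info$size)

# An existence check returns only a logical value.
exists_direct <- file.exists(path)

# Time both approaches over the same number of iterations.
t_info <- system.time(for (i in 1:1000) !is.na(file.info(path)$size))
t_exists <- system.time(for (i in 1:1000) file.exists(path))
print(c(file.info = t_info[["elapsed"]], file.exists = t_exists[["elapsed"]]))

# Clean up the temporary file.
unlink(path)
```

Both approaches give the same answer; the difference is only in how much work is done to get it.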

Tibble construction does seem to cause a non-negligible slowdown; perhaps it would be worth adding an option so it could be disabled for performance-critical code.

jimhester · 2020-09-18T14:14:44Z

In the file_exists() case it was pretty straightforward to have a comparable implementation to file.exists(), so I have done that now.

library(fs)
system.time({
  for (i in 1:1000) file_exists('DESCRIPTION')
})
#>    user  system elapsed 
#>   0.018   0.002   0.020
system.time({
  for (i in 1:1000) file.exists('DESCRIPTION')
})
#>    user  system elapsed 
#>   0.002   0.001   0.003

Created on 2020-09-18 by the reprex package (v0.3.0)

I don't think it is really possible to get much faster; as I said, we start running into .Call() overhead.

jimhester · 2020-09-18T14:19:47Z

dir_copy() is more complicated, and there is no direct equivalent among the base functions. However, in the benchmark you mention, you run it 30 times and it takes 800ms total, so each run is still only ~25ms. Basically, any non-trivial operation in the R interpreter usually takes on the order of 1ms, so I think we would have to implement all of the dir_copy logic in C to get much speed benefit.
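For reference, a rough base-R sketch of a recursive directory copy might look like the following. This is not the code used in rstudio/sass#53 and is not a drop-in replacement for dir_copy(); `copy_dir_base` is a hypothetical helper name, and the sketch skips the edge cases (overwriting, links, permissions, error reporting) that make the base version harder to get right.

```r
# Hypothetical helper: copy the contents of `from` into `to` using base R only.
copy_dir_base <- function(from, to) {
  stopifnot(dir.exists(from))
  # Create the destination if needed; stay quiet if it already exists.
  dir.create(to, recursive = TRUE, showWarnings = FALSE)
  # Top-level entries (files and subdirectories, including dotfiles),
  # excluding "." and "..".
  entries <- list.files(from, all.files = TRUE, no.. = TRUE, full.names = TRUE)
  # With recursive = TRUE, subdirectories are copied along with their contents.
  ok <- file.copy(entries, to, recursive = TRUE)
  invisible(all(ok))
}

# Usage on a small temporary directory tree.
src <- tempfile("src")
dst <- tempfile("dst")
dir.create(file.path(src, "sub"), recursive = TRUE)
writeLines("a", file.path(src, "a.txt"))
writeLines("b", file.path(src, "sub", "b.txt"))

copy_dir_base(src, dst)
print(list.files(dst, recursive = TRUE))
```

Even this short version makes several R-level calls per copy, which is consistent with the point that per-call interpreter overhead adds up.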

jimhester added a commit that referenced this issue Sep 18, 2020

Allow disabling use of tibbles

67ec165

Construction of tibbles can add non-negligible overhead to some operations, so we now support an option to disable them for performance critical code. Part of #295

jimhester closed this as completed in 6813735 Sep 18, 2020

lorenzwalthert mentioned this issue Apr 15, 2021

file_exists() does not perform path expansion (anymore) #325

Closed
