
scaling up profileApply #112

Closed
brownag opened this issue Jan 18, 2020 · 7 comments

@brownag (Member) commented Jan 18, 2020

As has long been known, profileApply doesn't scale well.

Below is a first stab at speeding it up; benchmark times are on an old MacBook.

Splitting the operation into several runs of profileApply (producing more, smaller lists) significantly improves performance in the examples below with 1,000 and 10,000 random profiles.
The chunk size is hardcoded at 100 profiles, which seems to help in the thousands-to-tens-of-thousands range. As expected, there is some slight overhead relative to straight profileApply when using only 100 profiles. Not sure what the upper end of this will be, but presumably as the number of chunks gets large, the effective throughput (profiles processed per second) will decrease.

This is run in a single thread, but this layout lends itself reasonably well to adding a parallel computation option.

chunkApply <- function(object, FUN, simplify = FALSE, ...) {
  n <- length(object)
  # assign each profile index to a chunk of roughly 100 profiles
  grp <- sort(1:n %% max(1, round(n / 100))) + 1
  # run profileApply on each chunk, then concatenate the per-chunk results
  res <- do.call('c', lapply(split(1:n, grp), function(idx) {
    profileApply(object[idx, ], FUN, simplify = simplify, ...)
  }))
  # restore per-profile names (c() prefixes them with the chunk group names)
  names(res) <- profile_id(object)
  return(res)
}

library(aqp)

# generate 10,000 random profiles and promote to SPC
foo <- do.call('rbind', lapply(as.list(1:10000), random_profile))
depths(foo) <- id ~ top + bottom

# a "simple" function that returns a "complex" result
simpleFunction <- function(p) data.frame(horizons(p)[2,2:3])

c1 <- system.time(chunkApply(foo[1:100,], simpleFunction))
p1 <- system.time(profileApply(foo[1:100,], simpleFunction, simplify = FALSE))
c2 <- system.time(chunkApply(foo[1:1000,], simpleFunction))
p2 <- system.time(profileApply(foo[1:1000,], simpleFunction, simplify = FALSE))
c3 <- system.time(chunkApply(foo, simpleFunction))
p3 <- system.time(profileApply(foo, simpleFunction, simplify = FALSE))

100 PROFILES
c1 (chunkApply):    user 0.254    system 0.000   elapsed 0.267
p1 (profileApply):  user 0.241    system 0.001   elapsed 0.255

1,000 PROFILES
c2 (chunkApply):    user 2.335    system 0.028   elapsed 2.689
p2 (profileApply):  user 5.057    system 0.024   elapsed 5.596

10,000 PROFILES
c3 (chunkApply):    user 29.734   system 0.236   elapsed 33.839
p3 (profileApply):  user 455.586  system 1.294   elapsed 517.049

@dylanbeaudette (Member)

Excellent, this is almost a drop-in upgrade to profileApply. This change, combined with ideas from #111, would be a nice improvement.

I'll have to think about / research how we can automatically invoke parallelism without bringing in a bunch of additional dependencies. Also, enforcing a unique horizon- or site-level ID via #111 would solve the problem of non-deterministic re-ordering of results returned by parallel processing.
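
A minimal sketch of what that might look like, using only the base-R parallel package (mclapply, so forking on Unix-alikes) and re-keying results by profile ID so output order is deterministic regardless of how chunks are scheduled. The function name and arguments here are illustrative only, not a proposed API:

library(aqp)
library(parallel)

# hypothetical parallel variant of chunkApply; no dependencies beyond base R
# note: mclapply forks, so mc.cores > 1 is not supported on Windows
parallelChunkApply <- function(object, FUN, simplify = FALSE, cores = 2, ...) {
  n <- length(object)
  grp <- sort(1:n %% max(1, round(n / 100))) + 1
  chunks <- mclapply(split(1:n, grp), function(idx) {
    r <- profileApply(object[idx, ], FUN, simplify = simplify, ...)
    # name results within the chunk so the global order can be restored later
    names(r) <- profile_id(object[idx, ])
    r
  }, mc.cores = cores)
  # unname the outer list so c() keeps the per-profile names intact
  res <- do.call('c', unname(chunks))
  # deterministic ordering: index by the original profile IDs
  res[profile_id(object)]
}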

@brownag (Member, Author) commented Jan 18, 2020

I wondered about an argument that takes an optional parallel lapply-like function -- e.g. furrr::future_map() or future.apply::future_lapply(). The default could be base R lapply. In my testing so far, I have seen limited benefit from using these fancy options over good ol' lapply.

And yes, this would combine well with the frameify option in #111.
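
A rough sketch of that argument: a pluggable lapply-like backend that defaults to base lapply (names and details here are illustrative only, not the final interface):

# illustrative only: chunkApply with a user-supplied lapply-like backend
chunkApply2 <- function(object, FUN, simplify = FALSE, APPLY.FUN = lapply, ...) {
  n <- length(object)
  grp <- sort(1:n %% max(1, round(n / 100))) + 1
  res <- do.call('c', APPLY.FUN(split(1:n, grp), function(idx) {
    profileApply(object[idx, ], FUN, simplify = simplify, ...)
  }))
  names(res) <- profile_id(object)
  return(res)
}

# default: plain single-threaded lapply
# chunkApply2(foo, simpleFunction)

# swap in a parallel backend, e.g.:
# future::plan(future::multisession)
# chunkApply2(foo, simpleFunction, APPLY.FUN = future.apply::future_lapply)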

@brownag (Member, Author) commented Jan 18, 2020

chunkApply vs. profileApply scaling comparison:

library(aqp)

foo <- do.call('rbind', lapply(as.list(1:100000), random_profile))
depths(foo) <- id ~ top + bottom

simpleFunction <- function(p) data.frame(horizons(p)[2,2:3])

# benchmark at increasing numbers of profiles: 100-900, 1,000-10,000, 10,000-100,000
idx <- c(seq(100, 900, 100), seq(1000, 10000, 1000), seq(10000, 100000, 10000))
idx.sub <- idx[idx <= 100000]

# time profileApply at each step
res <- do.call('rbind', lapply(as.list(idx.sub), function(i) {
  system.time(profileApply(foo[1:i, ], simpleFunction, simplify = FALSE))
}))

# time chunkApply at each step
res2 <- do.call('rbind', lapply(as.list(idx.sub), function(i) {
  system.time(chunkApply(foo[1:i, ], simpleFunction))
}))

plot(res[,3]~idx.sub, type="l", lwd=2, main="Time to *Apply n Profiles",
     xlab="Number of Profiles",ylab="Time, seconds")
lines(res2[,3]~idx.sub, col="GREEN", lwd=2)
legend('topleft', legend = c("profileApply","chunkApply"), lty=1, lwd=2, col=c("BLACK","GREEN"))

@dylanbeaudette (Member)

Sweet! Much better than invoking parallel voodoo. It looks like this is a drop-in addition to profileApply that wouldn't require any additional work by the operator, other than optional manual adjustment of the chunk size.

@dylanbeaudette (Member)

Pending a couple more tests, this is just about ready to go. Testing on 10k profiles (see the usage sketch after the timings):

  • 60 seconds using a single chunk
  • 14 seconds using chunk.size=100
  • 17 seconds using chunk.size=1000
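
For context, a usage sketch under the assumption that the chunk size is exposed through a chunk.size argument, as in the timings above (the released argument name and default may differ):

# assumes a chunk.size argument as referenced in the timings above
system.time(profileApply(foo[1:10000, ], simpleFunction, simplify = FALSE, chunk.size = 100))
system.time(profileApply(foo[1:10000, ], simpleFunction, simplify = FALSE, chunk.size = 1000))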

@dylanbeaudette (Member)

Note that the profiling done above includes the additional overhead of [-subsetting SPC objects, although it is clearly not a large portion of the total time.
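
A quick, hypothetical way to gauge that overhead in isolation, reusing the foo SPC and chunking scheme from above but skipping the user function entirely:

# time only the per-chunk [-subsetting, with no user FUN applied
n <- 10000
grp <- sort(1:n %% max(1, round(n / 100))) + 1
system.time(invisible(lapply(split(1:n, grp), function(idx) foo[idx, ])))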

@brownag (Member, Author) commented Jan 21, 2020

Sahweet. With b4c171f I think my last dangling items on this issue are resolved and it can be closed.

I had made a note about the sort option needing to be FALSE, but didn't get around to changing it over the weekend. And good catch with stringsAsFactors -- that one always gets me.

brownag closed this as completed Jan 21, 2020