optimizing various SoilProfileCollection methods #135

Closed
brownag opened this issue May 25, 2020 · 7 comments · Fixed by #136
brownag commented May 25, 2020

library(profvis)
library(aqp)
library(soilDB)

data("loafercreek")

# add a constant horizon-level attribute to use as the boundary variability
loafercreek$foo <- rep(1, nrow(loafercreek))

# make some test data (10,000 simulated profiles) -- permute_profile() is defined in the @mollic branch
newdat <- permute_profile(loafercreek[1], n = 10000, boundary.attr = 'foo', min.thickness = 1)

brownag commented May 25, 2020

[i, ] -- profile (i) subsetting

test.fun <- function(n) {
  # draw n random profile indices between 1 and length(newdat); values may repeat
  idx <- round(runif(n, 1, length(newdat)))
  return(newdat[idx, ])
}

microbenchmark::microbenchmark(test.fun(10),
                               test.fun(100),
                               test.fun(1000),
                               test.fun(10000))
Unit: milliseconds
            expr      min       lq     mean   median       uq      max neval
    test.fun(10) 142.3052 147.5753 180.3458 164.2151 175.1167 557.4064   100
   test.fun(100) 146.5336 150.3979 179.9773 166.7710 172.6415 541.9834   100
  test.fun(1000) 169.5357 173.5430 203.8030 190.9973 195.3322 561.2541   100
 test.fun(10000) 263.1484 288.2051 319.6335 300.3561 316.3072 735.9368   100

profvis({test.fun(100000)})

(profvis flame graph for test.fun(100000))


brownag commented May 25, 2020

Here we see notable improvements (relative to above) from optimizing various parts of heavily used methods in the above two commits (on the ncss-tech/aqp@mollic branch) -- just by being careful to re-use variables and to avoid unnecessary calls to unique where possible.

Unit: milliseconds
            expr       min        lq      mean    median        uq      max neval
    test.fun(10)  14.04452  15.02876  31.91002  18.14201  37.22526 356.3600   100
   test.fun(100)  16.70464  18.60879  52.88639  38.11054  44.03619 445.7565   100
  test.fun(1000)  35.08107  40.16366  96.29470  58.46718  75.18114 638.2684   100
 test.fun(10000) 116.34838 156.54107 219.95638 180.80176 231.97105 639.7117   100

And with profiling (using the same call as in the previous post: generating 100,000 ids between 1 and 10,000 for an n = 10,000 SPC) we complete in approximately half the time. This is not an entirely fair comparison, because the previous [ method did not ensure that the input values of i were unique [or sorted]. Note that the baseline "overhead" associated with the smaller-n calls is considerably lower.

(profvis flame graph for test.fun(100000), after optimization)

These revisions do not rely on any new package dependencies. There are additional efficiencies that could be gained using fastmatch::fmatch and %fin%, and/or related facilities in data.table -- however, I think it is important that we have a stable, base R version of these tools that is as fast as it can be.
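For illustration, this is the kind of lookup where fastmatch tends to pay off (a standalone sketch with made-up vectors, not aqp code):

library(fastmatch)

# a large lookup table of IDs and a vector of IDs to locate
tbl <- as.character(seq_len(1e5))
x   <- sample(tbl, 1e4, replace = TRUE)

microbenchmark::microbenchmark(
  match(x, tbl),     # base R: hashes the table on every call
  fmatch(x, tbl),    # fastmatch: hashes tbl once and caches it
  x %in% tbl,
  x %fin% tbl,       # fastmatch analogue of %in%
  times = 20
)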

All tests currently pass. A major change I made is to how length(SPC) is calculated -- it seems like this should work, as long as the SPC is not corrupted. Since length is so heavily used, calling unique(horizons(spc)[[idname(spc)]]) strikes me as wasteful, although it may be necessary in some yet-to-be-revealed edge case.
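Roughly, the contrast is this (a minimal sketch of the idea, assuming one site row per profile -- not the literal aqp method definitions):

# old approach: scan and de-duplicate the full horizon-level ID vector
length_via_horizons <- function(spc) {
  length(unique(horizons(spc)[[idname(spc)]]))
}

# cheaper approach: the site slot already holds exactly one row per profile
length_via_site <- function(spc) {
  nrow(site(spc))
}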

EDIT: proper benchmark using identical idx

# construct a 10000 profile SPC
library(aqp)
df <- do.call('rbind',lapply(1:10000, random_profile))
depths(df) <- id ~ top + bottom

# specify a set of random profile indices (between 1 and 10000)
idx <- round(runif(1000, 1, 10000))

# benchmark (before and after changes)
microbenchmark::microbenchmark(horizons(df[idx,]), times = 1000)

Before:

> microbenchmark::microbenchmark(horizons(df[idx,]), times = 1000)
Unit: milliseconds
                expr      min       lq     mean   median       uq      max neval
 horizons(df[idx, ]) 11.26568 11.59819 16.22708 11.87358 14.24525 250.3832  1000

After:

> microbenchmark::microbenchmark(horizons(df[idx,]), times = 1000)
Unit: milliseconds
                expr      min       lq     mean   median       uq      max neval
 horizons(df[idx, ]) 7.371399 8.130469 11.65096 8.619881 10.22078 279.7761  1000


brownag commented May 25, 2020

[, j] -- horizon / slice (j) subsetting

# construct a 1000 profile SPC
df <- do.call('rbind',lapply(1:1000, random_profile))
depths(df) <- id ~ top + bottom

# specify a set of slice indices (for each profile)
idx <- 3:8

# benchmark (before and after changes)
microbenchmark::microbenchmark(horizons(df[,idx]))

Before:

> microbenchmark::microbenchmark(horizons(df[,idx]))
Unit: milliseconds
                expr      min       lq     mean   median       uq      max neval
 horizons(df[, idx]) 735.1274 821.9373 890.1635 878.3654 919.1142 1581.625   100

After the above commit (~50x faster, with gains scaling with the size of the SPC); this also fixes long-standing issue #89 pertaining to corruption of other slots when using j-only (horizon/slice) indexing:

> microbenchmark::microbenchmark(horizons(df[,idx]))
Unit: milliseconds
                expr     min       lq     mean   median       uq      max neval
 horizons(df[, idx]) 17.9027 18.23521 20.32547 18.65852 19.98068 41.87858   100

brownag added a commit that referenced this issue May 25, 2020
brownag added a commit that referenced this issue May 25, 2020
dylanbeaudette commented:

Excellent work. Thank you for tackling the thankless task of optimization. I'm all for merging these changes as long as all tests pass (aqp and soilDB with local NASIS data available for testing).

I'd like to chat next time we are in the office about the re-use of data vs. wastefully re-computing / extracting details from the SPC.

As for future updates: I'd like to build on data.table, which does a fine job of supporting index-based sub-setting and joins. We will get there soon; I've already got a prototype replacement for slice that is based on data.table and much faster.
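As a rough illustration of the kind of index-based subsetting and joins data.table provides (a standalone sketch with made-up data, unrelated to the slice prototype):

library(data.table)

# made-up, horizon-like table keyed on the profile ID
h <- data.table(id = rep(1:1000, each = 4), top = 0:3 * 10, bottom = 1:4 * 10)
setkey(h, id)

# keyed (binary search) subset by profile ID, and the equivalent join form
h[.(c(5, 42, 500))]
h[data.table(id = c(5, 42, 500)), on = "id"]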


brownag commented May 25, 2020

I have thought some more about what we could do to re-use data and intermediate calculations. Something that I think would be really cool is to store a hash table or ordering of the profile/horizon IDs in the SPC. Storing the LUT in memory (as an attribute on the RHS object) is how fastmatch makes subsequent lookups against the same table so much faster.
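To show what I mean by the cached LUT (a standalone sketch with made-up vectors; as I understand it, fastmatch stores the hash in a ".match.hash" attribute on the table):

library(fastmatch)

tbl <- as.character(sample(1e5))
x   <- as.character(sample(1e5, 100))

# first call builds the hash and caches it on the table object itself
fmatch(x, tbl)

# the cached hash lives in the ".match.hash" attribute of tbl
names(attributes(tbl))

# later lookups against the same table reuse the cached hash
fmatch(as.character(sample(1e5, 100)), tbl)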

data.table has neat features like uniqueN(), which is faster than length(unique(x)) when all you need is n -- and of course, optimized joins rather than match-based logic will be able to achieve much better end results.
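A quick sanity check of that uniqueN() claim (standalone, made-up vector):

library(data.table)

# vector with many repeated values
x <- sample(1e4, 1e6, replace = TRUE)

microbenchmark::microbenchmark(
  length(unique(x)),
  uniqueN(x),
  times = 100
)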

brownag mentioned this issue May 27, 2020
brownag linked a pull request May 27, 2020 that will close this issue
dylanbeaudette commented:

I like the idea of pre-computing / pre-hashing some of the internal details of the SPC that are queried all the time. The original motivation for things like unique(horizons(spc)[[idname(spc)]]) was to ensure that the results were always correct. Pre-computing / pre-hashing means re-computing / re-hashing when the contents change. We could simplify this process by writing a set of "hooks" that would be boilerplate for all functions that modify an SPC; a rough sketch follows. We would need to weigh the added complexity and possible efficiency gain vs. time spent re-computing / re-hashing all of these values. I think that we should move slowly on optimization until other bugs / issues are solved.
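A rough sketch of what such a hook might look like (hypothetical names, not a proposal for the actual aqp API):

# hypothetical helper: recompute cached values whenever a function modifies
# the horizon data, so the cache cannot go stale
.rebuild_spc_cache <- function(spc) {
  metadata(spc)$profile.id.order <- unique(horizons(spc)[[idname(spc)]])
  spc
}

# boilerplate at the end of every SPC-modifying function:
#   spc <- ...modify horizons / site...
#   spc <- .rebuild_spc_cache(spc)
#   spc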


brownag commented May 28, 2020

Agreed. I understand the original intent -- but in some of these functions, as the flame graphs above showed, we wind up ensuring our results are correct, repeatedly, even when the SPC hasn't changed.

I think that for something like the profile ID order, the efficiency gain would probably be worth it for large collections -- and moot for small ones. I wanted to just grab the low-hanging fruit with these commits, since there definitely were some optimizations to be made that did not require much change in logic yet gave decent gains.
