optimizing various SoilProfileCollection methods #135

Closed
brownag opened this issue May 25, 2020 · 7 comments · Fixed by #136
brownag commented May 25, 2020

library(profvis)
library(aqp)
library(soilDB)

data("loafercreek")

# add a constant horizon-level attribute to use as the boundary variability
loafercreek$foo <- rep(1, nrow(loafercreek))

# make some test data (10,000 simulated profiles) -- permute_profile() is defined in the @mollic branch
newdat <- permute_profile(loafercreek[1], n = 10000, boundary.attr = 'foo', min.thickness = 1)

brownag commented May 25, 2020

[i, ] -- profile (i) subsetting

test.fun <- function(n) {
  # draw n random profile indices between 1 and length(newdat); values may repeat
  idx <- round(runif(n, 1, length(newdat)))
  return(newdat[idx, ])
}

microbenchmark::microbenchmark(test.fun(10),
                               test.fun(100),
                               test.fun(1000),
                               test.fun(10000))
Unit: milliseconds
            expr      min       lq     mean   median       uq      max neval
    test.fun(10) 142.3052 147.5753 180.3458 164.2151 175.1167 557.4064   100
   test.fun(100) 146.5336 150.3979 179.9773 166.7710 172.6415 541.9834   100
  test.fun(1000) 169.5357 173.5430 203.8030 190.9973 195.3322 561.2541   100
 test.fun(10000) 263.1484 288.2051 319.6335 300.3561 316.3072 735.9368   100

profvis({test.fun(100000)})

(profvis flame graph for test.fun(100000))


brownag commented May 25, 2020

Here we see notable improvements (relative to above) from optimizing various parts of heavily used methods in the above two commits (on the ncss-tech/aqp@mollic branch) -- just by being careful to re-use variables and to avoid unnecessary calls to unique where possible.

Unit: milliseconds
            expr       min        lq      mean    median        uq      max neval
    test.fun(10)  14.04452  15.02876  31.91002  18.14201  37.22526 356.3600   100
   test.fun(100)  16.70464  18.60879  52.88639  38.11054  44.03619 445.7565   100
  test.fun(1000)  35.08107  40.16366  96.29470  58.46718  75.18114 638.2684   100
 test.fun(10000) 116.34838 156.54107 219.95638 180.80176 231.97105 639.7117   100

And with profiling (using the same call as in the previous post: generating 100,000 ids between 1 and 10,000 for an n = 10,000 SPC) we complete in approximately half the time. This is not an entirely fair comparison, because the previous [ method did not ensure that the input values of i were unique [or sorted]. Note that the baseline "overhead" associated with the smaller-n calls is considerably lower.

(profvis flame graph for test.fun(100000), after optimization)

These revisions do not rely on any new package dependencies. There are additional efficiencies that could be gained using fastmatch::fmatch and %fin%, and/or related facilities in data.table -- however, I think it is important that we have a stable, base R version of these tools that is as fast as it can be.
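For illustration, this is the kind of lookup where fastmatch tends to pay off (a standalone sketch with made-up vectors, not aqp code):

library(fastmatch)

# a large lookup table of IDs and a vector of IDs to locate
tbl <- as.character(seq_len(1e5))
x   <- sample(tbl, 1e4, replace = TRUE)

microbenchmark::microbenchmark(
  match(x, tbl),     # base R: hashes the table on every call
  fmatch(x, tbl),    # fastmatch: hashes tbl once and caches it
  x %in% tbl,
  x %fin% tbl,       # fastmatch analogue of %in%
  times = 20
)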

All tests currently pass. A major change I made is to how length(SPC) is calculated -- it seems like this should work, as long as the SPC is not corrupted. Since length is so heavily used, calling unique(horizons(spc)[[idname(spc)]]) strikes me as wasteful, although it may be necessary in some yet-to-be-revealed edge case.
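Roughly, the contrast is this (a minimal sketch of the idea, assuming one site row per profile -- not the literal aqp method definitions):

# old approach: scan and de-duplicate the full horizon-level ID vector
length_via_horizons <- function(spc) {
  length(unique(horizons(spc)[[idname(spc)]]))
}

# cheaper approach: the site slot already holds exactly one row per profile
length_via_site <- function(spc) {
  nrow(site(spc))
}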

EDIT: proper benchmark using identical idx

# construct a 10000 profile SPC
library(aqp)
df <- do.call('rbind',lapply(1:10000, random_profile))
depths(df) <- id ~ top + bottom

# specify a set of random profile indices (between 1 and 10000)
idx <- round(runif(1000, 1, 10000))

# benchmark (before and after changes)
microbenchmark::microbenchmark(horizons(df[idx,]), times = 1000)

Before:

> microbenchmark::microbenchmark(horizons(df[idx,]), times = 1000)
Unit: milliseconds
                expr      min       lq     mean   median       uq      max neval
 horizons(df[idx, ]) 11.26568 11.59819 16.22708 11.87358 14.24525 250.3832  1000

After:

> microbenchmark::microbenchmark(horizons(df[idx,]), times = 1000)
Unit: milliseconds
                expr      min       lq     mean   median       uq      max neval
 horizons(df[idx, ]) 7.371399 8.130469 11.65096 8.619881 10.22078 279.7761  1000


brownag commented May 25, 2020

[, j] -- horizon / slice (j) subsetting

# construct a 1000 profile SPC
df <- do.call('rbind',lapply(1:1000, random_profile))
depths(df) <- id ~ top + bottom

# specify a set of slice indices (for each profile)
idx <- 3:8

# benchmark (before and after changes)
microbenchmark::microbenchmark(horizons(df[,idx]))

Before:

> microbenchmark::microbenchmark(horizons(df[,idx]))
Unit: milliseconds
                expr      min       lq     mean   median       uq      max neval
 horizons(df[, idx]) 735.1274 821.9373 890.1635 878.3654 919.1142 1581.625   100

After the above commit (~50x faster, with gains scaling with the size of the SPC); this also fixes long-standing issue #89 pertaining to corruption of other slots when using j-only (horizon/slice) indexing:

> microbenchmark::microbenchmark(horizons(df[,idx]))
Unit: milliseconds
                expr     min       lq     mean   median       uq      max neval
 horizons(df[, idx]) 17.9027 18.23521 20.32547 18.65852 19.98068 41.87858   100

brownag added a commit that referenced this issue May 25, 2020
brownag added a commit that referenced this issue May 25, 2020
dylanbeaudette commented:

Excellent work. Thank you for tackling the thankless task of optimization. I'm all for merging these changes as long as all tests pass (aqp and soilDB with local NASIS data available for testing).

I'd like to chat next time we are in the office about the re-use of data vs. wastefully re-computing / extracting details from the SPC.

As for future updates: I'd like to build on data.table, which does a fine job of supporting index-based sub-setting and joins. We will get there soon; I've already got a prototype replacement for slice that is based on data.table and much faster.
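As a rough illustration of the kind of index-based subsetting and joins data.table provides (a standalone sketch with made-up data, unrelated to the slice prototype):

library(data.table)

# made-up, horizon-like table keyed on the profile ID
h <- data.table(id = rep(1:1000, each = 4), top = 0:3 * 10, bottom = 1:4 * 10)
setkey(h, id)

# keyed (binary search) subset by profile ID, and the equivalent join form
h[.(c(5, 42, 500))]
h[data.table(id = c(5, 42, 500)), on = "id"]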


brownag commented May 25, 2020

I have thought some more about what we could do to re-use data and intermediate calculations. Something that I think would be really cool is to store a hash table or ordering of the profile/horizon IDs in the SPC. Storing the LUT in memory (as an attribute on the RHS object) is how fastmatch makes subsequent lookups against the same table so much faster.
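To show what I mean by the cached LUT (a standalone sketch with made-up vectors; as I understand it, fastmatch stores the hash in a ".match.hash" attribute on the table):

library(fastmatch)

tbl <- as.character(sample(1e5))
x   <- as.character(sample(1e5, 100))

# first call builds the hash and caches it on the table object itself
fmatch(x, tbl)

# the cached hash lives in the ".match.hash" attribute of tbl
names(attributes(tbl))

# later lookups against the same table reuse the cached hash
fmatch(as.character(sample(1e5, 100)), tbl)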

data.table has neat features like uniqueN(), which is faster than length(unique(x)) when all you need is n -- and of course, optimized joins rather than match-based logic will be able to achieve much better end results.
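A quick sanity check of that uniqueN() claim (standalone, made-up vector):

library(data.table)

# vector with many repeated values
x <- sample(1e4, 1e6, replace = TRUE)

microbenchmark::microbenchmark(
  length(unique(x)),
  uniqueN(x),
  times = 100
)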

brownag mentioned this issue May 27, 2020
brownag linked a pull request May 27, 2020 that will close this issue
dylanbeaudette commented:

I like the idea of pre-computing / pre-hashing some of the internal details of the SPC that are queried all the time. The original motivation for things like unique(horizons(spc)[[idname(spc)]]) was to ensure that the results were always correct. Pre-computing / pre-hashing means re-computing / re-hashing when the contents change. We could simplify this process by writing a set of "hooks" that would be boilerplate for all functions that modify an SPC; a rough sketch follows. We would need to weigh the added complexity and possible efficiency gain vs. time spent re-computing / re-hashing all of these values. I think that we should move slowly on optimization until other bugs / issues are solved.
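A rough sketch of what such a hook might look like (hypothetical names, not a proposal for the actual aqp API):

# hypothetical helper: recompute cached values whenever a function modifies
# the horizon data, so the cache cannot go stale
.rebuild_spc_cache <- function(spc) {
  metadata(spc)$profile.id.order <- unique(horizons(spc)[[idname(spc)]])
  spc
}

# boilerplate at the end of every SPC-modifying function:
#   spc <- ...modify horizons / site...
#   spc <- .rebuild_spc_cache(spc)
#   spc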


brownag commented May 28, 2020

Agreed. I understand the original intent -- but in some of these functions, as the flame graphs above showed, we wind up ensuring our results are correct, repeatedly, even when the SPC hasn't changed.

I think that for something like the profile ID order, the efficiency gain would probably be worth it for large collections -- and moot for small ones. I wanted to just grab the low-hanging fruit with these commits, since there definitely were some optimizations to be made that did not require much change in logic yet gave decent gains.
