Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make aqp::union() robust to profile ID / site attr collisions. #161

Closed
dylanbeaudette opened this issue Aug 4, 2020 · 8 comments · Fixed by #215
Closed

Make aqp::union() robust to profile ID / site attr collisions. #161

dylanbeaudette opened this issue Aug 4, 2020 · 8 comments · Fixed by #215

Comments

@dylanbeaudette
Copy link
Member

dylanbeaudette commented Aug 4, 2020

library(aqp)
library(soilDB)

x <- fetchOSD('musick')

x$hd <- hzDistinctnessCodeToOffset(x$distinctness, codes=c('very abrupt', 'abrupt', 'clear', 'gradual', 'diffuse'))
x$code <- 'OSD'

# hack: simulation with sd=0 -> replication of original data
s <- sim(x, n = 6, hz.sd = 0)

s$hd <- rep(NA, times=nrow(s))
l <- list()

hzdo <- c(0, 0.5, 2, 5, 15, 20)/2
for(i in seq_along(hzdo)) {
  ss <- s[i, ]
  ss$hd <- hzdo[i]
  l[[i]] <- ss
}

s <- aqp::union(l)
s$code <- c('none', 'very abrupt', 'abrupt', 'clear', 'gradual', 'diffuse')

## error on 2020-08-04
# https://github.com/ncss-tech/aqp/issues/161
z <- aqp::union(list(x, s))

## removing old ID "fixes" error:
# s$id <- NULL
# z <- aqp::union(list(x, s))

# no error when used like this
z <- aqp::union(list(s, x))

par(mar=c(0,0,1,0))
plotSPC(z, width=0.33, name.style='center-center', print.id=TRUE, hz.distinctness.offset = 'hd', label='code', hz.depths=TRUE, cex.names=0.8, plot.depth.axis=FALSE)
title('Horizon Boundary Distinctness', line=-1)
@dylanbeaudette
Copy link
Member Author

dylanbeaudette commented Aug 4, 2020

Relevant traceback: seems that merge.data.frame is throwing the error, as invoked by site<-. This happens later, after rbindfill or rbindlist.

9: stop(ngettext(sum(bad), "'by' must specify a uniquely valid column", 
       "'by' must specify uniquely valid columns"), domain = NA)
8: fix.by(by.y, y)
7: merge.data.frame(s, value, all.x = TRUE, sort = FALSE)
6: merge(s, value, all.x = TRUE, sort = FALSE)
5: withCallingHandlers(expr, message = function(c) invokeRestart("muffleMessage"))
4: suppressMessages(site.new <- merge(s, value, all.x = TRUE, sort = FALSE))
3: `site<-`(`*tmp*`, value = .as.data.frame.aqp(o.s, o.df.class))
2: `site<-`(`*tmp*`, value = .as.data.frame.aqp(o.s, o.df.class))
1: aqp::union(list(x, s))

@brownag
Copy link
Member

brownag commented Aug 4, 2020

Yep, this arises from unanticipated and rare? differences in data.table::rbindlist versus plyr::rbind.fill

@dylanbeaudette dylanbeaudette changed the title aqp::union() -- Error in fix.by(by.y, y) : 'by' must specify a uniquely valid column Make aqp::union() robust to profile ID / site attr collisions. Aug 4, 2020
@dylanbeaudette
Copy link
Member Author

Take a look at permute_profile for ideas on how to make union more robust.

@dylanbeaudette
Copy link
Member Author

dylanbeaudette commented Aug 4, 2020

Another test case to consider.

library(aqp)
library(soilDB)

x <- fetchOSD('musick')

# replication of original data... maybe we need a new function for this
horizons(x)$bdy <- 0
s <- permute_profile(x, n=6, boundary.attr = 'bdy')

# Error in aqp::union(list(x, s)) : non-unique profile IDs detected
z <- aqp::union(list(x, s))

# Error: duplicate names in new site / existing horizon data not allowed
z <- aqp::union(list(s, x))

@brownag
Copy link
Member

brownag commented Aug 4, 2020

From debug inside of union

> o.s
[[1]]
      id soiltaxclasslastupdated mlraoffice series_status                                                  family
1 MUSICK  2004-04-20 00:00:00+00          2   established fine-loamy, mixed, semiactive, mesic ultic haploxeralfs
  soilorder suborder   greatgroup           subgroup tax_partsize tax_partsizemod tax_ceactcl tax_reaction tax_tempcl
1  alfisols  xeralfs haploxeralfs ultic haploxeralfs   fine-loamy              NA  semiactive           NA      mesic
  originyear establishedyear descriptiondateinitial descriptiondateupdated benchmarksoilflag statsgoflag
1         NA            1963 1998-11-01 00:00:00+00 2018-12-17 17:21:57+00                 0           1
            objwlupdated           recwlupdated typelocstareaiidref typelocstareatypeiidref soilseriesiid
1 2018-12-17 17:21:57+00 2018-12-17 17:21:57+00                6662                       3         14786
  soilseriesdbiidref grpiidref tax_minclass subgroup_mod greatgroup_mod drainagecl    ac n_polygons code
1                113     23491        mixed        ultic          haplo       well 64113       1961  OSD

[[2]]
  id     id        code
1  1 MUSICK        none
2  2 MUSICK very abrupt
3  3 MUSICK      abrupt
4  4 MUSICK       clear
5  5 MUSICK     gradual
6  6 MUSICK     diffuse

This is a special case of how non-conformal IDs are handled by union. It runs afoul of the SPC rebuilding: union only recently uses depths<- so there were in the past many possibilities to create SPCs that would not pass modern validity checks or might have strange interactions with their ID column. Further, it appears that rbind.fill only takes the first of a duplicate column name whereas rbindlist retains both.

The retaining of original id by sim and calculation of a new .new_id interferes with the default heuristic of union that takes the metadata/slots of the first element in the list -- in that it assigns the name id to .new_id effectively duplicating it (such that the site<- left join fails)

So there needs to be a message, or something, for the nonconformal ID selection, possibly recalculation of a brand new profile_id and more pre-error checking before doing merges that can fail so at least a more informative error is returned?

Temporary fix is to calculate a new profile ID column for the two SPCs that is unique in both name and value prior to unioning.

@dylanbeaudette
Copy link
Member Author

Nice, this is what I needed to see. I think that the current implementation took a shortcut by re-naming the ID column vs. making a new one (I recall debating this in my head).

@brownag
Copy link
Member

brownag commented Sep 26, 2020

I think this has been fixed in this draft PR. #168

Here is a quick demo of the functionality. Still probably need to further try to break it.

# get the union branch after eb59817
remotes::install_github('ncss-tech/aqp@union')

library(aqp)
library(soilDB)

x <- fetchOSD('musick')
a <- fetchOSD('sierra')
b <- fetchOSD('holland')
y <- fetchKSSL("musick")
c <- fetchKSSL("sierra")
d <- fetchKSSL("holland")

# do a sim
f <- sim(c[1,], n = 6)

# do some permute_profile
horizons(x)$bdy <- 0
horizons(a)$bdy <- 0
horizons(b)$bdy <- 0

s <- permute_profile(x, n = 6, boundary.attr = 'bdy')
profile_id(s) <- 7:12
u <- permute_profile(a, n = 6, boundary.attr = 'bdy', new.idname = "uID")
profile_id(u) <- 13:18
v <- permute_profile(b, n = 6, boundary.attr = 'bdy', new.idname = "vID")
profile_id(v) <- 19:24

# Warning message:
# template profile ID 'idname' exists as a non-unique value in SPC element #X, trying 'alternate'

z <- aqp::union(list(y, c, x, a, u, b, f, d, s, v)) # 7, 8
z <- aqp::union(list(b, y, u, c, v, d, x, s, a))    # 3
z <- aqp::union(list(x, d, a, f, u, c, s, b, y, v)) # 5
z <- aqp::union(list(s, d, a, f, b, y, x, u, c, v)) # 2, 4

# ok, now, break it ...
c(idname(s),idname(u),idname(v))

# works without warning normally...
z <- aqp::union(list(s, d, f))

# create site attrs so no way to resolve the existing siteNames / new name conflict
site(s)$pedon_key <- LETTERS[1]
site(s)$.new_id <- letters[1]
site(d)$pID <- LETTERS[2]
site(d)$.new_id <- letters[2]
site(f)$pID <- LETTERS[3]
site(f)$pedon_key <- letters[3]

# this should error after exhausting all options for names, despite unique ID values being available
z <- aqp::union(list(s, d, f))

# Error: unable to identify a safe existing profile ID name to use for result
# In addition: Warning messages:
# 1: template profile ID '.new_id' exists as a non-unique value in SPC element #1, trying 'pID'
# 2: template profile ID 'pID' exists as a non-unique value in SPC element #2, trying 'pedon_key'
# 3: template profile ID 'pedon_key' exists as a non-unique value in SPC element #3, trying '.new_id'

# this works because "foo" doesn't exist in any SPC
z <- pbindlist(list(s, d, f), new.idname = "foo")

@brownag
Copy link
Member

brownag commented Sep 26, 2020

#168

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants