join() crash #43

low-decarie · 2011-05-24T00:35:37Z

Thank you for the tremendously good work on this essential package.

My current script that causes the crash is too bulky for upload. I am working on an example script that will cause the same crash.

join() crashes my R session with:

*** caught segfault ***
address 0x0, cause 'memory not mapped'

Traceback:
1: .Call("split_indices", index, group, as.integer(n))
2: split_indices(seq_along(keys$y), keys$y, keys$n)
3: join_ids(x, y, by, all = TRUE)
4: join_all(x, y, by, type)
5: join(counts.transplant, counts.clamy, by = "Water.plot")

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace

Within RStudio, this causes the whole app to crash.

Thank you and have an excellent day,

Etienne

mndrs · 2011-07-29T21:09:55Z

I have this exact same issue. join() in plyr has crashed R 2.13.1 and 2.12.0 for me (as well as RStudio).

hadley · 2011-08-07T19:10:32Z

Here's a reproducible example from @imark:

m1<-data.frame(cl=c(1,2), file=c("hi", "low"))
m2<-data.frame(file=c("1776.txt", "About.txt"), actual=c(11.5, 4.5), stringsAsFactors=F)
join(m1, m2, "file")

brendano · 2011-08-28T16:43:14Z

I've been getting this too. There's something funny going on with factor vs. character join columns, and with the presence of NA's.

Factor vs. Character

Works:

d1 = data.frame(x=c('a','b'), y=1:2, stringsAsFactors=F)
d2 = data.frame(x=c('b','d'), z=1:2, stringsAsFactors=F)
join(d1,d2)

Works, even though the factors have different levels:

d1 = data.frame(x=c('a','b'), y=1:2)
d2 = data.frame(x=c('b','d'), z=1:2)
join(d1,d2)

Runs, but returns the wrong answer:

d1 = data.frame(x=c('a','b'), y=1:2, stringsAsFactors=F)
d2 = data.frame(x=c('b','d'), z=1:2)
join(d1,d2)

Crashes:

d1 = data.frame(x=c('a','b'), y=1:2)
d2 = data.frame(x=c('b','d'), z=1:2, stringsAsFactors=F)
join(d1,d2)

Specifically, it's a segfault in split_indices.

NA's in factors

When the right join column (under a left join) is a factor and has an NA in it, join() crashes.

Works:

d1 = data.frame(x=c('a','b'), y=1:2, stringsAsFactors=F)
d2 = data.frame(x=c('b',NA), z=1:2, stringsAsFactors=F)
join(d1,d2)

Works:

d1 = data.frame(x=c(NA,'b'), y=1:2)
d2 = data.frame(x=c('b','c'), z=1:2)
join(d1,d2)

Crashes:

d1 = data.frame(x=c('a','b'), y=1:2)
d2 = data.frame(x=c('b',NA), z=1:2)
join(d1,d2)

Again, the segfault is in split_indices.

NA's in numerics are fine

These all work. The problem seems restricted to factor vectors.

d1 = data.frame(x=c(10,11), y=1:2)
d2 = data.frame(x=c(11,12), z=1:2)
join(d1,d2)

d1 = data.frame(x=c(10,11), y=1:2)
d2 = data.frame(x=c(11,NA), z=1:2)
join(d1,d2)

d1 = data.frame(x=c(NA,11), y=1:2)
d2 = data.frame(x=c(11,12), z=1:2)
join(d1,d2)

Non-determinism

Sometimes, instead of a segfault, I get a benign error message in split_indices. If I start R fresh and use a setup similar to the NA version above, just with larger data frames:

d1 = data.frame(x=letters[1:4], y=1:4)
d2 = data.frame(x=letters[2:5], z=1:4)
d2$x[2] = NA
join(d1,d2)

I only get this error:

Error in split_indices(seq_along(keys$y), keys$y, keys$n) : 
  INTEGER() can only be applied to a 'integer', not a 'character'
Calls: join -> join_all -> join_ids -> split_indices -> .Call

(which, by the way, seems strange since the columns are factors, not characters.)

But if I do it a few more times, I get the segfault, same as all the above crashes:

 *** caught segfault ***
address 0x202, cause 'memory not mapped'

Traceback:
 1: .Call("split_indices", index, group, as.integer(n))
 2: split_indices(seq_along(keys$y), keys$y, keys$n)
 3: join_ids(x, y, by, all = TRUE)
 4: join_all(x, y, by, type)
 5: join(d1, d2)
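
A possible workaround until this is fixed, sketched under the assumption that the crashes above come from factor join columns: coerce the join columns to character in both data frames before calling join(), since the character-to-character cases above all ran without crashing, including the one with an NA. The helper name harmonise_keys is hypothetical and not part of plyr.

```r
# Hypothetical helper (not part of plyr): convert factor join columns to
# character in both data frames so the key columns share one type.
harmonise_keys <- function(x, y, by = intersect(names(x), names(y))) {
  for (col in by) {
    if (is.factor(x[[col]])) x[[col]] <- as.character(x[[col]])
    if (is.factor(y[[col]])) y[[col]] <- as.character(y[[col]])
  }
  list(x = x, y = y)
}

# Mixed factor/character keys with an NA, mirroring the cases reported above.
# stringsAsFactors is set explicitly so the column types do not depend on
# the R version's default.
d1 <- data.frame(x = c("a", "b"), y = 1:2, stringsAsFactors = TRUE)
d2 <- data.frame(x = c("b", NA), z = 1:2, stringsAsFactors = FALSE)

keys <- harmonise_keys(d1, d2)

# Both key columns are now character, so this matches the working case:
# plyr::join(keys$x, keys$y)
```

This only sidesteps the type mismatch; it does not address whatever goes wrong inside split_indices.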

hadley closed this as completed in e493ef4 Oct 30, 2011
