melting integer64 columns together with other classes leads to wrong results #3087

matthiasgomolka · 2018-10-01T13:44:01Z

I think I found a bug in melt.data.table(). I want to melt several columns of class integer and integer64. When I melt these columns into a single one, the values from the former integer64 columns show weird values.

I extended the MWE with a numeric column to show that melting integer together with numeric works fine.

suppressPackageStartupMessages(library(data.table))
#> Warning: package 'data.table' was built under R version 3.5.1
suppressPackageStartupMessages(library(bit64))

DT <- data.table(key.var = "A", col_int = 2147483647L, col_num = 2147483647, col_int64 = 2147483648)
DT[, col_int64 := as.integer64(col_int64)]
print(DT)
#>    key.var    col_int    col_num  col_int64
#> 1:       A 2147483647 2147483647 2147483648

sapply(DT, class)
#>     key.var     col_int     col_num   col_int64 
#> "character"   "integer"   "numeric" "integer64"

melt.data.table(DT, id.vars = "key.var")
#> Warning in melt.data.table(DT, id.vars = "key.var"):
#> 'measure.vars' [col_int, col_num, col_int64] are not all of the same
#> type. By order of hierarchy, the molten data value column will be of type
#> 'double'. All measure variables not of type 'double' will be coerced too.
#> Check DETAILS in ?melt.data.table for more on coercion.
#>    key.var  variable         value
#> 1:       A   col_int  2.147484e+09
#> 2:       A   col_num  2.147484e+09
#> 3:       A col_int64 1.060998e-314

Created on 2018-10-01 by the reprex package (v0.2.0).

Session info

devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.5.0 (2018-04-23)
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  German_Germany.1252         
#>  tz       Europe/Berlin               
#>  date     2018-10-01
#> Packages -----------------------------------------------------------------
#>  package    * version date       source        
#>  backports    1.1.2   2017-12-13 CRAN (R 3.5.0)
#>  base       * 3.5.0   2018-04-23 local         
#>  bit        * 1.1-14  2018-05-29 CRAN (R 3.5.0)
#>  bit64      * 0.9-7   2017-05-08 CRAN (R 3.5.0)
#>  compiler     3.5.0   2018-04-23 local         
#>  data.table * 1.11.4  2018-05-27 CRAN (R 3.5.1)
#>  datasets   * 3.5.0   2018-04-23 local         
#>  devtools     1.13.6  2018-06-27 CRAN (R 3.5.1)
#>  digest       0.6.15  2018-01-28 CRAN (R 3.5.1)
#>  evaluate     0.11    2018-07-17 CRAN (R 3.5.1)
#>  graphics   * 3.5.0   2018-04-23 local         
#>  grDevices  * 3.5.0   2018-04-23 local         
#>  htmltools    0.3.6   2017-04-28 CRAN (R 3.5.1)
#>  knitr        1.20    2018-02-20 CRAN (R 3.5.1)
#>  magrittr     1.5     2014-11-22 CRAN (R 3.5.1)
#>  memoise      1.1.0   2017-04-21 CRAN (R 3.5.1)
#>  methods    * 3.5.0   2018-04-23 local         
#>  Rcpp         0.12.18 2018-07-23 CRAN (R 3.5.1)
#>  rmarkdown    1.10    2018-06-11 CRAN (R 3.5.1)
#>  rprojroot    1.3-2   2018-01-03 CRAN (R 3.5.1)
#>  stats      * 3.5.0   2018-04-23 local         
#>  stringi      1.1.7   2018-03-12 CRAN (R 3.5.0)
#>  stringr      1.3.1   2018-05-10 CRAN (R 3.5.1)
#>  tools        3.5.0   2018-04-23 local         
#>  utils      * 3.5.0   2018-04-23 local         
#>  withr        2.1.2   2018-03-15 CRAN (R 3.5.1)
#>  yaml         2.2.0   2018-07-25 CRAN (R 3.5.1)

As you can see from the MWE, the former integer64 value is now something very small and obviously wrong. Even though there is a warning, it does not warn from the actual error which was produced.

NOTE: If all columns to melt are integer64 all works fine.

The text was updated successfully, but these errors were encountered:

franknarf1 · 2018-10-01T15:17:58Z

The warning mentions type, not class. If you look at dput(DT), you'll see the internal value of the int64 is the double value seen in your melt result.

Btw, if you change the order of the columns, the results change:

DT <- data.table(key.var = "A", col_int = 2147483647L, col_num = 2147483647, col_int64 = 2147483648)
DT[, col_int64 := as.integer64(col_int64)]
setcolorder(DT, c("key.var", "col_int64"))
melt.data.table(DT, id.vars = "key.var")

#    key.var  variable         value
# 1:       A col_int64 1.060998e-314
# 2:       A   col_int  2.147484e+09
# 3:       A   col_num  2.147484e+09

I guess only the first columns attributes get carried over. Similar with...

> melt(data.table(a = as.IDate(Sys.Date()), b = 11L))
   variable      value
1:        a 2018-10-01
2:        b 1970-01-12
Warning message:
In melt.data.table(data.table(a = as.IDate(Sys.Date()), b = 11L)) :
  To be consistent with reshape2's melt, id.vars and measure.vars are internally guessed when both are 'NULL'. All non-numeric/integer/logical type columns are considered id.vars, which in this case are columns []. Consider providing at least one of 'id' or 'measure' vars in future.
> melt(data.table(a = as.IDate(Sys.Date()), b = 11L)[, 2:1])
   variable value
1:        b    11
2:        a 17805
Warning message:
In melt.data.table(data.table(a = as.IDate(Sys.Date()), b = 11L)[,  :
  To be consistent with reshape2's melt, id.vars and measure.vars are internally guessed when both are 'NULL'. All non-numeric/integer/logical type columns are considered id.vars, which in this case are columns []. Consider providing at least one of 'id' or 'measure' vars in future.

matthiasgomolka · 2018-10-02T06:20:50Z

I figured out that the problem is not data.table specific. The same error occurs with integer64 columns in tibbles when gathering columns:

suppressPackageStartupMessages(library(magrittr))
#> Warning: package 'magrittr' was built under R version 3.5.1
suppressPackageStartupMessages(library(bit64))

tibble::tibble(key.var = "A",
               col_int = 2147483647L, 
               col_num = 2147483647, 
               col_int64 = 2147483648) %>% 
    dplyr::mutate(col_int64 = as.integer64(col_int64)) %>% 
    tidyr::gather(key = "variable", value = "value",
                  -key.var)
#> Warning: package 'bindrcpp' was built under R version 3.5.1
#> Warning: attributes are not identical across measure variables;
#> they will be dropped
#> # A tibble: 3 x 3
#>   key.var variable      value
#>   <chr>   <chr>         <dbl>
#> 1 A       col_int   2.15e+  9
#> 2 A       col_num   2.15e+  9
#> 3 A       col_int64 1.06e-314

Created on 2018-10-02 by the reprex package (v0.2.0).

So is it a bit64 problem?

franknarf1 · 2018-10-02T16:06:39Z

I don't follow. The message you quoted is...

Warning: attributes are not identical across measure variables; they will be dropped

You can inspect what the warning is about, like...

library(bit64)
library(magrittr)
tmp = tibble::tibble(key.var = "A",
               col_int = 2147483647L, 
               col_num = 2147483647, 
               col_int64 = 2147483648) %>% 
    dplyr::mutate(col_int64 = as.integer64(col_int64))

lapply(tmp, attributes)
# $key.var
# NULL

# $col_int
# NULL

# $col_num
# NULL

# $col_int64
# $col_int64$class
# [1] "integer64"

This is similar to the (silent) behavior you see with melt, where only the attributes of the first column are used. You can see from my example with IDate that this behavior is not bit64-specific...

matthiasgomolka · 2018-10-02T17:09:03Z

Thanks. Now I understand the issue.

One important difference between the integer64 and the IDate is the fact that you can recover the dates but not the large integer, right?

franknarf1 · 2018-10-02T17:36:38Z

One important difference between the integer64 and the IDate is the fact that you can recover the dates but not the large integer, right?

Hm, yeah, except with some workarounds, like...

# example data
library(data.table)
library(bit64)
DT <- data.table(key.var = "A", col_int = 2147483647L, col_num = 2147483647, col_int64 = 2147483648)
DT[, col_int64 := as.integer64(col_int64)]
res = melt.data.table(DT, id.vars = "key.var")

# workaround to add attributes
for (nn in names(aa <- attributes(DT$col_int64))) setattr(res$value, nn, aa[[nn]])
rm(aa, nn)

# workaround to adjust values
res[variable != "col_int64", value := as.integer64(unclass(value))]

matthiasgomolka · 2018-10-04T11:43:10Z

So it's no bug? Rather an inconvenience? If so, my takeaway would be to convert all columns to be melted / gathered to the correct single type before using melt...

franknarf1 · 2018-10-04T17:40:13Z

I don't think it's a bug, but maybe there could be a verbose message or warning when attributes are found in measure columns.

Anyway, I think you could leave the issue open if you want more feedback.

anwjones · 2018-10-08T05:24:02Z

I think this may be the same or a closely related issue:

library(bit64)
library(data.table)
dt <- data.table(I32=sample(10,10), I64=as.integer64(sample(10,10)))
dt[,max(.SD),.SDcols='I32'] #works correctly returns 10
dt[,max(.SD),.SDcols='I64'] #works incorrectly returns a value close to zero which is wrong and not an integer

franknarf1 · 2018-10-08T17:08:03Z

@anwjones Without data.table, your result is the same...

max(data.frame(as.integer64(sample(10,10))))
# [1] 4.940656e-323

To apply max to the vector instead of the table, you could do either of these...

# returning just the value
dt[, max(.SD[[1]]), .SDcols="I64"]
# or returning it in a table
dt[, lapply(.SD, max), .SDcols="I64"]

Hope it's helpful

anwjones · 2018-10-08T23:34:09Z

I see your solution as a work around rather than "its not a bug". After all a data.frame doesn't show this behavior.

library(data.table)
library(bit64)
df <- data.frame(I32=sample(10,10), I64=as.integer64(sample(10,10)))
dt <- data.table(I32=sample(10,10), I64=as.integer64(sample(10,10)))

max(dt[,2]) #returns a numeric 4.940656e-323
max(df[,2]) #returns 10, so data.frames work as expected
max(dt[,1]) #returns 10, so the wrong answer is not due to applying max over a one column data.table

franknarf1 · 2018-10-09T00:28:26Z

@anwjones Sorry, I wasn't clear. Look at:

> class(dt[, 2])
[1] "data.table" "data.frame"
> class(df[, 2])
[1] "integer64"
> class(df[, 2, drop=FALSE])
[1] "data.frame"

Data.table behaves like drop=FALSE here. The faq vignette("datatable-faq") explains that this behavior is deliberate.

.SD is a data.table; .SD[[1]] is the first column of the data.table; lapply(.SD, max) works because data.tables and data.frames are lists; and so on. I guess you know this already, but if not, the other vignettes are a good resource.

anwjones · 2018-10-09T01:45:59Z

Okay - last comment, then I'll go away. So why do the integer column and the integer64 column behave differently? Is this really the intended behavior ? I can understand NA as a result, but a numeric result which is not the maximum of anything in the data.table has to be wrong ;-(

franknarf1 · 2018-10-09T02:00:54Z

@anwjones I can't find a reference for it, but it seems to be that max on a data.frame is like max(unlist(DF)):

> library(bit64); L = list(as.integer64(10)); max(unlist(L))
[1] 4.940656e-323

So they behave differently because integer is an atomic type, while integer64 is an invented type built on top of the atomic type "numeric". If you ask on Stack Overflow, someone would probably know the full explanation.

Unless it's a package problem or feature request, I guess we'd probably better stop commenting here, yeah.

MichaelChirico · 2018-10-09T03:41:42Z

@franknarf1 is this indeed the same as OP's issue? wondering whether to close as being not-data.table

franknarf1 · 2018-10-09T12:44:33Z

@MichaelChirico No; different problem.

OP's issue is that melt carries over attributes (from #702) only looking at the first measure variable / column. I don't know if there's a good alternative to that behavior that's worth doing, though maybe there could be a message/warning when attributes are ignored.

jangorecki added the bit64 label Jan 30, 2019

jangorecki added the reshape dcast melt label Apr 5, 2020

MichaelChirico added the idate/itime label Sep 8, 2020

MichaelChirico mentioned this issue Oct 22, 2020

Joining tables based on time with different time-zones - no warning #4773

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

melting integer64 columns together with other classes leads to wrong results #3087

melting integer64 columns together with other classes leads to wrong results #3087

matthiasgomolka commented Oct 1, 2018

franknarf1 commented Oct 1, 2018

matthiasgomolka commented Oct 2, 2018

franknarf1 commented Oct 2, 2018

matthiasgomolka commented Oct 2, 2018

franknarf1 commented Oct 2, 2018 •

edited

Loading

matthiasgomolka commented Oct 4, 2018

franknarf1 commented Oct 4, 2018

anwjones commented Oct 8, 2018

franknarf1 commented Oct 8, 2018

anwjones commented Oct 8, 2018

franknarf1 commented Oct 9, 2018

anwjones commented Oct 9, 2018

franknarf1 commented Oct 9, 2018

MichaelChirico commented Oct 9, 2018

franknarf1 commented Oct 9, 2018

melting integer64 columns together with other classes leads to wrong results #3087

melting integer64 columns together with other classes leads to wrong results #3087

Comments

matthiasgomolka commented Oct 1, 2018

franknarf1 commented Oct 1, 2018

matthiasgomolka commented Oct 2, 2018

franknarf1 commented Oct 2, 2018

matthiasgomolka commented Oct 2, 2018

franknarf1 commented Oct 2, 2018 • edited Loading

matthiasgomolka commented Oct 4, 2018

franknarf1 commented Oct 4, 2018

anwjones commented Oct 8, 2018

franknarf1 commented Oct 8, 2018

anwjones commented Oct 8, 2018

franknarf1 commented Oct 9, 2018

anwjones commented Oct 9, 2018

franknarf1 commented Oct 9, 2018

MichaelChirico commented Oct 9, 2018

franknarf1 commented Oct 9, 2018

franknarf1 commented Oct 2, 2018 •

edited

Loading