Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

melting integer64 columns together with other classes leads to wrong results #3087

Open
matthiasgomolka opened this issue Oct 1, 2018 · 15 comments

Comments

@matthiasgomolka
Copy link

I think I found a bug in melt.data.table(). I want to melt several columns of class integer and integer64. When I melt these columns into a single one, the values from the former integer64 columns show weird values.

I extended the MWE with a numeric column to show that melting integer together with numeric works fine.

suppressPackageStartupMessages(library(data.table))
#> Warning: package 'data.table' was built under R version 3.5.1
suppressPackageStartupMessages(library(bit64))

DT <- data.table(key.var = "A", col_int = 2147483647L, col_num = 2147483647, col_int64 = 2147483648)
DT[, col_int64 := as.integer64(col_int64)]
print(DT)
#>    key.var    col_int    col_num  col_int64
#> 1:       A 2147483647 2147483647 2147483648

sapply(DT, class)
#>     key.var     col_int     col_num   col_int64 
#> "character"   "integer"   "numeric" "integer64"

melt.data.table(DT, id.vars = "key.var")
#> Warning in melt.data.table(DT, id.vars = "key.var"):
#> 'measure.vars' [col_int, col_num, col_int64] are not all of the same
#> type. By order of hierarchy, the molten data value column will be of type
#> 'double'. All measure variables not of type 'double' will be coerced too.
#> Check DETAILS in ?melt.data.table for more on coercion.
#>    key.var  variable         value
#> 1:       A   col_int  2.147484e+09
#> 2:       A   col_num  2.147484e+09
#> 3:       A col_int64 1.060998e-314

Created on 2018-10-01 by the reprex package (v0.2.0).

Session info
devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.5.0 (2018-04-23)
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  German_Germany.1252         
#>  tz       Europe/Berlin               
#>  date     2018-10-01
#> Packages -----------------------------------------------------------------
#>  package    * version date       source        
#>  backports    1.1.2   2017-12-13 CRAN (R 3.5.0)
#>  base       * 3.5.0   2018-04-23 local         
#>  bit        * 1.1-14  2018-05-29 CRAN (R 3.5.0)
#>  bit64      * 0.9-7   2017-05-08 CRAN (R 3.5.0)
#>  compiler     3.5.0   2018-04-23 local         
#>  data.table * 1.11.4  2018-05-27 CRAN (R 3.5.1)
#>  datasets   * 3.5.0   2018-04-23 local         
#>  devtools     1.13.6  2018-06-27 CRAN (R 3.5.1)
#>  digest       0.6.15  2018-01-28 CRAN (R 3.5.1)
#>  evaluate     0.11    2018-07-17 CRAN (R 3.5.1)
#>  graphics   * 3.5.0   2018-04-23 local         
#>  grDevices  * 3.5.0   2018-04-23 local         
#>  htmltools    0.3.6   2017-04-28 CRAN (R 3.5.1)
#>  knitr        1.20    2018-02-20 CRAN (R 3.5.1)
#>  magrittr     1.5     2014-11-22 CRAN (R 3.5.1)
#>  memoise      1.1.0   2017-04-21 CRAN (R 3.5.1)
#>  methods    * 3.5.0   2018-04-23 local         
#>  Rcpp         0.12.18 2018-07-23 CRAN (R 3.5.1)
#>  rmarkdown    1.10    2018-06-11 CRAN (R 3.5.1)
#>  rprojroot    1.3-2   2018-01-03 CRAN (R 3.5.1)
#>  stats      * 3.5.0   2018-04-23 local         
#>  stringi      1.1.7   2018-03-12 CRAN (R 3.5.0)
#>  stringr      1.3.1   2018-05-10 CRAN (R 3.5.1)
#>  tools        3.5.0   2018-04-23 local         
#>  utils      * 3.5.0   2018-04-23 local         
#>  withr        2.1.2   2018-03-15 CRAN (R 3.5.1)
#>  yaml         2.2.0   2018-07-25 CRAN (R 3.5.1)

As you can see from the MWE, the former integer64 value is now something very small and obviously wrong. Even though there is a warning, it does not warn from the actual error which was produced.

NOTE: If all columns to melt are integer64 all works fine.

@franknarf1
Copy link
Contributor

The warning mentions type, not class. If you look at dput(DT), you'll see the internal value of the int64 is the double value seen in your melt result.

Btw, if you change the order of the columns, the results change:

DT <- data.table(key.var = "A", col_int = 2147483647L, col_num = 2147483647, col_int64 = 2147483648)
DT[, col_int64 := as.integer64(col_int64)]
setcolorder(DT, c("key.var", "col_int64"))
melt.data.table(DT, id.vars = "key.var")

#    key.var  variable         value
# 1:       A col_int64 1.060998e-314
# 2:       A   col_int  2.147484e+09
# 3:       A   col_num  2.147484e+09

I guess only the first columns attributes get carried over. Similar with...

> melt(data.table(a = as.IDate(Sys.Date()), b = 11L))
   variable      value
1:        a 2018-10-01
2:        b 1970-01-12
Warning message:
In melt.data.table(data.table(a = as.IDate(Sys.Date()), b = 11L)) :
  To be consistent with reshape2's melt, id.vars and measure.vars are internally guessed when both are 'NULL'. All non-numeric/integer/logical type columns are considered id.vars, which in this case are columns []. Consider providing at least one of 'id' or 'measure' vars in future.
> melt(data.table(a = as.IDate(Sys.Date()), b = 11L)[, 2:1])
   variable value
1:        b    11
2:        a 17805
Warning message:
In melt.data.table(data.table(a = as.IDate(Sys.Date()), b = 11L)[,  :
  To be consistent with reshape2's melt, id.vars and measure.vars are internally guessed when both are 'NULL'. All non-numeric/integer/logical type columns are considered id.vars, which in this case are columns []. Consider providing at least one of 'id' or 'measure' vars in future.

@matthiasgomolka
Copy link
Author

I figured out that the problem is not data.table specific. The same error occurs with integer64 columns in tibbles when gathering columns:

suppressPackageStartupMessages(library(magrittr))
#> Warning: package 'magrittr' was built under R version 3.5.1
suppressPackageStartupMessages(library(bit64))

tibble::tibble(key.var = "A",
               col_int = 2147483647L, 
               col_num = 2147483647, 
               col_int64 = 2147483648) %>% 
    dplyr::mutate(col_int64 = as.integer64(col_int64)) %>% 
    tidyr::gather(key = "variable", value = "value",
                  -key.var)
#> Warning: package 'bindrcpp' was built under R version 3.5.1
#> Warning: attributes are not identical across measure variables;
#> they will be dropped
#> # A tibble: 3 x 3
#>   key.var variable      value
#>   <chr>   <chr>         <dbl>
#> 1 A       col_int   2.15e+  9
#> 2 A       col_num   2.15e+  9
#> 3 A       col_int64 1.06e-314

Created on 2018-10-02 by the reprex package (v0.2.0).

So is it a bit64 problem?

@franknarf1
Copy link
Contributor

I don't follow. The message you quoted is...

Warning: attributes are not identical across measure variables; they will be dropped

You can inspect what the warning is about, like...

library(bit64)
library(magrittr)
tmp = tibble::tibble(key.var = "A",
               col_int = 2147483647L, 
               col_num = 2147483647, 
               col_int64 = 2147483648) %>% 
    dplyr::mutate(col_int64 = as.integer64(col_int64))

lapply(tmp, attributes)
# $key.var
# NULL

# $col_int
# NULL

# $col_num
# NULL

# $col_int64
# $col_int64$class
# [1] "integer64"

This is similar to the (silent) behavior you see with melt, where only the attributes of the first column are used. You can see from my example with IDate that this behavior is not bit64-specific...

@matthiasgomolka
Copy link
Author

Thanks. Now I understand the issue.

One important difference between the integer64 and the IDate is the fact that you can recover the dates but not the large integer, right?

@franknarf1
Copy link
Contributor

franknarf1 commented Oct 2, 2018

One important difference between the integer64 and the IDate is the fact that you can recover the dates but not the large integer, right?

Hm, yeah, except with some workarounds, like...

# example data
library(data.table)
library(bit64)
DT <- data.table(key.var = "A", col_int = 2147483647L, col_num = 2147483647, col_int64 = 2147483648)
DT[, col_int64 := as.integer64(col_int64)]
res = melt.data.table(DT, id.vars = "key.var")

# workaround to add attributes
for (nn in names(aa <- attributes(DT$col_int64))) setattr(res$value, nn, aa[[nn]])
rm(aa, nn)

# workaround to adjust values
res[variable != "col_int64", value := as.integer64(unclass(value))]

@matthiasgomolka
Copy link
Author

So it's no bug? Rather an inconvenience? If so, my takeaway would be to convert all columns to be melted / gathered to the correct single type before using melt...

@franknarf1
Copy link
Contributor

I don't think it's a bug, but maybe there could be a verbose message or warning when attributes are found in measure columns.

Anyway, I think you could leave the issue open if you want more feedback.

@anwjones
Copy link

anwjones commented Oct 8, 2018

I think this may be the same or a closely related issue:

library(bit64)
library(data.table)
dt <- data.table(I32=sample(10,10), I64=as.integer64(sample(10,10)))
dt[,max(.SD),.SDcols='I32'] #works correctly returns 10
dt[,max(.SD),.SDcols='I64'] #works incorrectly returns a value close to zero which is wrong and not an integer

@franknarf1
Copy link
Contributor

@anwjones Without data.table, your result is the same...

max(data.frame(as.integer64(sample(10,10))))
# [1] 4.940656e-323

To apply max to the vector instead of the table, you could do either of these...

# returning just the value
dt[, max(.SD[[1]]), .SDcols="I64"]
# or returning it in a table
dt[, lapply(.SD, max), .SDcols="I64"]

Hope it's helpful

@anwjones
Copy link

anwjones commented Oct 8, 2018

I see your solution as a work around rather than "its not a bug". After all a data.frame doesn't show this behavior.

library(data.table)
library(bit64)
df <- data.frame(I32=sample(10,10), I64=as.integer64(sample(10,10)))
dt <- data.table(I32=sample(10,10), I64=as.integer64(sample(10,10)))

max(dt[,2]) #returns a numeric 4.940656e-323
max(df[,2]) #returns 10, so data.frames work as expected
max(dt[,1]) #returns 10, so the wrong answer is not due to applying max over a one column data.table

@franknarf1
Copy link
Contributor

@anwjones Sorry, I wasn't clear. Look at:

> class(dt[, 2])
[1] "data.table" "data.frame"
> class(df[, 2])
[1] "integer64"
> class(df[, 2, drop=FALSE])
[1] "data.frame"

Data.table behaves like drop=FALSE here. The faq vignette("datatable-faq") explains that this behavior is deliberate.

.SD is a data.table; .SD[[1]] is the first column of the data.table; lapply(.SD, max) works because data.tables and data.frames are lists; and so on. I guess you know this already, but if not, the other vignettes are a good resource.

@anwjones
Copy link

anwjones commented Oct 9, 2018

Okay - last comment, then I'll go away. So why do the integer column and the integer64 column behave differently? Is this really the intended behavior ? I can understand NA as a result, but a numeric result which is not the maximum of anything in the data.table has to be wrong ;-(

@franknarf1
Copy link
Contributor

@anwjones I can't find a reference for it, but it seems to be that max on a data.frame is like max(unlist(DF)):

> library(bit64); L = list(as.integer64(10)); max(unlist(L))
[1] 4.940656e-323

So they behave differently because integer is an atomic type, while integer64 is an invented type built on top of the atomic type "numeric". If you ask on Stack Overflow, someone would probably know the full explanation.

Unless it's a package problem or feature request, I guess we'd probably better stop commenting here, yeah.

@MichaelChirico
Copy link
Member

@franknarf1 is this indeed the same as OP's issue? wondering whether to close as being not-data.table

@franknarf1
Copy link
Contributor

@MichaelChirico No; different problem.

OP's issue is that melt carries over attributes (from #702) only looking at the first measure variable / column. I don't know if there's a good alternative to that behavior that's worth doing, though maybe there could be a message/warning when attributes are ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants