Bug with POSIX results after grouping? #689

Closed
geneorama opened this issue Jun 9, 2014 · 21 comments

@geneorama

I ran into this problem last week, and I eventually found that using "IDate" solves the problem. However, I still would call this a bug.

In my example I start with a data table that has two variables:

  • LEVELS: Two levels, which will be used for grouping
  • DATES: Sequence of POSIXct dates, with timezone = GMT

I create a new column "MINDATE" for "LEVELS" == "a"

I expect "MINDATE" will be "2010-01-01" or NA, but it becomes "2009-12-31 18:00:00" or NA. The original date column is GMT, but MINDATE becomes GMT-6, which is my local timezone.

Thanks,

Gene

library(data.table)

## Create the example
df <- data.table(LEVELS = rep(letters[1:2], each=5),
                 DATES = as.POSIXct('2010-01-01', tz="gmt") + 
                     seq(0, 86400*9, 86400))

## Create a new column with the minimum date
df[LEVELS=="a", MINDATE := min(DATES), LEVELS]

## The results are not what I expect
df

#    LEVELS      DATES             MINDATE
#1:      a 2010-01-01 2009-12-31 18:00:00
#2:      a 2010-01-02 2009-12-31 18:00:00
#3:      a 2010-01-03 2009-12-31 18:00:00
#4:      a 2010-01-04 2009-12-31 18:00:00
#5:      a 2010-01-05 2009-12-31 18:00:00
#6:      b 2010-01-06                <NA>
#7:      b 2010-01-07                <NA>
#8:      b 2010-01-08                <NA>
#9:      b 2010-01-09                <NA>
#10:     b 2010-01-10                <NA>
# >
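
For readers who hit the same output on 1.9.2, the IDate workaround mentioned at the top of this report can be sketched roughly as follows. This is only an illustration and assumes the time of day is not needed; the column names `DAY` and `MINDAY` are made up for the example and are not part of the original data.

```r
library(data.table)

# Same example data as above, written with an upper-case "GMT" zone name.
df <- data.table(LEVELS = rep(letters[1:2], each = 5),
                 DATES = as.POSIXct("2010-01-01", tz = "GMT") +
                     seq(0, 86400 * 9, 86400))

# Hypothetical helper column: keep only the calendar date as an IDate,
# so the grouped result carries no time zone attribute that could be lost.
df[, DAY := as.IDate(DATES)]

# Grouped minimum on the IDate column instead of the POSIXct column.
df[LEVELS == "a", MINDAY := min(DAY), by = LEVELS]

df
```

The trade-off is that an IDate stores whole days only, so this is not a substitute when hours and minutes matter.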

Also, I'm adding in my system info for the two machines that I used to test this:

--------------- COMPUTER 1 SESSION INFO ---------------

> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: i386-w64-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.2 geneorama_1.2   

loaded via a namespace (and not attached):
 [1] bitops_1.0-6    caTools_1.17    digest_0.6.4    httpuv_1.3.0    jsonlite_0.9.7  plyr_1.8.1      Rcpp_0.11.1    
 [8] reshape2_1.4    rjson_0.2.13    RJSONIO_1.2-0.2 rmongodb_1.6.5  shiny_0.9.1     stringr_0.6.2   tools_3.1.0    
[15] xtable_1.7-3 

--------------- COMPUTER 2 SESSION INFO ---------------

> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C         LC_TIME=C            LC_COLLATE=C        
 [5] LC_MONETARY=C        LC_MESSAGES=C        LC_PAPER=C           LC_NAME=C           
 [9] LC_ADDRESS=C         LC_TELEPHONE=C       LC_MEASUREMENT=C     LC_IDENTIFICATION=C 

attached base packages:
[1] stats4    grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] plyr_1.8.1                      C50_0.1.0-19                   
 [3] MASS_7.3-29                     partykit_0.8-0                 
 [5] AppliedPredictiveModeling_1.1-5 e1071_1.6-3                    
 [7] party_1.0-14                    modeltools_0.2-21              
 [9] strucchange_1.5-0               sandwich_2.3-0                 
[11] zoo_1.7-11                      rpart_4.1-3                    
[13] corrplot_0.73                   reshape2_1.4                   
[15] caret_6.0-30                    lattice_0.20-29                
[17] ggplot2_1.0.0                   data.table_1.9.2               
[19] geneorama_1.2                  

loaded via a namespace (and not attached):
 [1] BradleyTerry2_1.0-4 CORElearn_0.9.43    Matrix_1.1-3        Rcpp_0.11.1        
 [5] RcppEigen_0.3.2.1.2 brglm_0.5-9         car_2.0-20          class_7.3-9        
 [9] cluster_1.14.4      codetools_0.2-8     coin_1.0-23         colorspace_1.2-4   
[13] compiler_3.0.2      digest_0.6.4        foreach_1.4.2       gnm_1.0-7          
[17] gtable_0.1.2        gtools_3.4.1        iterators_1.0.7     lme4_1.1-6         
[21] minqa_1.2.3         munsell_0.4.2       mvtnorm_0.9-99992   nlme_3.1-111       
[25] nnet_7.3-7          proto_0.3-10        qvcalc_0.8-8        relimp_1.0-3       
[29] scales_0.2.4        splines_3.0.2       stringr_0.6.2       survival_2.37-4    
[33] tcltk_3.0.2         tools_3.0.2        
@arunsrinivasan
Member

geneorama,

What do you get when you do:

with(df, ave(DATES, LEVELS, FUN=min))

??

@arunsrinivasan
Member

Well, if you get the same result using ave, then it sounds like a base R issue (if at all). Why do you think it's a data.table issue?
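
The comparison being asked for here can be sketched without data.table at all. The snippet below is a minimal base R check, assuming the same dates and groups as the original report; the variable names `dates`, `grp` and `base_min` are illustrative only.

```r
# Base R only: grouped minimum via ave(), no data.table involved.
dates <- as.POSIXct("2010-01-01", tz = "GMT") + seq(0, 86400 * 9, 86400)
grp   <- rep(letters[1:2], each = 5)

base_min <- ave(dates, grp, FUN = min)

# If base R were responsible, the time zone attribute would be expected
# to go missing here as well.
attr(base_min, "tzone")

# Render the first row of each group explicitly in GMT.
format(base_min[c(1, 6)], "%Y-%m-%d %H:%M:%S", tz = "GMT")
```

If the attribute survives in base R but not in the data.table result, that points at the grouping code rather than at `min` itself.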

@geneorama
Author

@arunsrinivasan I get :

> with(df, ave(DATES, LEVELS, FUN=min))
[1] "2010-01-01 GMT" "2010-01-01 GMT" "2010-01-01 GMT" "2010-01-01 GMT"
[5] "2010-01-01 GMT" "2010-01-06 GMT" "2010-01-06 GMT" "2010-01-06 GMT"
[9] "2010-01-06 GMT" "2010-01-06 GMT"

EDIT: After some offline conversation I understand that @arunsrinivasan was expecting that this result would also be shifted by 6 hours on my computer.

@arunsrinivasan
Member

@geneorama,

The function you're using, min, is from base R. To my knowledge, data.table has no implementation of it. And the issue is not reproducible on my system. That is why I asked whether you get the same result using ave.

It is not "absolutely" useless as it does help identify that the problem is elsewhere. I'll refrain from commenting on this post and leave Matt to it.

@geneorama
Author

So, after some back and forth (mostly offline) with @arunsrinivasan I gather the following:
@arunsrinivasan was not able to reproduce the problem initially (in version 1.9.3), but could reproduce the problem in 1.9.2.

I was not able to install version 1.9.3 to see if this would fix the problem.

@arunsrinivasan
Member

And from our email exchanges, your error message was:

vanilla --default-packages= -e "tools::buildVignettes(dir = '.', tangle = TRUE)"' had status 1
 ERROR
Error in texi2dvi(file = file, pdf = TRUE, clean = clean, quiet = quiet,  : 
  pdflatex is not available
Calls: <Anonymous> -> texi2pdf -> texi2dvi
Execution halted
Error: Command failed (1)

which to my knowledge is just a google search away: "install pdflatex windows".
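
Before installing anything, it may help to confirm from within R whether a pdflatex executable is visible at all. The following is a small sketch using base R's `Sys.which`; the variable names are illustrative.

```r
# Look for a pdflatex executable on the current PATH.
pdflatex_path <- Sys.which("pdflatex")
has_pdflatex  <- nzchar(unname(pdflatex_path))

if (has_pdflatex) {
  message("pdflatex found at: ", pdflatex_path)
} else {
  message("pdflatex not found on PATH; building PDF vignettes will likely fail")
}
```

An empty result usually means either that no LaTeX distribution is installed or that its directory is not on the PATH seen by R.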

@geneorama
Author

Oh, I found something else. How does one "install pdflatex windows"?

@arunsrinivasan
Member

Gene,
I'm growing tired of this exchange. Guiding you to install external software for normal building of R packages isn't my concern.

If this discussion continues in this manner, I'll have to close this as not-reproducible (on 1.9.3) and invalid.

@geneorama
Author

This has been very time consuming for me, and unproductive. So far I know nothing more than when I started because I do research before sending messages or posting questions.

I've only been asking you to explain what you mean by your answers. You said that install pdflatex windows was solved by a simple Google search. I didn't see that it was so simple. I found that I would have to install a large program and modify my path, which I don't want to do on a non-personal computer.

Can you tell me how something is normally labeled a bug on github?

This seems like a fair question since you have taken it upon yourself to have such an active role in the forum.

@arunsrinivasan
Member

Closing as not reproducible on 1.9.3. Will re-open if someone reports they're able to reproduce on 1.9.3.

@simonohanlon101

@geneorama as far as I can see you have received ample help and time. Arun responded to your initial report within 24 hours, tried to reproduce the problem on his system, followed up with you offline on multiple occasions and tested different software set-ups. The problem appears to be with your machine environment. This == your problem, so go ahead and help yourself. Working out how to install pdflatex is your concern, not the data.table developers'. If you don't own your computer, ask your IT support to help you.

FWIW I cannot reproduce this (i.e. I get the expected results) on my 1.9.2 system:

R version 3.1.0 (2014-04-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] geiger_2.0.3     ape_3.1-2        data.table_1.9.2

loaded via a namespace (and not attached):
 [1] coda_0.16-1       deSolve_1.10-8    digest_0.6.4      grid_3.1.0        lattice_0.20-29   MASS_7.3-33       mvtnorm_0.9-99992 nlme_3.1-117      plyr_1.8.1        Rcpp_0.11.2       reshape2_1.4      stringr_0.6.2     subplex_1.1-3    
[14] tools_3.1.0      

Gives me:

df
    LEVELS      DATES    MINDATE
 1:      a 2010-01-01 2010-01-01
 2:      a 2010-01-02 2010-01-01
 3:      a 2010-01-03 2010-01-01
 4:      a 2010-01-04 2010-01-01
 5:      a 2010-01-05 2010-01-01
 6:      b 2010-01-06       <NA>
 7:      b 2010-01-07       <NA>
 8:      b 2010-01-08       <NA>
 9:      b 2010-01-09       <NA>
10:      b 2010-01-10       <NA>

@simonohanlon101

On another note, I notice in your session info you have a lot of packages loaded. If it wasn't already suggested, I would load a fresh session and only load data.table and run the code.

@geneorama
Author

@simonohanlon101 I was able to reproduce this error with only data.table loaded, on a linux server and on a windows desktop. I only copied in my sessionInfo() later to show that I had also tested it on linux and that it's not an issue with the OS.

The real problem with upgrading to 1.9.3 is that it's a development version. I can't expect other people to be able to reproduce my work if they have to go through a complicated install. Am I supposed to tell people who want to know about vacant and abandoned buildings that they need to install a latex compiler and add it to their system path, so that devtools::install_github will work, so that they can reconstruct a time series?

I don't know why you can't reproduce the problem. I'm running it here on a third machine and getting the same issue.

> library(data.table)
data.table 1.9.2  For help type: help("data.table")
> ## Create the example
> df <- data.table(LEVELS = rep(letters[1:2], each=5),
+ DATES = as.POSIXct('2010-01-01', tz="gmt") +
+ seq(0, 86400*9, 86400))
> ## Create a new column with the minimum date
> df[LEVELS=="a", MINDATE := min(DATES), LEVELS]
> ## The results are not what I expect
> df
    LEVELS      DATES             MINDATE
 1:      a 2010-01-01 2009-12-31 18:00:00
 2:      a 2010-01-02 2009-12-31 18:00:00
 3:      a 2010-01-03 2009-12-31 18:00:00
 4:      a 2010-01-04 2009-12-31 18:00:00
 5:      a 2010-01-05 2009-12-31 18:00:00
 6:      b 2010-01-06                <NA>
 7:      b 2010-01-07                <NA>
 8:      b 2010-01-08                <NA>
 9:      b 2010-01-09                <NA>
10:      b 2010-01-10                <NA>
> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.2

loaded via a namespace (and not attached):
[1] plyr_1.8.1    Rcpp_0.11.2   reshape2_1.4  stringr_0.6.2 tools_3.0.2  
>

@simonohanlon101

@geneorama I can't reproduce the problem because my timezone is "gmt". The issue (in 1.9.2 not 1.9.3) seems to be caused by the timezone attribute being stripped from the Date class. @arunsrinivasan has tracked the source of the problem and is going to revert it once he is sure nothing else will break.
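
The effect of a missing time zone attribute can be illustrated in base R alone. The sketch below removes `tzone` by hand to imitate what is described here; it is not the data.table code path itself, and `"America/Chicago"` is only an assumed stand-in for the reporter's GMT-6 local zone.

```r
x <- as.POSIXct("2010-01-01", tz = "GMT")

# Imitate the loss of the time zone attribute.
y <- x
attr(y, "tzone") <- NULL

# The stored instant (seconds since the epoch) is unchanged ...
same_instant <- as.numeric(x) == as.numeric(y)

# ... but without "tzone" the value is rendered in the session's zone.
# An explicit tz is passed here so the result does not depend on the machine.
fmt <- "%Y-%m-%d %H:%M:%S"
shown_gmt   <- format(x, fmt, tz = "GMT")
shown_local <- format(y, fmt, tz = "America/Chicago")

c(shown_gmt, shown_local)
```

This is also why a machine whose local zone is GMT shows no difference: both renderings coincide there.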

@arunsrinivasan
Member

@simonohanlon101 the problem has already been fixed in 1.9.3 (as you've shown from 1.9.3 as well). I'm just going to confirm it.

@geneorama
Author

@simonohanlon101 That's exactly what I said in my original post. I spent a lot of time figuring out the problem.

@arunsrinivasan
Member

To wrap this post for good:

This bug is reproducible on 1.9.2 and was fixed in 1.9.3 (bug #36).
I downloaded the version before this commit (which is commit 1168 on github) and tested on OP's data and I get this:

> df
    LEVELS      DATES             MINDATE
 1:      a 2010-01-01 2010-01-01 01:00:00
 2:      a 2010-01-02 2010-01-01 01:00:00
 3:      a 2010-01-03 2010-01-01 01:00:00
 4:      a 2010-01-04 2010-01-01 01:00:00
 5:      a 2010-01-05 2010-01-01 01:00:00
 6:      b 2010-01-06                <NA>
 7:      b 2010-01-07                <NA>
 8:      b 2010-01-08                <NA>
 9:      b 2010-01-09                <NA>
10:      b 2010-01-10                <NA>

And then I tested it on the commit which fixed #36 (commit 1169) and got this:

> df
    LEVELS      DATES    MINDATE
 1:      a 2010-01-01 2010-01-01
 2:      a 2010-01-02 2010-01-01
 3:      a 2010-01-03 2010-01-01
 4:      a 2010-01-04 2010-01-01
 5:      a 2010-01-05 2010-01-01
 6:      b 2010-01-06       <NA>
 7:      b 2010-01-07       <NA>
 8:      b 2010-01-08       <NA>
 9:      b 2010-01-09       <NA>
10:      b 2010-01-10       <NA>

The "attribute" tzone was missing in the earlier one, which was fixed in 1169.
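
A quick way to see which situation a given installation is in is to inspect the attribute on both columns. The sketch below assumes the example data from the original report; restoring the attribute with `setattr` is shown only as one possible stopgap for an affected version, not as the fix that went into the package.

```r
library(data.table)

df <- data.table(LEVELS = rep(letters[1:2], each = 5),
                 DATES = as.POSIXct("2010-01-01", tz = "GMT") +
                     seq(0, 86400 * 9, 86400))
df[LEVELS == "a", MINDATE := min(DATES), by = LEVELS]

# Compare the time zone attribute of the source and the derived column.
attr(df$DATES, "tzone")
attr(df$MINDATE, "tzone")   # NULL would indicate the behaviour reported here

# Stopgap (an assumption, not the upstream fix): copy the attribute across
# by reference if it went missing.
if (is.null(attr(df$MINDATE, "tzone"))) {
  setattr(df$MINDATE, "tzone", attr(df$DATES, "tzone"))
}
```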

In summary, this has been fixed in 1.9.3. And there's nothing more to discuss.

@mattdowle
Member

@geneorama Hi. Just to mention a few things that might not have been clear ...

  1. Your initial report was excellent, detailed and much appreciated. We don't often get as much detail!
  2. We have only just moved to GitHub, last weekend! We're still getting used to it. In particular, issues on Windows were anticipated and I asked for feedback on datatable-help here. One person so far has responded to say they had no troubles at all on Windows. My sense is that your IT dept will let you install pdflatex / Rtools and that many people in big organisations do just that. After all, this isn't just data.table but many other packages that are on GitHub. Having said that, R-Forge used to compile and build the development version as a Windows .zip for us. Had we not moved, you would have been able to point to the R-Forge repo and just install from there. Obviously that was more convenient. However, the R-Forge build process was often unreliable, so the advantages of GitHub outweighed that and we moved. There is a free and uber reliable service that builds a Windows .zip (winbuilder) and we may be able to hook that up to Travis somehow. However, install_github() does work on Windows for many people at big organisations with IT depts.
  3. The latest news from the development version is on the new github page and I was careful to redirect the old NEWS file on R-Forge to the new one. Whenever a user hits any problem, they are supposed to check the README.md first (search it automatically, or read it through). We've moved NEWS to the README.md so everything is in one place on the front of the github project page. In this case, v1.9.3 indeed has a relevant item :

Using by columns with attributes (ex: factor, Date) in j did not retain the attributes, also in case of :=. This was partially a regression from an earlier fix (bug # 2531) due to recent changes for R3.1.0. Now fixed and clearer tests added. Thanks to Christophe Dervieux for reporting and to Adam B for reporting here on SO: http://stackoverflow.com/questions/22536586/by-seems-to-not-retain-attribute-of-date-type-columns-in-data-table-possibl. Closes # 5437.

The general idea is that you would have found this item in README.md first, considered it (just) plausible that v1.9.3 might fix the issue (it mentions Date as an example, which is close to POSIXct), and tried v1.9.3 before raising an issue. Or just tried the development version anyway without even reading README.md.
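
As a small aid for that check, a version comparison can be sketched in base R. The helper name `predates_fix` is hypothetical, and the cut-off of 1.9.3 is taken from this thread rather than from any other source.

```r
# Hypothetical helper: TRUE when a data.table version predates the fix,
# assuming (per this thread) that 1.9.3 is the first fixed version.
predates_fix <- function(version, fixed_in = "1.9.3") {
  package_version(version) < package_version(fixed_in)
}

predates_fix("1.9.2")
predates_fix("1.9.3")

# For the installed package, one would pass
# as.character(utils::packageVersion("data.table")).
```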

Why isn't v1.9.3 on CRAN yet? See paragraph at the top of README.md.

In my mind, the question now is: have you installed Rtools, which I believe includes pdflatex and more? Windows users at many big corporations have reported success with it. More information can be found online via the devtools package or by searching directly for "Rtools".

Matt

@geneorama
Author

@mattdowle
Matt,

Thank you for the reply, and sorry to bother (all) of you. I will respond in sequence.

  1. Thanks. I know that you try to be friendlier than r-help, but I still try to make good observations. I actually had more detail that I was hoping to get to in subsequent dialogue, but I took it out of the question to keep the message simple and clear.
  2. I generally love github, but after two years I'm still getting used to it. For example, I was still editing some early messages and didn't mean to post them as they were. I do have RTools installed, and install_github is working fine. In fact, inspired by you, I've been using it for my own little package, geneorama. I don't follow what you mean about Travis, but I could install 1.9.3 from source. I just don't want to install it on our production / research servers in the off chance that something unintended happens.
  3. I admit I have not thoroughly read the readme, I just noticed it recently. I usually focus on rechecking FAQ, which I think changes with each version. Many of the issues that you and your team deal with are quite mind bending, and I do my best to keep up.


    The SO question is interesting: http://stackoverflow.com/questions/22536586/by-seems-to-not-retain-attribute-of-date-type-columns-in-data-table-possibl and indeed relevant. I solved the problem by using your IDateTime / IDate format, but I can imagine that as.Date would work too. In any event the problem's solved for me, I was just trying to help you guys debug any POSIX issues. Date / Time issues are surprisingly annoying and complicated.

Speaking of annoying and complicated, I don't know why I'm getting the pdflatex error, but I don't really care. Latex has gone from my "on watch" list to "dead to me". Soon I'll set up another linux instance for this sort of testing; I just don't have access to some of my normal test environments right now in my new position.

I think it's great that 1.9.3 isn't on CRAN yet, it's wonderful that you have a development cycle and a dev version. However, I didn't understand the response of "it's not a bug in 1.9.3" to mean "we've fixed this and it will be coming soon to a theater near you". I heard (in order) "try ave, this isn't a bug, I can't reproduce this, I don't get the same result on my Mac".

Good luck in LA. I wish I could say that I'll see you there... I hope you can stop reading github posts and work on your workshop. As always, I deeply appreciate everyone's hard work in making data.table the best thing that's happened to R since RStudio.

@arunsrinivasan
Member

After our email exchange, you've made several edits to your post, adding sessionInfo, and edited other posts so that they lose continuity.

And you write:

However, I didn't understand the response of "it's not a bug in 1.9.3" to mean "we've fixed this and it will be coming soon to a theater near you"

which makes little sense, because from our email exchange:

Great. So you're using 1.9.2. This seems to have been fixed in 1.9.3 (current devel), although I can't see what. Please try updating to 1.9.3 with devtools::install_github("data.table", "Rdatatable") and see if it fixes your case.

You write:

This has been very time consuming for me, and unproductive. So far I know nothing more than when I started because I do research before sending messages or posting questions.

It's very time consuming for everyone. And your claim that you know nothing more than when you started is plain wrong, given the email message quoted above.

@geneorama
Author

@arunsrinivasan Yes, I did edit my posts to add information and focus on the issue. I wish that this hadn't been so time consuming.
