Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Column attribute round-tripping is too strict for datetimes #598

Open
yjunechoe opened this issue Feb 7, 2025 · 4 comments
Open

Column attribute round-tripping is too strict for datetimes #598

yjunechoe opened this issue Feb 7, 2025 · 4 comments
Labels
duckdb 🦆 Issues where work in the duckb package is needed

Comments

@yjunechoe
Copy link

A date time value without a set timezone (ex: value from Sys.time()) gains a timezone attribute in the duckdb conversion, thus failing the check for attribute round-tripping:

library(duckplyr)

df <- data.frame(datetime = Sys.time())
attributes(df$datetime)
#> $class
#> [1] "POSIXct" "POSIXt"

as_duckdb_tibble(df)
#> Error in `duckdb_rel_from_df()`:
#> ! Attributes are lost during conversion. Affected column: `datetime`.

lubridate::tz(df$datetime) <- "UTC"
attributes(df$datetime)
#> $class
#> [1] "POSIXct" "POSIXt" 
#> 
#> $tzone
#> [1] "UTC"

as_duckdb_tibble(df)
#> # A duckplyr data frame: 1 variable
#>   datetime           
#>   <dttm>             
#> 1 2025-02-07 09:37:25

It seems to me that the check !identical(df_attrib, roundtrip_attrib) is too strict. I'm actually surprised by the identical() requirement here, especially when the error message is about attributes being dropped, which is not the case in the reprex.

for (i in seq_along(df)) {
df_attrib <- attributes(df[[i]])
roundtrip_attrib <- attributes(roundtrip[[i]])
if (!identical(df_attrib, roundtrip_attrib)) {
cli::cli_abort("Attributes are lost during conversion. Affected column: {.var {names(df)[[i]]}}.", call = call)
}
}


Relatedly -- and I can open a separate issue about this if needed -- there are also cases where duckdb can shuffle the order of attributes without changing their contents, and I wonder whether the check should let those pass (or raise a warning vs. error).

@krlmlr
Copy link
Member

krlmlr commented Feb 7, 2025

Thanks. Yes, it's unfortunate. I'd like to focus efforts on supporting times with time zones in duckdb so that we don't need this check.

Please open a separate issue for the shuffling you mentioned, would it belong in the duckdb-r repo?

@krlmlr krlmlr added the duckdb 🦆 Issues where work in the duckb package is needed label Feb 7, 2025
@yjunechoe
Copy link
Author

yjunechoe commented Feb 7, 2025

Please open a separate issue for the shuffling you mentioned, would it belong in the duckdb-r repo?

Sorry my mistake! The shuffling actually comes from dplyr's slice() (regardless of whether the method is overwritten or not): I've opened an issue there tidyverse/dplyr#7655

library(duckplyr)
df <- data.frame(datetime = lubridate::now(tzone = "UTC"))
attributes(df$datetime)
#> $class
#> [1] "POSIXct" "POSIXt" 
#> 
#> $tzone
#> [1] "UTC"

df_sliced <- df |> slice()
attributes(df_sliced$datetime)
#> $tzone
#> [1] "UTC"
#> 
#> $class
#> [1] "POSIXct" "POSIXt"

df_duckdb <- as_duckdb_tibble(df_sliced)
#> Error in `duckdb_rel_from_df()`:
#> ! Attributes are lost during conversion. Affected column:
#>   `datetime`.

@krlmlr
Copy link
Member

krlmlr commented Feb 10, 2025

Thanks, I see. Should this be setequal(names(...), names(...)) && identical(attr1, attr2[names(attr1)]) ? Would you like to PR?

@yjunechoe
Copy link
Author

Sure! And just to be precise, in the snippet of code I linked to above, the local variables for my reprex are:

df_attrib
#> $class
#> [1] "POSIXct" "POSIXt" 

roundtrip_attrib
#> $class
#> [1] "POSIXct" "POSIXt" 
#> 
#> $tzone
#> [1] "UTC"

So setequal() on names would still be too strict.

If the check should simply test whether any of the original attributes have been dropped or changed (i.e., ignore any new attributes that may have been created), one implementation could be:

!(
  all(names(df_attrib) %in% names(roundtrip_attrib)) &&      # all attribute fields preserved
    identical(roundtrip_attrib[names(df_attrib)], df_attrib) # values of fields are identical
)

But even with the above there's one thorny case of, e.g., as.POSIXct(character(0)) where "unspecified timezone" is translated as tzone = "":

df <- data.frame(datetime = as.POSIXct(character(0)))
attributes(df$datetime)
#> $class
#> [1] "POSIXct" "POSIXt" 
#> 
#> $tzone
#> [1] ""

as_duckdb_tibble(df)
#> Error in `duckdb_rel_from_df()` at duckplyr/R/duckplyr_df.R:20:5:
#> ! Attributes are lost during conversion. Affected column: `datetime`.

So perhaps additional work is required to either treat tzone="" as NULL, or more broadly, don't test the value of zero-length attributes (and only whether field is preserved).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
duckdb 🦆 Issues where work in the duckb package is needed
Projects
None yet
Development

No branches or pull requests

2 participants