full_join generates <NA> entries when joining character vectors with different encodings #2271

Closed
jarauh opened this issue Nov 29, 2016 · 2 comments
Labels
bug an unexpected problem or unintended behavior

jarauh commented Nov 29, 2016

This is another facet of the well-known encoding problem (e.g. #1885). Sorry if the example is more complicated than necessary.

library(dplyr)
x <- "fa\xE7ile"
xx <- iconv(x, "latin1", "UTF-8")

x == xx  # TRUE

left <- matrix(c(x, "facile", "1", "2"), ncol = 2)
colnames(left) <- c("c1", "c2")
left <- data.frame(left, stringsAsFactors = FALSE)
right <- matrix(c(xx, "facile", "A", "B"), ncol = 2)
colnames(right) <- c("c1", "c3")
right <- data.frame(right, stringsAsFactors = FALSE)

full_join(left, right, by = "c1")

Output:

      c1   c2   c3
1 façile    1 <NA>
2 facile    2    B
3   <NA> <NA>    A

Note the last row: it has <NA> in c1, the column used for joining, even though neither input contains a missing value there.
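The mismatch can be inspected with base R alone. The sketch below builds the Latin-1 string from raw bytes (an assumption made here so the result does not depend on the session locale) and compares the declared encodings and the underlying bytes. It only shows that the two strings are equal under `==` yet stored differently; it does not establish how the join itself compares them.

```r
# Build "fa\xE7ile" from raw bytes and mark it as Latin-1 explicitly,
# so the example behaves the same in any locale.
x <- rawToChar(as.raw(c(0x66, 0x61, 0xE7, 0x69, 0x6C, 0x65)))
Encoding(x) <- "latin1"
xx <- iconv(x, from = "latin1", to = "UTF-8")

Encoding(c(x, xx))   # declared encodings: "latin1" "UTF-8"
x == xx              # TRUE: R translates the strings before comparing

# The stored bytes differ: one byte for the cedilla in Latin-1, two in UTF-8.
charToRaw(x)
charToRaw(xx)
identical(charToRaw(x), charToRaw(xx))   # FALSE
```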

krlmlr (Member) commented Nov 29, 2016

Thanks. I think you should use only UTF-8 for column data; dplyr will be more careful about that in the future.
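One way to follow this advice is to convert every character column to UTF-8 before joining. The sketch below is a minimal illustration, not something confirmed in this thread: `to_utf8_cols()` is a hypothetical helper, and base `merge(..., all = TRUE)` stands in for `full_join()` so the block runs without dplyr. Whether this avoids the problem in the dplyr version from the report was not verified here.

```r
# Hypothetical helper: re-encode all character columns of a data frame as UTF-8.
to_utf8_cols <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  for (nm in names(df)[is_chr]) {
    df[[nm]] <- enc2utf8(df[[nm]])
  }
  df
}

# Same data as in the report, with the Latin-1 string marked explicitly.
x <- rawToChar(as.raw(c(0x66, 0x61, 0xE7, 0x69, 0x6C, 0x65)))
Encoding(x) <- "latin1"
xx <- iconv(x, from = "latin1", to = "UTF-8")

left  <- data.frame(c1 = c(x, "facile"),  c2 = c("1", "2"), stringsAsFactors = FALSE)
right <- data.frame(c1 = c(xx, "facile"), c3 = c("A", "B"), stringsAsFactors = FALSE)

left_utf8  <- to_utf8_cols(left)
right_utf8 <- to_utf8_cols(right)

# Base-R stand-in for full_join(left_utf8, right_utf8, by = "c1").
joined <- merge(left_utf8, right_utf8, by = "c1", all = TRUE)
joined
```

After the conversion both key columns hold the same bytes for the accented string, so the join no longer has to reconcile two encodings.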

krlmlr added the "data frame" and "bug" (an unexpected problem or unintended behavior) labels on Nov 29, 2016

jarauh (Author) commented Nov 29, 2016

@krlmlr I agree. I can work around this, but I wanted to report it anyway. Even granting that comparing strings with different encodings is problematic, having new entries appear in the by-column of a full_join seems like a separate bug.

krlmlr added this to the data frame 1 milestone on Feb 20, 2017
lock bot locked this as resolved and limited conversation to collaborators on Jun 8, 2018