Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pivot_longer combined with names_sep = "|" pipe symbol gives unexpected results #1503

Closed
BdR76 opened this issue Jun 17, 2023 · 3 comments
Closed

Comments

@BdR76
Copy link

BdR76 commented Jun 17, 2023

I was working on pivoting Excel sheets with labdata and ran into an issue with pivot_longer, also see this question on stackoverflow.

When using pivot_longer you can give a names_sep parameter to split the header names into muliple columns when pivoting the data. However when I use names_sep = "|" it gives unexpected results compared to when using other separator characters. I suspect it has to do with the "names_sep" parameter being interpreted as a regular expression instead of a separator character. See example code below.

# insert reprex here
library(tidyverse)
library(readxl)
library(dplyr)
library(tidyr)

#xls_data <- read_excel(path = "labdata_exa2final.xlsx", sheet = "Labdata")
## TEST use dput() to print read_excel() object as a hardcoded data structure for testing purposes
#dput(xls_data_test)
xls_data <- structure(
  list(
    ...1 = c(NA, NA, "LabId", "9", "37", "38", "41"),
    ...2 = c(NA, NA, "LabDescr", "HbA1c", "HDL-Chol", "LDL-Chol", "Glucose"),
    ...3 = c("Tube", "TubePos", "WeightFact", "1,00", "1,00", "1,00", "1,00"),
    EXA_2665 = c("87019054", "C12", NA, "1.784", "0.68400000000000005", "2.4380000000000002", "1.3660000000000001"),
    EXA_2707 = c("87065976", "D01", NA, "0.41899999999999998", "2.1720000000000002", "1.9950000000000001", "1.47"),
    EXA_2971 = c("87093024", "D02", NA, "1.0449999999999999", "1.954", "0.80600000000000005", "0.97799999999999998")
    ),
    class = c("tbl_df", "tbl", "data.frame"),
    row.names = c(NA, -7L)
  )

library(unheadr)
# UNEXPECTED OUTPUT when using "|" as separator to split/unsplit the header values
xls_data_1 <- xls_data |>
  mash_colnames(n_name_rows = 3, sep = "|") |>
  pivot_longer(-c(1:3), names_sep = "|", 
               names_to = c(colnames(xls_data)[3], unlist(xls_data[1:3, 3]))) |>
  dplyr::rename(LabId = 1, LabDescr = 2, WeightFact2 = 3)
  
# using ";" or "=" or any other separator works as expected
xls_data_2 <- xls_data |>
  mash_colnames(n_name_rows = 3, sep = ";") |>
  pivot_longer(-c(1:3), names_sep = ";", 
               names_to = c(colnames(xls_data)[3], unlist(xls_data[1:3, 3]))) |>
  dplyr::rename(LabId = 1, LabDescr = 2, WeightFact2 = 3)

In the resulting dataframe the columns Tube,TubePos, WeightFact contain the header names split on individual characters, so character for character, see xls_data_1 below:

LabId LabDescr WeightFact2 ...3 Tube TubePos WeightFact value
9 HbA1c 1,00 E X A 1.784
9 HbA1c 1,00 E X A 0.41899999999999998
9 HbA1c 1,00 E X A 1.0449999999999999
37 HDL-Chol 1,00 E X A 0.68400000000000005
37 HDL-Chol 1,00 E X A 2.1720000000000002
37 HDL-Chol 1,00 E X A 1.954
38 LDL-Chol 1,00 E X A 2.4380000000000002
38 LDL-Chol 1,00 E X A 1.9950000000000001
38 LDL-Chol 1,00 E X A 0.80600000000000005
41 Glucose 1,00 E X A 1.3660000000000001
41 Glucose 1,00 E X A 1.47
41 Glucose 1,00 E X A 0.97799999999999998

This only seems to happen when using the pipe symbol |, when I use other separators like comma , or semicolon ; then it works fine and the result is as expected, see xls_data_2 below:

LabId LabDescr WeightFact2 ...3 Tube TubePos WeightFact value
9 HbA1c 1,00 EXA_2665 87019054 C12 1.784
9 HbA1c 1,00 EXA_2707 87065976 D01 0.41899999999999998
9 HbA1c 1,00 EXA_2971 87093024 D02 1.0449999999999999
37 HDL-Chol 1,00 EXA_2665 87019054 C12 0.68400000000000005
37 HDL-Chol 1,00 EXA_2707 87065976 D01 2.1720000000000002
37 HDL-Chol 1,00 EXA_2971 87093024 D02 1.954
38 LDL-Chol 1,00 EXA_2665 87019054 C12 2.4380000000000002
38 LDL-Chol 1,00 EXA_2707 87065976 D01 1.9950000000000001
38 LDL-Chol 1,00 EXA_2971 87093024 D02 0.80600000000000005
41 Glucose 1,00 EXA_2665 87019054 C12 1.3660000000000001
41 Glucose 1,00 EXA_2707 87065976 D01 1.47
41 Glucose 1,00 EXA_2971 87093024 D02 0.97799999999999998

Btw it could be that pivot_wider also has this same issue with names_sep I haven't checked. And here's my system info should it be needed:

R version 4.3.0 (2023-04-21 ucrt)
RStudio 2023.03.0 Build 386
tidyverse 2.0.0, tidyr 1.3.0

@BdR76 BdR76 changed the title pivot_longer combined with names_sep = "|" pipe symbol gives unexpected results pivot_longer combined with names_sep = "|" pipe symbol gives unexpected results Jun 17, 2023
@DavisVaughan
Copy link
Member

DavisVaughan commented Jun 20, 2023

Minimum reprex

tidyr:::str_split_n(c("foo;bar", "a;b"), ";")
#> [[1]]
#> [1] "foo" "bar"
#> 
#> [[2]]
#> [1] "a" "b"

tidyr:::str_split_n(c("foo|bar", "a|b"), "|")
#> [[1]]
#> [1] ""  "f" "o" "o" "|" "b" "a" "r"
#> 
#> [[2]]
#> [1] ""  "a" "|" "b"

# escaped backslash
tidyr:::str_split_n(c("foo|bar", "a|b"), "\\|")
#> [[1]]
#> [1] "foo" "bar"
#> 
#> [[2]]
#> [1] "a" "b"

So if you use "\\|" as an escaped backslash it will work, not sure if we can come up with something better

Somewhat related to #1417 (comment) I think

@BdR76
Copy link
Author

BdR76 commented Jun 28, 2023

If I understand correctly, the pivot_longer uses the str_split_n function somewhere internally? Ok but from an end-user perspective it's not obvious from the parameter name names_sep that it can be interpreted as a regex when you want to use | as a separator. Especially in combination with other functions like mash_colnames which have functionally similar parameters called sep. Also, I highly doubt there even is a real-life use-case where someone would want to use "|" as a regex in this context.

So imho pivot_longer should ideally;

  • rename the parameter names_sep to something like names_split_n
  • or add an extra separate parameter for regex, so for example names_regex
  • or make an exception and just interpret names_sep = "|" as names_sep = "\\|"

EDIT: Even if there is some real-life edge-case where you would want to use "|" as regex, you should probably use something like .{1} instead. So I would argue that a single character for names_sep should always be interpreted as a separator, more than one character could be regex.

@hadley
Copy link
Member

hadley commented Nov 1, 2023

I think this is pretty clearly documented:

names_sep takes the same specification as separate() and can either be a numeric vector (specifying positions to break on), or a single string (specifying a regular expression to split on).

It's pretty hard to change this sort of behaviour without affecting a lot of existing code and given the seeming rarity of it being a problem, I unfortunately don't think it's worth the effort. (But this is certainly something I imagine we would change if at some point in the distant future we take another stab at these functions.)

@hadley hadley closed this as completed Nov 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants