[Feature Request]: Support converting Nominal-Text -> Nominal/Ordinal #1633

Closed
JorisGoosen opened this issue Feb 23, 2022 · 8 comments

@JorisGoosen
Contributor

JorisGoosen commented Feb 23, 2022

This has come up often and we added some feedback for users on why a column can't be converted to another type at jasp-stats/jasp-desktop@769383d for our internal issue https://github.com/jasp-stats/INTERNAL-jasp/issues/977

#1258 is perhaps related as well as #1581

But now also @EJWagenmakers was asking me about it and I think it shouldn't be so very hard to just do the following:

Support Nominal-Text -> Nominal/Ordinal
Here we would drop the original strings (as found in, for instance, a CSV file) and assign an integer value based on the label order at the moment of conversion. This will "lose" some information, but that is not so bad.
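
A minimal sketch of that idea in R, assuming the "order at the moment of conversion" is simply the order of first appearance in the column. The column `x` and the variable names are made up for illustration and are not JASP code:

```r
# Hypothetical Nominal-Text column, e.g. as read from a CSV file
x <- c("medium", "low", "high", "low", "medium")

# Label order at the moment of conversion (assumed: order of first appearance)
labels <- unique(x)

# Integer codes 1..length(labels); the strings survive only as labels
codes <- match(x, labels)

print(data.frame(text = x, code = codes), row.names = FALSE)
```

With this scheme the integer carries no meaning beyond the position of its label, which is why a later conversion to a scale type would discard the only remaining information.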

Converting to scalar would then still fail, because then even the labels would be lost and I suppose that is not what one wants?
On the other hand, a messagebox asking the user whether they are ok with losing the data could also be done I suppose.

And also this: https://github.com/jasp-stats/INTERNAL-jasp/issues/1397

@JoKeyser
Member

Sorry if I add just noise, as I'm out of my comfort zone: But maybe there is a better way than a conversion "based on the order at the moment of conversion"? It seems that this may lead to confusing effects if a user changes/adds/removes strings?
Maybe there is some "canonical" conversion based on the string's actual binary representation? If so, would that not be more stable?

@vandenman
Contributor

I'd do whatever R does when it converts character to factor:

set.seed(123)
c <- sample(letters, 5)
f <- factor(c)
# alphabetical
print(data.frame(
  character = c,
  factor    = f,
  integer   = as.integer(f)
), row.names = FALSE)
#>  character factor integer
#>          o      o       4
#>          s      s       5
#>          n      n       3
#>          c      c       1
#>          j      j       2

c <- c("汉", "字", letters[seq(3, 1, -1)])
f <- factor(c)
# no idea what determines the order for the chinese characters
print(data.frame(
  character = c,
  factor    = f,
  integer   = as.integer(f)
), row.names = FALSE)
#>  character factor integer
#>         汉     汉       5
#>         字     字       4
#>          c      c       3
#>          b      b       2
#>          a      a       1

> Maybe there is some "canonical" conversion based on the string's actual binary representation

Perhaps that's what R does? I think it just sorts the unique values and uses that to assign integer values.
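
A quick check of that guess, as a sketch: with default arguments, the levels of `factor()` should match the sorted unique values, and the integer codes should be positions in that sorted vector. The vector `x` below just repeats the letters from the first example.

```r
x <- c("o", "s", "n", "c", "j")
f <- factor(x)

# Default levels are the sorted unique values of x
identical(levels(f), sort(unique(x)))

# Integer codes are the positions within that sorted vector
identical(as.integer(f), match(x, sort(unique(x))))
```

For character vectors the sort order depends on the collation rules of the current locale, which is presumably also what decides where the Chinese characters end up.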

@JoKeyser
Member

JoKeyser commented Feb 23, 2022

@vandenman well I meant a conversion based on the actual string "value", not its ordered position.
However, I think I realize that this idea is limited by the fact that strings could be arbitrarily long, so there is no feasible conversion.
And my issue with using factor() is purely theoretical. Probably it's best to use something simple like that, and see if any real-world problems arise.

@JorisGoosen
Contributor Author

> Sorry if I add just noise, as I'm out of my comfort zone: But maybe there is a better way than a conversion "based on the order at the moment of conversion"? It seems that this may lead to confusing effects if a user changes/adds/removes strings? Maybe there is some "canonical" conversion based on the string's actual binary representation? If so, would that not be more stable?

Well, actually this is how we already use them inside analyses: if you change the order of the labels in the variableswindow, that changes the order in the resulting factor that is fed to the analysis, and thus anything in R that depends on it.

To make that a bit more clear, when we feed the nominal-text column to R now it is in fact converted into a factor. Running from 1 to columnLength in the exact order of the labels as seen in the variableswindow.

So just using that when converting to nominal and ordinal should be alright. And it allows for users deciding the order of their scales and things like that which I assume they want. (And we wouldn't get if we just order it based on the strings)
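
Roughly, in R terms, and only as a sketch: `labelOrder` stands in for whatever order the user has arranged in the variableswindow, and both it and `x` are hypothetical here.

```r
# Hypothetical Nominal-Text column
x <- c("low", "high", "medium", "low")

# Hypothetical label order, as arranged by the user in the variableswindow
labelOrder <- c("low", "medium", "high")

# Nominal: codes follow the user-defined label order, not string sorting
nominal <- factor(x, levels = labelOrder)
as.integer(nominal)

# Ordinal: same codes, but the order of the levels is meaningful
ordinal <- factor(x, levels = labelOrder, ordered = TRUE)
ordinal < "high"
```

Sorting the strings instead would give the levels `high low medium`, which is not a sensible scale order.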

@shun2wang
Contributor

> no idea what determines the order for the chinese characters

That's OK. The order of Chinese characters is usually not considered, because in quantitative data analysis practice Chinese characters are generally used as labels and not treated as values. If ordering is to be considered, I would suggest ordering by value.

@vandenman
Contributor

> To make that a bit more clear, when we feed the nominal-text column to R now it is in fact converted into a factor. Running from 1 to columnLength in the exact order of the labels as seen in the variableswindow.

I think this makes sense.

> So just using that when converting to nominal and ordinal should be alright. And it allows for users deciding the order of their scales and things like that which I assume they want. (And we wouldn't get if we just order it based on the strings)

Also makes sense. There is one edge case though that I would check for. In R, this situation can occur:

f <- factor(as.character(1:11))
f # order from sorting 1:11 as strings
#>  [1] 1  2  3  4  5  6  7  8  9  10 11
#> Levels: 1 10 11 2 3 4 5 6 7 8 9
fSorted <- factor(f, levels = sort(as.numeric(levels(f))))
fSorted # order from sorting 1:11 as numbers
#>  [1] 1  2  3  4  5  6  7  8  9  10 11
#> Levels: 1 2 3 4 5 6 7 8 9 10 11

where the default levels (first print) have order 1 10 11 2 ... because of string sorting. The second ordering 1 2 3 4 ... is probably closer to what people expect.

Also, I'd imagine this is just the default conversion from Nominal-Text to Nominal/Ordinal. Afterward, people should be able to change the order and labels in any way they want.
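
One way such a default could handle the edge case, as a sketch only: `defaultLabelOrder()` is a hypothetical helper, not an existing JASP or R function. It orders the labels numerically when every label parses as a number and otherwise keeps the order of first appearance.

```r
# Hypothetical helper: pick a default label order for a text column
defaultLabelOrder <- function(x) {
  labels   <- unique(x)
  asNumber <- suppressWarnings(as.numeric(labels))
  if (!anyNA(asNumber)) {
    labels[order(asNumber)]  # all labels are numbers: sort by numeric value
  } else {
    labels                   # otherwise: keep order of first appearance
  }
}

x <- as.character(c(10, 2, 1, 11, 2))
f <- factor(x, levels = defaultLabelOrder(x))
levels(f)
as.integer(f)
```

The user could still reorder or relabel afterward; this only changes the starting point.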

> That's OK. The order of Chinese characters is usually not considered, because in quantitative data analysis practice Chinese characters are generally used as labels and not treated as values. If ordering is to be considered, I would suggest ordering by value.

Sure, but the issue is that we need a consistent way to assign values to text. That text may consist of Chinese characters, Hebrew symbols, or who knows what kind of characters. However, initially, there is no value we can use to order by.

@tomtomme
Member

@JorisGoosen
Nominal Text was axed, correct?
So this issue is now solved?

@JorisGoosen
Contributor Author

Indeed!
