The EU manifesto example is incorrect, because Hungarian text, for example, is not in ISO-8859-1. https://readtext.quanteda.io/articles/readtext_vignette.html#reading-one-or-more-text-files
However, specifying the encoding manually is tedious. Why not do something like this? stri_enc_detect() makes a good guess.
    path_data <- system.file("extdata/", package = "readtext")
    for (f in list.files(paste0(path_data, "/txt/EU_manifestos/"), full.names = TRUE)) {
      print(f)
      enc <- stringi::stri_enc_detect(readBin(file(f, 'rb'), character()))
      print(enc[[1]][1:2,])
    }
    [1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_de_PSE.txt"
        Encoding Language Confidence
    1 ISO-8859-1       de       0.80
    2 ISO-8859-9       tr       0.24
    [1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_de_V.txt"
        Encoding Language Confidence
    1 ISO-8859-1       de       0.83
    2 ISO-8859-9       tr       0.26
    [1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_en_PSE.txt"
        Encoding Language Confidence
    1 ISO-8859-1       en       0.75
    2 ISO-8859-2       ro       0.21
    [1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_en_V.txt"
        Encoding Language Confidence
    1 ISO-8859-1       en       0.75
    2 ISO-8859-2       ro       0.21
    [1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_es_PSE.txt"
        Encoding Language Confidence
    1 ISO-8859-1       es       0.91
    2 ISO-8859-2       ro       0.35
    [1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_es_V.txt"
        Encoding Language Confidence
    1 ISO-8859-1       es       0.88
    2 ISO-8859-2       ro       0.36
    [1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_fi_V.txt"
        Encoding Language Confidence
    1 ISO-8859-1       sv       0.20
    2 ISO-8859-9       tr       0.17
    [1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_fr_PSE.txt"
        Encoding Language Confidence
    1 ISO-8859-1       fr       0.94
    2 ISO-8859-2       ro       0.35
    [1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_fr_V.txt"
        Encoding Language Confidence
    1 ISO-8859-1       fr       0.92
    2 ISO-8859-2       ro       0.37
    [1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_gr_V.txt"
        Encoding Language Confidence
    1 ISO-8859-7       el       0.74
    2   UTF-16BE                0.10
    [1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_hu_V.txt"
        Encoding Language Confidence
    1 ISO-8859-2       hu       0.53
    2 ISO-8859-1       en       0.16
    [1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_it_PSE.txt"
        Encoding Language Confidence
    1 ISO-8859-1       it       0.83
    2 ISO-8859-2       ro       0.43
    [1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_lv_V.txt"
    Error in enc[[1]] : subscript out of bounds
    In addition: There were 13 warnings (use warnings() to see them)
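The loop above reads each file as a character string through a `file()` connection that is never closed, which may be what triggers the warnings and the `subscript out of bounds` error on the Latvian file. The following is a minimal sketch of a more defensive variant, assuming that the raw bytes of each file are passed to `stri_enc_detect()` instead. The helper name `detect_encoding()` is hypothetical and not part of readtext or stringi.

```r
library(stringi)

# Hypothetical helper (not part of readtext): guess a file's encoding
# from its raw bytes rather than from a character string.
detect_encoding <- function(path) {
  # Passing a file name lets readBin() open and close the connection itself.
  bytes <- readBin(path, what = "raw", n = file.size(path))
  if (length(bytes) == 0L) {
    return(NA_character_)  # empty file: nothing to detect
  }
  guesses <- stri_enc_detect(bytes)[[1]]
  # Guesses are ordered by decreasing confidence; keep the top one.
  as.character(guesses$Encoding[1])
}

# Self-contained check on a temporary file written as Latin-1 bytes.
txt <- "Die B\u00fcrger w\u00e4hlen das Europ\u00e4ische Parlament f\u00fcr f\u00fcnf Jahre."
tmp <- tempfile(fileext = ".txt")
writeBin(iconv(txt, from = "UTF-8", to = "latin1", toRaw = TRUE)[[1]], tmp)
print(detect_encoding(tmp))
```

Applied to the manifesto files with something like `vapply(files, detect_encoding, character(1))`, the per-file guesses could presumably be passed on to `readtext()` through its `encoding` argument. Low-confidence guesses, such as the Finnish file above, would still need a manual check.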
koheiw
amatsuo