Treatment of &nbsp; is confusing #284
Comments
I think this "problem" is linked to RStudio. In the image below, the RStudio console displays "Alabama 1", but when inspecting it shows "Alabama&nbsp;1". Only with the
Reprex:
library(rvest)
#> Loading required package: xml2
library(xml2)
library(textutils)
url <- "https://en.m.wikipedia.org/wiki/List_of_current_members_of_the_United_States_House_of_Representatives"
file <- xml2::read_html(url)
tables <- rvest::html_nodes(file, "table")
reps <- rvest::html_table(tables[6])
reps <- as.data.frame(reps)[1,1:3]
reps$District
#> [1] "Alabama 1"
textutils::HTMLencode(reps$District)
#> [1] "Alabama&nbsp;1"
Created on 2020-08-02 by the reprex package (v0.3.0)
I don't think there's much rvest can do about this, because spaces and non-breaking spaces are different characters, and rvest never actually sees the &nbsp;.
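To make the "different characters" point concrete, here is a minimal base R sketch. The value `district` is a hand-built stand-in for the scraped cell (an assumption, not the live Wikipedia data), with the non-breaking space written as the escape `\u00a0`.

```r
# Hypothetical stand-in for the scraped value: "Alabama", U+00A0, "1"
district <- "Alabama\u00a01"
plain    <- "Alabama 1"      # same text with an ordinary space (U+0020)

# The two strings print alike but are not equal
same <- identical(district, plain)

# Code points of the two space characters
nbsp_code  <- utf8ToInt("\u00a0")  # 160
space_code <- utf8ToInt(" ")       # 32

# Literal (non-regex) check for a non-breaking space
has_nbsp <- grepl("\u00a0", district, fixed = TRUE)
```

A comparison or join against "Alabama 1" typed at the keyboard therefore fails silently, even though both values look identical in the console.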
Alternatively, could combine with #175 and have some new
Also see regex used for trimming within
Thanks so much for your work on this Hadley.
On Tue, Jan 5, 2021 at 11:44 AM Hadley Wickham wrote:
Closed #284 via #296.
I am using rvest to scrape data from Wikipedia. It appears that in tables, Wikipedia sometimes uses &nbsp; instead of " ". This leads to very strange behavior when the table is imported into R via rvest. I posted a reproducible example on Stack Overflow here, but will copy it here as well.
Part 1:
This error is very difficult to explain to people online, because the common way to transmit data / code on e.g. Stack Overflow is to use dput. But here, if you use dput, the problem disappears.
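One possible way to share such a value so the hidden character survives copy and paste is to turn non-ASCII characters into `\u` escapes before posting. The sketch below uses only base R; `escape_non_ascii()` is a hypothetical helper written for illustration, not part of rvest, and it assumes code points no larger than U+FFFF.

```r
# Hypothetical helper: rewrite every non-ASCII character as a \uXXXX escape
# so the result can be pasted back into R as a string literal.
# Assumes non-NA UTF-8 input with code points <= U+FFFF.
escape_non_ascii <- function(x) {
  vapply(x, function(s) {
    cps <- utf8ToInt(s)
    pieces <- ifelse(
      cps > 127L,
      sprintf("\\u%04x", cps),          # escaped form for non-ASCII
      intToUtf8(cps, multiple = TRUE)   # ASCII characters kept as-is
    )
    paste(pieces, collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

escaped <- escape_non_ascii("Alabama\u00a01")
cat(escaped, "\n")
```

Posting the escaped text rather than raw `dput()` output keeps the non-breaking space visible to whoever pastes it.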
I am not sure what the right answer here is. But the current default behavior, in what I assume is a common use case of rvest (scraping a table from Wikipedia), leads to buggy behavior that is very hard to track down when dealing with the data in R. My first thought is that rvest should, by default, change all &nbsp;s to " " and give users the option to disable this.
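Until something like that exists, the conversion can be done by hand after scraping. A minimal sketch in base R follows; `replace_nbsp()` is a hypothetical helper, and the `reps` data frame is a hand-built stand-in for the scraped table rather than real output.

```r
# Hypothetical helper: swap non-breaking spaces (U+00A0) for ordinary spaces
# in every character column of a data frame.
replace_nbsp <- function(df, replacement = " ") {
  is_chr <- vapply(df, is.character, logical(1))
  if (any(is_chr)) {
    df[is_chr] <- lapply(
      df[is_chr], gsub,
      pattern = "\u00a0", replacement = replacement, fixed = TRUE
    )
  }
  df
}

# Stand-in for one row of the scraped table (assumed values)
reps <- data.frame(District = "Alabama\u00a01", Seat = 1L,
                   stringsAsFactors = FALSE)
clean <- replace_nbsp(reps)
matches <- clean$District == "Alabama 1"
```

Passing a different `replacement` (or skipping the call) plays the role of the opt-out suggested above.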