Treatment of &nbsp; is confusing #284

Closed
arilamstein opened this issue Jul 28, 2020 · 5 comments · Fixed by #296

arilamstein commented Jul 28, 2020

I am using rvest to scrape data from Wikipedia. It appears that in tables, Wikipedia sometimes uses &nbsp; instead of " ". This leads to very strange behavior when the table is imported into R via rvest.

I posted a reproducible example on Stack Overflow, and will copy it here as well.

Part 1:

library(rvest)
library(xml2)
url    = "https://en.m.wikipedia.org/wiki/List_of_current_members_of_the_United_States_House_of_Representatives"
file   = xml2::read_html(url)
tables = rvest::html_nodes(file, "table")

reps = rvest::html_table(tables[6])
reps = as.data.frame(reps)[1,1:3]

reps$District
# [1] "Alabama 1"

# I expected this line to return TRUE. 
# I literally copied the output from the R console above to the RHS of the ==
reps$District == "Alabama 1"
# [1] FALSE

# Because the above line returns FALSE, this code returns an empty data.frame
reps[reps$District=="Alabama 1",]
# [1] District Member   Party   
# <0 rows> (or 0-length row.names)

This error is very difficult to explain to people online, because the common way to share data and code on e.g. Stack Overflow is to use dput. But here, if you use dput, the problem disappears.

dput(reps)
# structure(list(District = "Alabama 1", Member = "Bradley Byrne", 
#    Party = NA), row.names = 1L, class = "data.frame")

x=structure(list(District = "Alabama 1", Member = "Bradley Byrne", 
                 Party = NA), row.names = 1L, class = "data.frame")

# now it's TRUE!
x$District=="Alabama 1"
# [1] TRUE

# and so the subset works
x[x$District == "Alabama 1", ]
# District        Member Party
# 1 Alabama 1 Bradley Byrne    NA

I am not sure what the right answer here is. But the current default behavior in what I assume is a common use case of rvest (scraping a table from Wikipedia) leads to buggy behavior that is very hard to track down when dealing with the data in R. My first thought is that rvest should, by default, change all &nbsp;s to " " and give users the option to disable this.

@georgevbsantiago

I think this "problem" is linked to RStudio. In the image below, the RStudio console displays "Alabama 1", but inspecting the value shows "Alabama&nbsp;1".

[Image: Annotation 2020-08-02 012603]

Only the textutils::HTMLencode function makes the HTML entity visible as text.

Reprex:

library(rvest)
#> Loading required package: xml2
library(xml2)
library(textutils)

url    <-  "https://en.m.wikipedia.org/wiki/List_of_current_members_of_the_United_States_House_of_Representatives"
file   <-  xml2::read_html(url)
tables <-  rvest::html_nodes(file, "table")

reps <-  rvest::html_table(tables[6])
reps <-  as.data.frame(reps)[1,1:3]

reps$District
#> [1] "Alabama 1"

textutils::HTMLencode(reps$District)
#> [1] "Alabama&nbsp;1"

Created on 2020-08-02 by the reprex package (v0.3.0)
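
For readers without textutils, a minimal base R sketch of the same inspection follows. It does not scrape anything: the string nbsp_string is a stand-in built by hand on the assumption that the scraped cell contains U+00A0, and both variable names are made up for illustration.

```r
# Two strings that look identical when printed in most consoles
nbsp_string  <- "Alabama\u00a01"  # contains a non-breaking space (U+00A0)
plain_string <- "Alabama 1"       # contains an ordinary space (U+0020)

# The comparison fails because the eighth character differs
nbsp_string == plain_string
# [1] FALSE

# Inspect the code point of the eighth character in each string
utf8ToInt(nbsp_string)[8]
# [1] 160
utf8ToInt(plain_string)[8]
# [1] 32

# Check whether a string contains at least one non-breaking space
grepl("\u00a0", nbsp_string, fixed = TRUE)
# [1] TRUE
```

Code point 160 is U+00A0, which is why the console output looks right while the equality test fails.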


hadley commented Dec 14, 2020

I don't think there's much rvest can do about this because spaces and non-breaking spaces are different characters, and rvest never actually sees the &nbsp; because xml2 has already converted it to the underlying Unicode character. Applying some arbitrary transformation of Unicode characters to all rvest outputs seems dangerous to me, so I think the best we can do is better document this common case.
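
A small sketch of the point about xml2, using an HTML fragment written for illustration rather than the Wikipedia page. It assumes xml2 is installed; the variable names doc and cell are arbitrary.

```r
library(xml2)

# Parse a tiny HTML fragment that contains the &nbsp; entity
doc  <- read_html("<p>Alabama&nbsp;1</p>")
cell <- xml_text(xml_find_first(doc, "//p"))

# The entity text is gone by the time the string reaches R
grepl("&nbsp;", cell, fixed = TRUE)
# [1] FALSE

# What remains is the single character U+00A0 (code point 160)
utf8ToInt(cell)[8]
# [1] 160
```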


hadley commented Dec 14, 2020

Alternatively, could combine with #175 and have some new transform_ws argument that would convert <br> to "\n" and "\u00A0" to " ".
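
Until something like that exists, the non-breaking-space half of the idea can be approximated after scraping. The sketch below is a hypothetical helper, replace_nbsp(), written in base R; it is not part of rvest, it does not handle <br>, and the data frame is a hand-built stand-in for the scraped table.

```r
# Hypothetical helper (not an rvest function): replace U+00A0 with a
# regular space in every character column of a data frame
replace_nbsp <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  if (any(is_chr)) {
    df[is_chr] <- lapply(df[is_chr], function(col) {
      gsub("\u00a0", " ", col, fixed = TRUE)
    })
  }
  df
}

# Stand-in for the first row of the scraped table
reps <- data.frame(District = "Alabama\u00a01", Member = "Bradley Byrne",
                   stringsAsFactors = FALSE)

reps <- replace_nbsp(reps)
reps$District == "Alabama 1"
# [1] TRUE
```

Applying the replacement explicitly, column by column, keeps the transformation visible in user code instead of silently altering every string that rvest returns.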


hadley commented Dec 20, 2020

Also see regex used for trimming within html_table().

hadley added a commit that referenced this issue Jan 5, 2021
via html_text2() which implements an approach inspired by innerText(), and replaces non-breaking spaces by default.

Fixes #175. Fixes #284.
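
A short sketch of how the new function behaves, assuming rvest 1.0.0 or later is installed. The HTML fragment is made up for illustration, and the results shown in the comments are what the html_text2() documentation describes rather than output copied from this thread.

```r
library(rvest)

# Build a tiny document containing a non-breaking space
html <- minimal_html("<p>Alabama&nbsp;1</p>")
node <- html_element(html, "p")

# html_text() keeps the non-breaking space, so the comparison fails
html_text(node) == "Alabama 1"
# [1] FALSE

# html_text2() replaces it with a regular space by default
html_text2(node) == "Alabama 1"
# [1] TRUE

# The old behavior is still available on request
html_text2(node, preserve_nbsp = TRUE) == "Alabama 1"
# [1] FALSE
```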

arilamstein commented Jan 5, 2021 via email
