Treatment of &nbsp; is confusing #284

arilamstein · 2020-07-28T15:54:31Z

I am using rvest to scrape data from wikipedia. It appears that in tables, wikipedia sometimes uses &nbsp; instead of " ". This leads to very strange behavior when the table is imported into R via rvest.

I posted a reproducible example on Stack Overflow, but will copy it here as well.

Part 1:

library(rvest)
library(xml2)
url    = "https://en.m.wikipedia.org/wiki/List_of_current_members_of_the_United_States_House_of_Representatives"
file   = xml2::read_html(url)
tables = rvest::html_nodes(file, "table")

reps = rvest::html_table(tables[6])
reps = as.data.frame(reps)[1,1:3]

reps$District
# [1] "Alabama 1"

# I expected this line to return TRUE. 
# I literally copied the output from the R console above to the RHS of the ==
reps$District == "Alabama 1"
# [1] FALSE

# Because the above line returns FALSE, this code returns an empty data.frame
reps[reps$District=="Alabama 1",]
# [1] District Member   Party   
# <0 rows> (or 0-length row.names)

This error is very difficult to explain to people online, because the common way to share data and code on e.g. Stack Overflow is to use dput. But here, if you use dput, the problem disappears.

dput(reps)
# structure(list(District = "Alabama 1", Member = "Bradley Byrne", 
#    Party = NA), row.names = 1L, class = "data.frame")

x=structure(list(District = "Alabama 1", Member = "Bradley Byrne", 
                 Party = NA), row.names = 1L, class = "data.frame")

# now it's TRUE!
x$District=="Alabama 1"
# [1] TRUE

# and so the subset works
x[x$District == "Alabama 1", ]
# District        Member Party
# 1 Alabama 1 Bradley Byrne    NA

I am not sure what the right answer here is. But the current default behavior of what I assume is a common use case of rvest (scraping a table from wikipedia) leads to buggy behavior that is very hard to track down when dealing with the data in R. My first thought is that rvest should, by default, change all &nbsp;s to " " and give users the option to disable this.
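As a hedged illustration of the behavior described above, here is a minimal base-R sketch. It assumes the scraped value contains a non-breaking space (U+00A0) where a regular space is expected; the variable names `district` and `clean` are made up for this example and the string is typed in by hand rather than scraped.

```r
# A string with a non-breaking space (U+00A0) between "Alabama" and "1",
# standing in for the value scraped from the table (assumption).
district <- "Alabama\u00a01"

# It prints like an ordinary space but does not compare equal to one.
district == "Alabama 1"
#> [1] FALSE

# The code point at position 8 shows the difference:
# 160 (U+00A0) in the scraped-style string, 32 (U+0020) in the typed one.
utf8ToInt(district)[8]
utf8ToInt("Alabama 1")[8]

# Replacing U+00A0 with a regular space makes the comparison succeed.
clean <- gsub("\u00a0", " ", district, fixed = TRUE)
clean == "Alabama 1"
#> [1] TRUE
```

Checking code points with `utf8ToInt()` is one way to show the problem to others when the printed output looks identical.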

georgevbsantiago · 2020-08-02T04:36:12Z

I think this "problem" is linked to RStudio. In the image below, the RStudio console displays "Alabama 1", but inspecting the value shows "Alabama&nbsp;1".

Only with the textutils::HTMLencode function does the HTML entity become visible as text.

Reprex:

library(rvest)
#> Loading required package: xml2
library(xml2)
library(textutils)

url    <-  "https://en.m.wikipedia.org/wiki/List_of_current_members_of_the_United_States_House_of_Representatives"
file   <-  xml2::read_html(url)
tables <-  rvest::html_nodes(file, "table")

reps <-  rvest::html_table(tables[6])
reps <-  as.data.frame(reps)[1,1:3]

reps$District
#> [1] "Alabama 1"

textutils::HTMLencode(reps$District)
#> [1] "Alabama&nbsp;1"

Created on 2020-08-02 by the reprex package (v0.3.0)

hadley · 2020-12-14T14:38:51Z

I don't think there's much rvest can do about this because spaces and non-breaking spaces are different characters, and rvest never actually sees the &nbsp; because xml2 has already converted it to the underlying unicode. Applying some arbitrary transformation of unicode characters to all rvest outputs seems dangerous to me, so I think the best we can do is better document this common case.
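A small sketch of the point above, assuming only that xml2 is installed. The one-cell table is a hypothetical stand-in for the Wikipedia page, and `doc` and `cell` are names chosen for this example.

```r
library(xml2)

# Hypothetical one-cell fragment standing in for the Wikipedia table.
doc  <- read_html("<table><tr><td>Alabama&nbsp;1</td></tr></table>")
cell <- xml_text(xml_find_first(doc, "//td"))

# The entity text is gone by the time the string reaches R ...
grepl("&nbsp;", cell, fixed = TRUE)
#> [1] FALSE

# ... and in its place is the single character U+00A0.
nchar(cell)
#> [1] 9
cell == "Alabama\u00a01"
#> [1] TRUE
```

Because the parser has already decoded the entity, any code working on the extracted text sees only a Unicode character that happens to render like a space.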

hadley · 2020-12-14T15:57:18Z

Alternatively, could combine with #175 and have some new transform_ws argument that would convert <br> to "\n" and "\u00A0" to " ".

hadley · 2020-12-20T00:08:15Z

Also see regex used for trimming within html_table().

arilamstein · 2021-01-05T19:58:44Z

Thanks so much for your work on this Hadley.

hadley mentioned this issue Dec 14, 2020

Compare individual strings r-lib/waldo#57

Open

hadley added the documentation label Dec 14, 2020

hadley mentioned this issue Jan 5, 2021

Implement human friendly text extraction #296

Merged

hadley closed this as completed in #296 Jan 5, 2021

hadley added a commit that referenced this issue Jan 5, 2021

Implement human friendly text extraction (#296)

439a3e7

via html_text2() which implements an approach inspired by innerText(), and replaces non-breaking spaces by default. Fixes #175. Fixes #284.
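For readers landing here later, a minimal sketch of the behavior the commit message describes, assuming rvest 1.0.0 or later (where `html_text2()`, `html_element()`, and `minimal_html()` are available). The HTML fragment is invented for this example, not taken from the Wikipedia page.

```r
library(rvest)

# Hypothetical fragment containing both a non-breaking space and a <br>.
html <- minimal_html("<p>Alabama&nbsp;1<br>Bradley Byrne</p>")
p    <- html_element(html, "p")

# html_text() returns the raw text: U+00A0 is kept and <br> contributes nothing.
raw <- html_text(p)
raw == "Alabama\u00a01Bradley Byrne"

# html_text2() replaces U+00A0 with a regular space by default
# and turns <br> into a newline.
friendly <- html_text2(p)
friendly == "Alabama 1\nBradley Byrne"

# The non-breaking space can be kept on request.
kept <- html_text2(p, preserve_nbsp = TRUE)
kept == "Alabama\u00a01\nBradley Byrne"
```

Whether `html_table()` applies the same replacement depends on the rvest version, so it may be worth checking the result with `utf8ToInt()` before relying on string comparisons.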
