print.xml_nodeset very slow for document with one huge node #366

MichaelChirico · 2022-06-03T21:43:52Z

Found in the XML representation of an edge case R file:

library(xml2)
library(xmlparsedata)

p = parse("https://raw.githubusercontent.com/mwaldstein/edgarWebR/fb9a38e6a57186ffd1c93cc1aa00c4fdf1bc5514/tests/cache/browse-edgar-11457c.R")
xml = read_xml(xml_parse_data(p))

Printing this is painfully slow:

system.time(print(xml))
# {xml_document}
# <exprlist>
# [1] <expr line1="1" col1="1" line2="5944" col2="43" start="145" end="855979">\n  <expr line1="1" col1="1" line2="1" col2="9" start="145" end="153">\n    <SYMBOL_FUNCTION_CALL li ...
#    user  system elapsed 
#   2.906   0.048   2.958

I took a brief look, and encodeString() appears to be the culprit:

# ** debugging inside show_nodes() **
system.time(vapply(x, as.character, FUN.VALUE = character(1)))
#    user  system elapsed 
#   0.248   0.017   0.268 
system.time(encodeString(vapply(x, as.character, FUN.VALUE = character(1))))
#    user  system elapsed 
#   2.959   0.024   3.007
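
For anyone who wants to check this without downloading the file above, here is a rough sketch that times encodeString() on a synthetic string. The string `big`, its contents, and the `width` value are assumptions made for illustration; they are not the document from the report, and timings will vary by machine.

```r
# Synthetic stand-in for one huge serialized node (NOT the file from the report).
big <- paste(
  rep("<expr line1=\"1\" col1=\"1\">\n  <SYMBOL>x</SYMBOL>\n</expr>", 1e5),
  collapse = "\n"
)
width <- 180L  # assumed console width, as mentioned in the report

# Encode the whole string: the work scales with the full input length.
time_full <- system.time(enc_full <- encodeString(big))

# Encode only the prefix that could ever be displayed.
time_short <- system.time(enc_short <- encodeString(substr(big, 1, width)))

nchar(big)
time_full
time_short
```

If encodeString() is the bottleneck, `time_full` should come out noticeably larger than `time_short`, even though only the first `width` characters are ever shown.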

Is it possible to apply substr() twice -- once after as.character(), then again after encodeString()?

chr = vapply(x, as.character, FUN.VALUE = character(1))
nchar(chr)
# [1] 18965721

This is clearly already way too wide (width = 180 for me).

I believe we can always just apply

x %>%
  substring(1, n) %>%
  encodeString() %>%
  substring(1, n)

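since the default behavior of encodeString() is to escape non-printable characters with \, so its output is never narrower than its input. That means the first n characters of the encoded string can only come from the first n characters of the original, and truncating before encoding cannot change what gets printed.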
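
A minimal sketch of that idea as a standalone function, assuming base R only. The name `encode_truncated` and the test string are hypothetical; this is not xml2's actual code, just an illustration of truncating before and after encoding.

```r
# Hypothetical helper illustrating the double truncation; not part of xml2.
encode_truncated <- function(x, n) {
  # First cut: drop everything past n characters before the expensive call.
  x <- substring(x, 1, n)
  # Escape non-printable characters; this can only make the string wider.
  x <- encodeString(x)
  # Second cut: trim the extra width that the escapes introduced.
  substring(x, 1, n)
}

# Synthetic input with plenty of characters that need escaping.
big <- paste(rep("a\tb\n", 1e5), collapse = "")
n <- 20

fast <- encode_truncated(big, n)
slow <- substring(encodeString(big), 1, n)  # encode everything, then truncate

identical(fast, slow)
nchar(fast)
```

Both routes should give the same n-character result, since each input character encodes to at least one output character. Only the first route avoids encoding the part of the string that is never displayed.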

Happy to file a PR if that sounds good.


hadley · 2023-10-30T18:28:13Z

A PR would be great if you still care about this problem 😄

MichaelChirico changed the title ~~print.xml_nodeset very slow for huge document~~ print.xml_nodeset very slow for document with one huge node Jun 3, 2022

hadley added the "feature" label (a feature request or enhancement) Oct 30, 2023

MichaelChirico mentioned this issue Oct 31, 2023

double-truncate to avoid inefficient encodeString() usage #413

Merged

hadley closed this as completed in #413 Nov 2, 2023
