substitute_entities #67

troelskn · 2009-05-28T13:23:05Z

According to the documentation, you can change the internal string encoding that Nokogiri uses. There are a couple of problems with this.

First, it doesn't work as advertised. If you set substitute_entities to a boolean, you get an error. If you set it to an integer, it doesn't change behaviour. I'm sure this can be fixed somehow.

The second problem is more severe. Setting such behaviour globally is a really bad idea (tm). If two modules both use Nokogiri and expect different behaviour, they cannot coexist. You should at the very least move this setting to the document level, and perhaps even provide an alternative to Node#to_s that returns a string in an explicit encoding. In any case, I was rather confused by the choice of html-escaped encoding for strings. I would have expected utf-8 to be the default.
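To illustrate the coexistence problem, here is a minimal sketch in plain Ruby. ToyParser and its substitute_entities option are hypothetical stand-ins invented for this example; they are not Nokogiri's API and say nothing about how Nokogiri or libxml2 implement the setting.

```ruby
# Hypothetical stand-in for a parser library; not Nokogiri's API.
class ToyParser
  class << self
    # Process-wide switch shared by every caller.
    attr_accessor :substitute_entities
  end
  self.substitute_entities = false

  # The per-instance option falls back to the global switch when omitted.
  def initialize(substitute_entities: self.class.substitute_entities)
    @substitute_entities = substitute_entities
  end

  # Toy behaviour: only the &amp; entity is handled.
  def text(raw)
    @substitute_entities ? raw.gsub("&amp;", "&") : raw
  end
end

# Module A flips the global switch...
ToyParser.substitute_entities = true
# ...and module B, which never asked for substitution, is affected too.
global_result = ToyParser.new.text("a &amp; b")

# With a document-level option, each caller's choice stays local.
ToyParser.substitute_entities = false
local_a = ToyParser.new(substitute_entities: true).text("a &amp; b")
local_b = ToyParser.new(substitute_entities: false).text("a &amp; b")

puts global_result, local_a, local_b
```

The point of the sketch is only that a keyword argument scoped to one object cannot leak into unrelated code, whereas a class-level accessor can.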


troelskn · 2009-05-28T13:49:45Z

I just realised that attributes are returned as UTF-8, which makes it rather inconsistent to return entity-encoded strings for text nodes:

doc = Nokogiri::XML("<?xml version='1.0' encoding='utf-8' ?><root foo='\xc3\xa5'>\xc3\xa5</root>")
puts doc.xpath('//root').first[:foo]
puts doc.xpath('//root/text()').first.to_s

flavorjones · 2009-05-29T21:42:52Z

Hi! Thanks for using Nokogiri, and for reporting your issues. We love hearing from our users.

I'm sorry I didn't respond sooner, but when a test case is not provided, it often means we need to schedule some time to reproduce the issues you're talking about.

First, I completely agree with you that setting such behavior globally is a Really Bad Idea. Unfortunately, since Nokogiri uses libxml2, we are bound to their API, and as you can see here and read about here, this is simply how libxml2 implements the functionality, and there's not much we can do about that in the short-term.

Next, you are absolutely correct that passing 'true' results in an error. For now, passing a 1 or a 0 should be sufficient.

However, it does appear that this functionality is broken in at least some versions of libxml2. I'll double check with a straight-C program, and let you know if I can get it to work at all.

Lastly, I am a little confused. You seem to be conflating the issue of entity-escaping with encoding. These are two distinct functions. You can choose your own encoding when you serialize the document. If you want UTF-8 in your above example, simply run:

puts doc.xpath('//root/text()').first.to_xml(:encoding => 'UTF-8')
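The distinction between encoding and entity escaping can be sketched with Ruby's standard library alone. This is an illustration of the two concepts, assuming a UTF-8 source file; it does not call Nokogiri and makes no claim about what libxml2 does internally.

```ruby
text = "\u00e5" # the character å, code point U+00E5

# Encoding decides which bytes represent a character.
utf8_bytes   = text.encode("UTF-8").bytes       # two bytes
latin1_bytes = text.encode("ISO-8859-1").bytes  # one byte

# Both byte sequences decode back to the same character.
round_trip = latin1_bytes.pack("C*")
                         .force_encoding("ISO-8859-1")
                         .encode("UTF-8")

# Escaping rewrites markup-significant characters (<, >, &) and is a
# separate step: å is left alone because UTF-8 can represent it.
escaped = "<#{text}>".encode(xml: :text)

puts utf8_bytes.inspect, latin1_bytes.inspect, round_trip == text, escaped
```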

If you have a specific case (other than libxml2's apparently-broken default entity escaping setting) in which Nokogiri isn't behaving the way you expect it to, please send us a failing test which is explicit about what your expectations are.

flavorjones · 2009-05-30T00:38:07Z

Commit 7c06969 fixes doc strings and includes a (failing) substitute_entities test.

troelskn · 2009-05-30T09:29:13Z

> First, I completely agree with you that setting such behavior globally is a Really Bad Idea. Unfortunately, since Nokogiri uses libxml2, we are bound to their API, and as you can see here and read about here, this is simply how libxml2 implements the functionality, and there's not much we can do about that in the short-term.

I see. In that case, it would probably be a good idea to dictate one behaviour or the other. That way, people won't run into incompatibilities.

> Lastly, I am a little confused. You seem to be conflating the issue of entity-escaping with encoding. These are two distinct functions. You can choose your own encoding when you serialize the document.

I assume that what happens internally is that if you serialize a string containing characters that can't be represented in the target encoding (e.g. iso-8859-1, utf-8), you get entities instead. So when you simply call to_s, I must assume that ASCII is the assumed encoding, meaning that any non-ASCII characters become entities? I would propose that the default output encoding should be utf-8. In other words, I would assume that:

doc = Nokogiri::XML("<?xml version='1.0' encoding='utf-8' ?><root>\xc3\xa5</root>")
str_default = doc.xpath('//root/text()').first.to_s
str_utf8 = doc.xpath('//root/text()').first.serialize('UTF-8')
assert_equal str_default, str_utf8
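The behaviour assumed above (characters the target encoding cannot represent turn into entities) can be reproduced with Ruby's own String#encode and its xml: :text option. This is a sketch of the assumption only; whether libxml2 follows the same rule when to_s is called is not confirmed anywhere in this thread.

```ruby
text = "\u00e5" # å

# UTF-8 can represent å, so nothing needs to be replaced.
as_utf8 = text.encode("UTF-8", xml: :text)

# US-ASCII cannot, so the character is written as a hexadecimal
# character reference instead of raising
# Encoding::UndefinedConversionError.
as_ascii = text.encode("US-ASCII", xml: :text)

puts as_utf8.bytes.inspect
puts as_ascii
```

Under this model, a default output encoding of ASCII would explain entity-escaped text nodes, and a default of UTF-8 would make the two strings in the assertion above equal.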

flavorjones · 2009-05-30T13:21:50Z

removing substitute_entities= and load_external_subsets=. there are other non-global ways to do this. closed by 0a7479a.

This issue was closed.
