substitute_entities #67

Closed
troelskn opened this issue May 28, 2009 · 5 comments

Comments

@troelskn

According to the documentation, you can change the internal string encoding that Nokogiri uses. There are a couple of problems with this.

First, it doesn't work as advertised. If you set substitute_entities to a boolean, you get an error. If you set it to an integer, it doesn't change behaviour. I'm sure this can be fixed somehow.

The second problem is more severe. Setting such behaviour globally is a really bad idea (tm). If two modules both use Nokogiri and expect different behaviour, they would not be able to coexist. You should at the very least move this setting to the document level; perhaps even provide an alternative to Node#to_s that returns a string in an explicit encoding. In any case, I was rather confused by the choice of HTML-escaped encoding for strings. I would have expected UTF-8 to be the default.

@troelskn
Author

I just realised that attributes are returned as UTF-8, which makes it rather inconsistent to return entity-encoded strings for text nodes:

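require 'nokogiri'

doc = Nokogiri::XML("<?xml version='1.0' encoding='utf-8' ?><root foo='\xc3\xa5'>\xc3\xa5</root>")
# Attribute value (reported above as coming back as UTF-8)
puts doc.xpath('//root').first[:foo]
# Text node (reported above as coming back entity-encoded)
puts doc.xpath('//root/text()').first.to_s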

@flavorjones
Member

Hi! Thanks for using Nokogiri, and for reporting your issues. We love hearing from our users.

I'm sorry I didn't respond sooner, but when a test case is not provided, it often means we need to schedule some time to reproduce the issues you're talking about.

First, I completely agree with you that setting such behavior globally is a Really Bad Idea. Unfortunately, since Nokogiri uses libxml2, we are bound to their API, and as you can see here and read about here, this is simply how libxml2 implements the functionality, and there's not much we can do about that in the short-term.

Next, you are absolutely correct that passing 'true' results in an error. For now, passing a 1 or a 0 should be sufficient.

However, it does appear that this functionality is broken in at least some versions of libxml2. I'll double check with a straight-C program, and let you know if I can get it to work at all.

Lastly, I am a little confused. You seem to be conflating the issue of entity-escaping with encoding. These are two distinct functions. You can choose your own encoding when you serialize the document. If you want UTF-8 in your above example, simply run:

puts doc.xpath('//root/text()').first.to_xml(:encoding => 'UTF-8')

If you have a specific case (other than libxml2's apparently-broken default entity escaping setting) in which Nokogiri isn't behaving the way you expect it to, please send us a failing test which is explicit about what your expectations are.

@flavorjones
Member

Commit 7c06969 fixes doc strings and includes a (failing) substitute_entities test.

@troelskn
Author

> First, I completely agree with you that setting such behavior globally is a Really Bad Idea. Unfortunately, since Nokogiri uses libxml2, we are bound to their API, and as you can see here and read about here, this is simply how libxml2 implements the functionality, and there's not much we can do about that in the short-term.

I see. In that case, it would probably be a good idea to dictate one behaviour or the other. That way, people won't run into incompatibilities.

> Lastly, I am a little confused. You seem to be conflating the issue of entity-escaping with encoding. These are two distinct functions. You can choose your own encoding when you serialize the document.

I assume that what happens internally is that if you serialize a string containing characters that can't be represented in the target encoding (e.g. iso-8859-1, utf-8), you get entities instead. So when you simply call to_s, I must assume that ASCII is used as the encoding, meaning that any non-ASCII characters become entities? I would propose that the default output encoding should be utf-8. In other words, I would assume that:

doc = Nokogiri::XML("<?xml version='1.0' encoding='utf-8' ?><root>\xc3\xa5</root>")
str_default = doc.xpath('//root/text()').first.to_s
str_utf8 = doc.xpath('//root/text()').first.serialize('UTF-8')
assert_equal str_default, str_utf8

@flavorjones
Member

Removing substitute_entities= and load_external_subsets=. There are other, non-global ways to do this. Closed by 0a7479a.

This issue was closed.