SAX Parser ignores explicitly set 'UTF-8' encoding and proceeds to reencode the document resulting in double-encoding artifacts #918
Comments
Related issue: #844
It's also problematic on JRuby: XmlSaxParserContext#parse_io ignores the encoding parameter. It would be useful to be able to force the encoding.
I have reported a similar problem on StackOverflow -> http://stackoverflow.com/questions/18947249/using-ruby-sax-parsers-for-gb2312-encoded-xml
Seems like we can use |
Previously, this functionality worked fine for `#parse_io` but didn't work for `#parse_memory`. This change introduces a new optional encoding parameter to `SAX::Parser#parse_memory` and `SAX::ParserContext.memory`, and makes sure to use that encoding or the one passed to the Parser's initializer. This change also makes optional the encoding_id parameter to `SAX::ParserContext.io`, which was previously required. Closes #918
Proposed fix is at #3282. I also have another PR in progress to clean up all of this code and use real encoding names (rather than these libxml2-specific encoding IDs).
Previously, this functionality worked fine for `#parse_io` but didn't work for `#parse_memory`. This change introduces a new optional encoding parameter to `SAX::Parser#parse_memory` and `SAX::ParserContext.memory`, and makes sure to use that encoding or the one passed to the Parser's initializer. This change also makes optional the encoding_id parameter to `SAX::ParserContext.io`, which was previously required. Finally, this commit also backfills similar test coverage for the HTML4 sax parser encoding, which should help with an upcoming big refactor. Closes #918
Previously, encoding overrides were not implemented for XML::SAX::Parser#parse_memory (as reported in #918) and XML::SAX::Parser#parse_file. However, this commit goes further and significantly simplifies and unifies the two SAX::ParserContext implementations and the two SAX::Parser implementations. This commit also allows Encoding objects and encoding names to be passed into the SAX::ParserContext methods, and the XML memory and file methods now accept and properly use passed encodings. Finally, this commit also backfills a lot of test coverage for the XML and the HTML4 sax parser encoding. Closes #918
New proposed fix is at #3288
**What problem is this PR intended to solve?**

Previously, encoding overrides were not implemented for XML::SAX::Parser#parse_memory (as reported in #918) and XML::SAX::Parser#parse_file. However, this commit goes further and significantly simplifies and unifies the two SAX::ParserContext implementations and the two SAX::Parser implementations.

This commit also allows Encoding objects and encoding names to be passed into the SAX::ParserContext methods, and the XML memory and file methods now accept and properly use passed encodings.

Finally, this commit also backfills a lot of test coverage for the XML and the HTML4 sax parser encoding.

Closes #918

**Have you included adequate test coverage?**

Yes.

**Does this change affect the behavior of either the C or the Java implementations?**

Yes, but they are more consistent with each other.
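To illustrate what "Encoding objects and encoding names" means in practice, here is a minimal plain-Ruby sketch of accepting either form and resolving it to a single `Encoding` object. The helper name `normalize_encoding` is hypothetical and is not taken from Nokogiri's source; it only uses Ruby's built-in `Encoding.find`.

```ruby
# Hypothetical helper (not Nokogiri's implementation): accept either an
# Encoding object, an encoding name, or nil, and return an Encoding or nil.
def normalize_encoding(encoding)
  return nil if encoding.nil?                # nil means "no override requested"
  return encoding if encoding.is_a?(Encoding)

  # Encoding.find is case-insensitive and raises ArgumentError for unknown names.
  Encoding.find(encoding.to_s)
end

by_name   = normalize_encoding("utf-8")
by_object = normalize_encoding(Encoding::ISO_8859_1)
none      = normalize_encoding(nil)
```

Resolving the argument once at the boundary means the rest of a parser can work with a single type, which is presumably part of what the simplification described above is aiming for.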
Unlike the regular Nokogiri::XML parser, which allows you to set the encoding explicitly, Nokogiri::XML::SAX::Parser ignores the explicitly set encoding and uses the one specified in the provided XML (which has already been encoded in UTF-8, but whose contents haven't changed):
With a regular parser, however, when the encoding is explicitly specified (per instructions), it is respected and double conversion does not occur:
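The double conversion described above can be reproduced in plain Ruby, without Nokogiri. The sketch below assumes a document whose bytes are already UTF-8 but whose declaration still names ISO-8859-1; the method `simulate_reencode` is a hypothetical stand-in for a parser that trusts the declared encoding, not a Nokogiri API.

```ruby
# Hypothetical stand-in for a parser: treat the bytes as `assumed_encoding`
# and convert them to UTF-8, as a parser does when it trusts that encoding.
def simulate_reencode(text, assumed_encoding)
  text.dup.force_encoding(assumed_encoding).encode("UTF-8")
end

utf8_text = "caf\u00E9" # "café", already valid UTF-8 (bytes: 63 61 66 C3 A9)

# Parser trusts the stale ISO-8859-1 declaration: each UTF-8 byte is read
# as a separate Latin-1 character and encoded to UTF-8 a second time.
double_encoded = simulate_reencode(utf8_text, "ISO-8859-1")

# Parser honors the explicit UTF-8 override: the text passes through unchanged.
respected = simulate_reencode(utf8_text, "UTF-8")

puts double_encoded # the "é" has become two characters, "Ã©"
puts respected
```

The two-byte sequence for "é" turning into "Ã©" is the typical artifact to look for when diagnosing this kind of bug.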