Skip to content

Commit

Permalink
fix: SAX::Parser#parse_memory allows overriding the input encoding
Browse files Browse the repository at this point in the history
Previously, this functionality worked fine for `#parse_io` but didn't
work for `#parse_memory`.

This change introduces a new optional encoding parameter to
`SAX::Parser#parse_memory` and `SAX::ParserContext.memory`, and makes
sure to use that encoding or the one passed to the Parser's initializer.

This change also makes optional the encoding_id parameter to
`SAX::ParserContext.io`, which was previously required.

Closes #918
  • Loading branch information
flavorjones committed Jul 5, 2024
1 parent af8cd18 commit ec4389e
Show file tree
Hide file tree
Showing 4 changed files with 165 additions and 28 deletions.
5 changes: 4 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,9 @@ Nokogiri follows [Semantic Versioning](https://semver.org/), please see the [REA

### Added

* Introduce support for a new SAX callback `XML::SAX::Document#reference`, which is called to report some parsed XML entities when `SAX::ParserContext#replace_entities` is set to the default value `false`. This is necessary functionality for some applications that were previously relying on incorrect entity error reporting which has been fixed (see below). For more information, read the docs for `Nokogiri::XML::SAX::Document`. [#1926] @flavorjones
* Introduce support for a new SAX callback `XML::SAX::Document#reference`, which is called to report some parsed XML entities when `XML::SAX::ParserContext#replace_entities` is set to the default value `false`. This is necessary functionality for some applications that were previously relying on incorrect entity error reporting which has been fixed (see below). For more information, read the docs for `Nokogiri::XML::SAX::Document`. [#1926] @flavorjones
* `XML::SAX::Parser#parse_memory` now accepts an optional `encoding` argument. When not provided, the parser will fall back to the encoding passed to the initializer, and then fall back to autodetection. [#918] @flavorjones
* `XML::SAX::ParserContext.memory` now accepts an optional `encoding_id` argument. When not provided, the encoding will be autodetected. [#918] @flavorjones
* [CRuby] `Nokogiri::HTML5::Builder` is similar to `HTML4::Builder` but returns an `HTML5::Document`. [#3119] @flavorjones
* [CRuby] Attributes in an HTML5 document can be serialized individually, something that has always been supported by the HTML4 serializer. [#3125, #3127] @flavorjones
* [CRuby] Introduce a compile-time option, `--disable-xml2-legacy`, to remove from libxml2 its dependencies on `zlib` and `liblzma` and disable implicit `HTTP` network requests. These all remain enabled by default, and are present in the precompiled native gems. This option is a precursor for removing these libraries in a future major release, but may be interesting for the security-minded who do not need features like automatic decompression and would like to remove these dependencies. You can read more and give feedback on these plans in #3168. [#3247] @flavorjones
Expand All @@ -28,6 +30,7 @@ Nokogiri follows [Semantic Versioning](https://semver.org/), please see the [REA
* Documentation has been improved for `CSS.xpath_for`. [#3224] @flavorjones
* Documentation for the SAX parsing classes has been greatly improved, including the complex entity-handling behavior. [#3265] @flavorjones
* `XML::Schema#read_memory` and `XML::RelaxNG#read_memory` are now Ruby methods that call `#from_document`. Previously these were native functions, but they were buggy on both CRuby and JRuby (but worse on JRuby) and so this is now useful, comparable in performance, and simpler code that is easier to maintain. [#2113, #2115] @flavorjones
* `XML::SAX::ParserContext.io`'s `encoding_id` argument is now optional, and when not provided will default to autodetecting the encoding. [#918] @flavorjones
* [CRuby] When compiling packaged libraries from source, allow users' `AR` and `LD` environment variables to set the archiver and linker commands, respectively. This augments the existing `CC` environment variable to set the compiler command. [#3165] @ziggythehamster
* [CRuby] The HTML5 parse methods accept a `:parse_noscript_content_as_text` keyword argument which will emulate the parsing behavior of a browser which has scripting enabled. [#3178, #3231] @stevecheckoway
* [CRuby] `HTML5::DocumentFragment.parse` and `.new` accept a `:context` keyword argument that is the parse context node or element name. Previously this could only be passed in as a positional argument to `.new` and not at all to `.parse`. @flavorjones
Expand Down
48 changes: 34 additions & 14 deletions ext/nokogiri/xml_sax_parser_context.c
Original file line number Diff line number Diff line change
Expand Up @@ -46,32 +46,43 @@ noko_xml_sax_parser_context_wrap(VALUE klass, xmlParserCtxtPtr c_context)

/*
* call-seq:
* io(input)
* io(input, encoding_id)
*
* Create a parser context for an +input+ IO which will assume +encoding+
*
* [Parameters]
* - +io+ (IO) The readable IO object from which to read input
* - +encoding_id+ (Integer) The libxml2 encoding ID to use, see SAX::Parser::ENCODINGS
* - +encoding_id+ (optional Integer) The libxml2 encoding ID to use, see SAX::Parser::ENCODINGS
* (default: 0, corresponding to "NONE", which means to autodetect the encoding)
*
* [Returns] Nokogiri::XML::SAX::ParserContext
*
* 💡 Calling Nokogiri::XML::SAX::Parser.parse is more convenient for most use cases.
*/
static VALUE
noko_xml_sax_parser_context_s_io(VALUE rb_class, VALUE rb_io, VALUE rb_encoding_id)
noko_xml_sax_parser_context_s_io(int argc, VALUE *argv, VALUE rb_class)
{
xmlParserCtxtPtr c_context;
xmlCharEncoding c_encoding = (xmlCharEncoding)NUM2INT(rb_encoding_id);
VALUE rb_io, rb_encoding_id;

rb_scan_args(argc, argv, "11", &rb_io, &rb_encoding_id);

xmlCharEncoding c_encoding;
if (RB_TEST(rb_encoding_id)) {
c_encoding = (xmlCharEncoding)NUM2INT(rb_encoding_id);
} else {
c_encoding = XML_CHAR_ENCODING_NONE;
}

if (!rb_respond_to(rb_io, id_read)) {
rb_raise(rb_eTypeError, "argument expected to respond to :read");
}

c_context = xmlCreateIOParserCtxt(NULL, NULL,
(xmlInputReadCallback)noko_io_read,
(xmlInputCloseCallback)noko_io_close,
(void *)rb_io, c_encoding);
xmlParserCtxtPtr c_context =
xmlCreateIOParserCtxt(NULL, NULL,
(xmlInputReadCallback)noko_io_read,
(xmlInputCloseCallback)noko_io_close,
(void *)rb_io, c_encoding);
if (!c_context) {
rb_raise(rb_eRuntimeError, "failed to create xml sax parser context");
}
Expand Down Expand Up @@ -113,34 +124,43 @@ noko_xml_sax_parser_context_s_file(VALUE rb_class, VALUE rb_path)
/*
* call-seq:
* memory(input)
* memory(input, encoding_id)
*
* Create a parser context for the +input+ String.
*
* [Parameters]
* - +input+ (String) The input string to be parsed.
* - +encoding_id+ (optional Integer) The libxml2 encoding ID to use, see SAX::Parser::ENCODINGS
* (default: 0, corresponding to "NONE", which means to autodetect the encoding)
*
* [Returns] Nokogiri::XML::SAX::ParserContext
*
* 💡 Calling Nokogiri::XML::SAX::Parser.parse is more convenient for most use cases.
*/
static VALUE
noko_xml_sax_parser_context_s_memory(VALUE rb_class, VALUE rb_input)
noko_xml_sax_parser_context_s_memory(int argc, VALUE *argv, VALUE rb_class)
{
xmlParserCtxtPtr c_context;
VALUE rb_input, rb_encoding_id;

rb_scan_args(argc, argv, "11", &rb_input, &rb_encoding_id);
Check_Type(rb_input, T_STRING);

if (!(int)RSTRING_LEN(rb_input)) {
rb_raise(rb_eRuntimeError, "input string cannot be empty");
}

c_context = xmlCreateMemoryParserCtxt(StringValuePtr(rb_input),
(int)RSTRING_LEN(rb_input));
xmlParserCtxtPtr c_context =
xmlCreateMemoryParserCtxt(StringValuePtr(rb_input), (int)RSTRING_LEN(rb_input));
if (c_context->sax) {
xmlFree(c_context->sax);
c_context->sax = NULL;
}

if (RB_TEST(rb_encoding_id)) {
xmlCharEncoding c_encoding = (xmlCharEncoding)NUM2INT(rb_encoding_id);
xmlSwitchEncoding(c_context, c_encoding);
}

return noko_xml_sax_parser_context_wrap(rb_class, c_context);
}

Expand Down Expand Up @@ -353,8 +373,8 @@ noko_init_xml_sax_parser_context(void)

rb_undef_alloc_func(cNokogiriXmlSaxParserContext);

rb_define_singleton_method(cNokogiriXmlSaxParserContext, "io", noko_xml_sax_parser_context_s_io, 2);
rb_define_singleton_method(cNokogiriXmlSaxParserContext, "memory", noko_xml_sax_parser_context_s_memory, 1);
rb_define_singleton_method(cNokogiriXmlSaxParserContext, "io", noko_xml_sax_parser_context_s_io, -1);
rb_define_singleton_method(cNokogiriXmlSaxParserContext, "memory", noko_xml_sax_parser_context_s_memory, -1);
rb_define_singleton_method(cNokogiriXmlSaxParserContext, "file", noko_xml_sax_parser_context_s_file, 1);

rb_define_method(cNokogiriXmlSaxParserContext, "parse_with", noko_xml_sax_parser_context__parse_with, 1);
Expand Down
48 changes: 43 additions & 5 deletions lib/nokogiri/xml/sax/parser.rb
Original file line number Diff line number Diff line change
Expand Up @@ -69,7 +69,7 @@ class Attribute < Struct.new(:localname, :prefix, :uri, :value)
attr_accessor :encoding

# Create a new Parser with +doc+ and +encoding+
def initialize(doc = Nokogiri::XML::SAX::Document.new, encoding = "UTF-8")
def initialize(doc = Nokogiri::XML::SAX::Document.new, encoding = nil)
@encoding = check_encoding(encoding)
@document = doc
@warned = false
Expand All @@ -89,9 +89,27 @@ def parse(thing, &block)
end

###
# Parse given +io+
# :call-seq:
# parse_io(io)
# parse_io(io) { |parser_context| ... }
# parse_io(io, encoding)
# parse_io(io, encoding) { |parser_context| ... }
#
# Parse an input stream.
#
# [Parameters]
# - +io+ (IO) The readable IO object from which to read input
# - +encoding+ (optional String) An encoding name to use when parsing the input. (default
# `nil` for autodetection)
#
# [Yields]
# If a block is given, the underlying ParserContext object will be yielded. This can be used to set
# options on the parser context before parsing begins.
#
def parse_io(io, encoding = @encoding)
ctx = ParserContext.io(io, ENCODINGS[check_encoding(encoding)])
encoding_id = encoding ? ENCODINGS[check_encoding(encoding)] : ENCODINGS["NONE"]

ctx = ParserContext.io(io, encoding_id)
yield ctx if block_given?
ctx.parse_with(self)
end
Expand All @@ -108,15 +126,35 @@ def parse_file(filename)
ctx.parse_with(self)
end

def parse_memory(data)
ctx = ParserContext.memory(data)
# :call-seq:
# parse_memory(input)
# parse_memory(input) { |parser_context| ... }
# parse_memory(input, encoding)
# parse_memory(input, encoding) { |parser_context| ... }
#
# Parse an input string.
#
# [Parameters]
# - +input+ (String) The input string to be parsed.
# - +encoding+ (optional String) An encoding name to use when parsing the input. (default
# `nil` for autodetection)
#
# [Yields]
# If a block is given, the underlying ParserContext object will be yielded. This can be used to set
# options on the parser context before parsing begins.
#
def parse_memory(input, encoding = @encoding)
encoding_id = encoding ? ENCODINGS[check_encoding(encoding)] : ENCODINGS["NONE"]

ctx = ParserContext.memory(input, encoding_id)
yield ctx if block_given?
ctx.parse_with(self)
end

private

def check_encoding(encoding)
return nil unless encoding
encoding.upcase.tap do |enc|
raise ArgumentError, "'#{enc}' is not a valid encoding" unless ENCODINGS[enc]
end
Expand Down
92 changes: 84 additions & 8 deletions test/xml/sax/test_parser.rb
Original file line number Diff line number Diff line change
Expand Up @@ -215,9 +215,90 @@ class TestCase
end
end

it "has correct encoding" do
parser = Nokogiri::XML::SAX::Parser.new(Doc.new, "UTF-8")
assert_equal("UTF-8", parser.encoding)
describe "encoding" do
# proper ISO-8859-1 encoding
let(:xml_encoding_iso8859) { "<?xml version='1.0' encoding='ISO-8859-1'?>\n<content>B\xF6hnhardt</content>" }
# this input string is really UTF-8 but is marked as ISO-8859-1
let(:xml_encoding_broken) { "<?xml version='1.0' encoding='ISO-8859-1'?>\n<content>Böhnhardt</content>" }
# this input string is really ISO-8859-1 but is marked as UTF-8
let(:xml_encoding_broken2) { "<?xml version='1.0' encoding='UTF-8'?>\n<content>B\xF6hnhardt</content>" }

it "is nil by default to indicate encoding should be autodetected" do
parser = Nokogiri::XML::SAX::Parser.new(Doc.new)
assert_nil(parser.encoding)
end

it "can be set in the initializer" do
assert_equal("UTF-8", Nokogiri::XML::SAX::Parser.new(Doc.new, "UTF-8").encoding)
assert_equal("ISO-2022-JP", Nokogiri::XML::SAX::Parser.new(Doc.new, "ISO-2022-JP").encoding)
end

it "raises when given an invalid encoding name" do
assert_raises(ArgumentError) { Nokogiri::XML::SAX::Parser.new(Doc.new, "not an encoding") }
assert_raises(ArgumentError) { parser.parse_io(StringIO.new("<root/>"), "not an encoding") }
assert_raises(ArgumentError) { parser.parse_memory("<root/>", "not an encoding") }
end

it "autodetects the encoding if not overridden" do
parser = Nokogiri::XML::SAX::Parser.new(Doc.new)
parser.parse(xml_encoding_iso8859)

# correctly converted the input ISO-8859-1 to UTF-8 for the callback
assert_equal("Böhnhardt", parser.document.data.join)
end

it "overrides the ISO-8859-1 document's encoding when set via initializer" do
# broken encoding!
parser = Nokogiri::XML::SAX::Parser.new(Doc.new)
parser.parse(xml_encoding_broken)

assert_equal("Böhnhardt", parser.document.data.join)

# override the encoding
parser = Nokogiri::XML::SAX::Parser.new(Doc.new, "UTF-8")
parser.parse(xml_encoding_broken)

assert_equal("Böhnhardt", parser.document.data.join)
end

it "overrides the UTF-8 document's encoding when set via initializer" do
# broken encoding!
parser = Nokogiri::XML::SAX::Parser.new(Doc.new)
parser.parse(xml_encoding_broken2)

assert(parser.document.errors.any? { |e| e.match(/Invalid bytes in character encoding/) })

# override the encoding
parser = Nokogiri::XML::SAX::Parser.new(Doc.new, "ISO-8859-1")
parser.parse(xml_encoding_broken2)

assert_equal("Böhnhardt", parser.document.data.join)
refute(parser.document.errors.any? { |e| e.match(/Invalid bytes in character encoding/) })
end

it "can be set via parse_io" do
parser = Nokogiri::XML::SAX::Parser.new(Doc.new)
parser.parse_io(StringIO.new(xml_encoding_broken), "UTF-8")

assert_equal("Böhnhardt", parser.document.data.join)

parser = Nokogiri::XML::SAX::Parser.new(Doc.new)
parser.parse_io(StringIO.new(xml_encoding_broken2), "ISO-8859-1")

assert_equal("Böhnhardt", parser.document.data.join)
end

it "can be set via parse_memory" do
parser = Nokogiri::XML::SAX::Parser.new(Doc.new)
parser.parse_memory(xml_encoding_broken, "UTF-8")

assert_equal("Böhnhardt", parser.document.data.join)

parser = Nokogiri::XML::SAX::Parser.new(Doc.new)
parser.parse_memory(xml_encoding_broken2, "ISO-8859-1")

assert_equal("Böhnhardt", parser.document.data.join)
end
end

it "error strings are UTF-8" do
Expand Down Expand Up @@ -294,11 +375,6 @@ class TestCase
end
end

it "raises when given an invalid encoding name" do
assert_raises(ArgumentError) { Nokogiri::XML::SAX::Parser.new(Doc.new, "not an encoding") }
assert_raises(ArgumentError) { parser.parse_io(StringIO.new("<root/>"), "not an encoding") }
end

it "cdata_block is called when CDATA is parsed" do
parser.parse_memory(<<~XML)
<p id="asdfasdf">
Expand Down

0 comments on commit ec4389e

Please sign in to comment.