Preserve control characters
If a control character such as `\u0002` appears in the XML, the REXML parser preserves it, but the Nokogiri parser bails out and leaves an incomplete document. Note that scrubbing the string does not help here, because the character is valid Unicode; it is only invalid in XML 1.0.
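
To illustrate why scrubbing does not help, here is a minimal sketch that checks a string against the `Char` production of the XML 1.0 specification. The constant `XML_1_0_CHAR_RANGES` and the helper `xml_1_0_char?` are hypothetical names used only for this illustration; they are not part of Nori or Nokogiri.

```ruby
# Code point ranges allowed by the XML 1.0 `Char` production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
XML_1_0_CHAR_RANGES = [
  0x9..0xA,
  0xD..0xD,
  0x20..0xD7FF,
  0xE000..0xFFFD,
  0x10000..0x10FFFF
].freeze

# Hypothetical helper (not part of Nori): true if the character
# may appear in an XML 1.0 document.
def xml_1_0_char?(char)
  XML_1_0_CHAR_RANGES.any? { |range| range.cover?(char.ord) }
end

text = "a\u0002c"

# The string is valid UTF-8, so String#scrub has nothing to replace.
valid_unicode = text.valid_encoding?
unchanged_by_scrub = (text.scrub == text)

# U+0002 is nevertheless outside the XML 1.0 character ranges.
invalid_for_xml = text.each_char.reject { |c| xml_1_0_char?(c) }
```

Here `valid_unicode` and `unchanged_by_scrub` are both true, while `invalid_for_xml` contains only the U+0002 character.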

To handle this we extract the character from the error message. For parsing to continue we must also tell Nokogiri to recover from errors.
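
The extraction step can be sketched in isolation, without Nokogiri. The helper `recover_invalid_char` and the sample messages are assumptions made for illustration; the pattern itself is the one used in the diff. Unlike the committed code, this sketch passes an explicit encoding to `Integer#chr`, since the no-argument form only accepts values up to 255 when no default internal encoding is set.

```ruby
# Pattern taken from the commit; the surrounding helper is hypothetical.
INVALID_CHAR_PATTERN = /PCDATA invalid Char value (\d+)/

# Returns the offending character as a UTF-8 string, or nil if the
# message does not report an invalid character.
def recover_invalid_char(message)
  code_point = message[INVALID_CHAR_PATTERN, 1]
  return nil unless code_point

  # An explicit encoding keeps the result UTF-8 instead of US-ASCII.
  code_point.to_i.chr(Encoding::UTF_8)
end

# Sample messages are assumed for illustration only.
recovered = recover_invalid_char("PCDATA invalid Char value 2")
unrelated = recover_invalid_char("Opening and ending tag mismatch")
```

With these inputs, `recovered` is the U+0002 character and `unrelated` is `nil`, so only the invalid-character case is turned back into text and every other error is still reported.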
stenlarsson committed Apr 5, 2024
1 parent 29c145b commit df767c4
Showing 2 changed files with 12 additions and 2 deletions.
10 changes: 8 additions & 2 deletions lib/nori/parser/nokogiri.rb
@@ -46,15 +46,21 @@ def characters(string)
         alias cdata_block characters

         def error(message)
-          @last_error = message
+          if (invalid_chr = message[/PCDATA invalid Char value (\d+)/, 1])
+            characters(invalid_chr.to_i.chr)
+          else
+            @last_error = message
+          end
         end
       end

       def self.parse(xml, options)
         document = Document.new
         document.options = options
         parser = ::Nokogiri::XML::SAX::Parser.new document
-        parser.parse xml
+        parser.parse xml do |ctx|
+          ctx.recovery = true
+        end
         raise ParseError, document.last_error if document.last_error
         document.stack.length > 0 ? document.stack.pop.to_hash : {}
       end
4 changes: 4 additions & 0 deletions spec/nori/nori_spec.rb
@@ -644,6 +644,10 @@
       expect { parse('<foo><bar>foo bar</foo>') }.to raise_error(Nori::ParseError)
     end

+    it "should preserve control characters" do
+      xml = "<tag>a\u0002c</tag>".force_encoding('UTF-8')
+      expect(parse(xml)["tag"]).to eq("a\u0002c")
+    end
   end
 end
