
Commit

more sophisticated parsing of CJK/nontext boundary in dealing with li…
opoudjis committed Aug 6, 2024
1 parent a0ee052 commit b67e4fa
Showing 2 changed files with 14 additions and 5 deletions.
15 changes: 10 additions & 5 deletions lib/utils/xml.rb
@@ -59,21 +59,26 @@ def noko(_script = "Latn", &block)
 end

 # By default, carriage return in source translates to whitespace;
-# but in CJK, it does not. We don't want carriage returns in the final
-# output because of CJK complications
+# but in CJK, it does not. (Non-CJK text \n CJK)
 def line_sanitise(ret)
   ret.size == 1 and return ret
   (0...(ret.size - 1)).each do |i|
     last = firstchar_xml(ret[i].reverse)
     nextfirst = firstchar_xml(ret[i + 1])
-    /#{CJK}/o.match?(last) && /#{CJK}/o.match?(nextfirst) or
-      ret[i] += " "
+    cjk1 = /#{CJK}/o.match?(last)
+    cjk2 = /#{CJK}/o.match?(nextfirst)
+    text1 = /[^\p{Z}\p{C}]/.match?(last)
+    text2 = /[^\p{Z}\p{C}]/.match?(nextfirst)
+    (cjk1 && (cjk2 || !text2)) and next
+    !text1 && cjk2 and next
+    ret[i] += " "
   end
   ret
 end

 # need to deal with both <em> and its reverse string, >me<
 def firstchar_xml(line)
-  m = /^(<[^>]+>)*(.)/.match(line) or return ""
+  m = /^([<>][^<>]+[<>])*(.)/.match(line) or return ""
   m[2]
 end
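
The change to firstchar_xml is easier to follow with a concrete reversed string. The following is a minimal standalone sketch, not the gem's code: OLD_PATTERN and NEW_PATTERN are copies of the two regexes in the hunk above, and the helper name firstchar is hypothetical, introduced only so both patterns can be tried side by side.

```ruby
# Sketch only: standalone copies of the old and new patterns from the diff.
OLD_PATTERN = /^(<[^>]+>)*(.)/
NEW_PATTERN = /^([<>][^<>]+[<>])*(.)/

# Hypothetical helper: same body as firstchar_xml, with the pattern passed in.
def firstchar(pattern, line)
  m = pattern.match(line) or return ""
  m[2]
end

line = "<em>る</em>"
reversed = line.reverse # ">me/<る>me<": a reversed tag starts with ">" and ends with "<"

# Reading forwards, both patterns skip the leading <em> tag.
puts firstchar(OLD_PATTERN, line)     # る
puts firstchar(NEW_PATTERN, line)     # る

# Reading the reversed string, the old pattern cannot skip ">me/<",
# so it reports the tag delimiter instead of the last text character.
puts firstchar(OLD_PATTERN, reversed) # >
puts firstchar(NEW_PATTERN, reversed) # る
```

Because line_sanitise finds the last character of a line by calling firstchar_xml on the reversed line, the old pattern appears to have classified any line ending in a closing tag by its ">" rather than by its text.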

4 changes: 4 additions & 0 deletions spec/xml_spec.rb
@@ -82,6 +82,10 @@
   expect(Metanorma::Utils.line_sanitise(input)).to be_equivalent_to [
     "す", "<em>る</em>", "場"
   ]
+  input = ["す", "<em>る</em>", "場", ""]
+  expect(Metanorma::Utils.line_sanitise(input)).to be_equivalent_to [
+    "す", "<em>る</em>", "場", ""
+  ]
 end

 it "applies namespace to xpath" do
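
The new spec case, along with a few other boundaries, can be tried outside the gem with the sketch below. It rests on one loud assumption: the gem's CJK constant is not shown in this diff, so CJK_SKETCH is a hypothetical stand-in covering only Han, Hiragana, and Katakana. The two method bodies follow the post-commit code in lib/utils/xml.rb.

```ruby
# Assumption: stand-in for the gem's CJK constant, which this diff does not show.
CJK_SKETCH = /\p{Han}|\p{Hiragana}|\p{Katakana}/
# "Text" here means any character that is neither a separator nor a control.
TEXT_SKETCH = /[^\p{Z}\p{C}]/

def firstchar_xml(line)
  m = /^([<>][^<>]+[<>])*(.)/.match(line) or return ""
  m[2]
end

def line_sanitise(ret)
  ret.size == 1 and return ret
  (0...(ret.size - 1)).each do |i|
    last = firstchar_xml(ret[i].reverse)
    nextfirst = firstchar_xml(ret[i + 1])
    cjk1 = CJK_SKETCH.match?(last)
    cjk2 = CJK_SKETCH.match?(nextfirst)
    text1 = TEXT_SKETCH.match?(last)
    text2 = TEXT_SKETCH.match?(nextfirst)
    # CJK followed by CJK, or by a line with no text: no space.
    (cjk1 && (cjk2 || !text2)) and next
    # A line with no text followed by CJK: no space.
    !text1 && cjk2 and next
    ret[i] += " "
  end
  ret
end

p line_sanitise(["す", "<em>る</em>", "場", ""]) # unchanged: the new spec case
p line_sanitise(["", "場"])   # unchanged: empty line before CJK
p line_sanitise(["abc", "def"]) # space appended to "abc"
p line_sanitise(["abc", "場"])  # space appended: non-CJK text before CJK
p line_sanitise(["場", "abc"])  # space appended: CJK before non-CJK text
```

Under the pre-commit rule, a space was omitted only when both sides were CJK, so the trailing empty string in the new spec case would have caused a space to be appended to "場".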
