prepare raises an exception when serializing the new document in PRONOM 89 (Python 2) #104

Closed
mistydemeo opened this issue Mar 12, 2017 · 2 comments

mistydemeo commented Mar 12, 2017

When trying to update to PRONOM 89, I hit an exception raised by the ElementTree serializer while writing Fido's reformatted XML documents. In the original context this happens when we serialize a single large XML document. To isolate the bug, I adapted the script to write out each converted PRONOM record individually, and confirmed that:

  1. This happens when we write individual converted records (but not all of them), and
  2. It does not happen if we write out the original unaltered documents.

Sample traceback:

Traceback (most recent call last):
  File "test.py", line 473, in <module>
    print(ET.tostring(doc), file=devnull)
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1126, in tostring
    ElementTree(element).write(file, encoding, method=method)
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 820, in write
    serialize(write, self._root, encoding, qnames, namespaces)
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 937, in _serialize_xml
    write(_escape_cdata(text, encoding))
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1073, in _escape_cdata
    return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 380: ordinal not in range(128)
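
The isolation step described above (serializing each converted record on its own and noting which ones fail) might look roughly like the sketch below. The helper name `find_unserializable` and the two sample records are hypothetical, and the failing record is simulated by putting a UTF-8 byte string into a text node, which is an assumption made for illustration. On Python 2 that record fails with the `UnicodeDecodeError` shown in the traceback; on Python 3 the same mistake surfaces as a `TypeError`, so the sketch catches both.

```python
import xml.etree.ElementTree as ET


def find_unserializable(records):
    """Return (index, exception) pairs for records that fail to serialize.

    Hypothetical helper: `records` is any iterable of Element objects.
    """
    failures = []
    for index, record in enumerate(records):
        try:
            ET.tostring(record, encoding="UTF-8")
        except (UnicodeError, TypeError) as exc:
            # UnicodeDecodeError on Python 2, TypeError on Python 3
            failures.append((index, exc))
    return failures


# A record whose text is a Unicode string serializes cleanly.
good = ET.Element("format")
ET.SubElement(good, "note").text = u"plain ASCII note"

# A record whose text was encoded to UTF-8 bytes before being assigned.
bad = ET.Element("format")
ET.SubElement(bad, "note").text = u"\u201csmart quotes\u201d".encode("UTF-8")

failures = find_unserializable([good, bad])
print([index for index, _ in failures])
```

Collecting the failures instead of stopping at the first one makes it easier to see what the affected records have in common.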

This is possibly related to #83, but doesn't go through that codepath.


mistydemeo commented Mar 12, 2017

I converted this to use lxml for debugging, and it turns out lxml produces a better traceback here:

Traceback (most recent call last):
  File "test.py", line 471, in <module>
    doc = parse_pronom_xml(f)
  File "test.py", line 377, in parse_pronom_xml
    ET.SubElement(fido_sig, 'note').text = get_text_tna(pronom_sig, 'SignatureNote').encode('UTF-8')
  File "lxml.etree.pyx", line 951, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:46353)
  File "apihelpers.pxi", line 695, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:20953)
  File "apihelpers.pxi", line 683, in lxml.etree._createTextNode (src/lxml/lxml.etree.c:20829)
  File "apihelpers.pxi", line 1393, in lxml.etree._utf8 (src/lxml/lxml.etree.c:27125)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

The explicit .encode calls broke this: the resulting UTF-8-encoded bytestrings contain non-ASCII bytes, and ElementTree treats each one as a byte string that still needs converting rather than as a Unicode string, so Python 2 implicitly decodes it as ASCII first and fails.
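
To make the diagnosis above concrete, here is a minimal sketch of the failure and of the likely fix. It is not the actual Fido code: the sample `note` string stands in for whatever `get_text_tna` returns, and the `signature` root tag is an assumed name. The explicit `.decode("ascii")` spells out the implicit step Python 2 performs when `.encode()` is called on a byte string, so the sketch also runs on Python 3.

```python
import xml.etree.ElementTree as ET

# Stand-in for text returned by get_text_tna (assumed sample value).
note = u"a \u201cquoted\u201d note"

# Python 2's str.encode() first decodes the byte string with the ASCII codec.
# Spelled out explicitly, that step fails on the first byte of the smart
# quote, which is 0xe2 in UTF-8.
try:
    note.encode("UTF-8").decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Fix: keep the text as Unicode and pick the encoding once, at serialization.
fido_sig = ET.Element("signature")  # assumed tag name
ET.SubElement(fido_sig, "note").text = note
serialized = ET.tostring(fido_sig, encoding="UTF-8")
```

Leaving text nodes as Unicode and encoding only in `ET.tostring` keeps a single encoding boundary, which should also satisfy lxml's requirement that strings be Unicode or ASCII.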

mistydemeo added a commit to mistydemeo/fido that referenced this issue Mar 12, 2017
These were presumably meant to avoid this type of issue, but instead
caused ElementTree to fail with a difficult-to-diagnose traceback. This
occurred if any strings being encoded were Unicode strings containing
non-ASCII characters, such as smart quotes.

Fixes openpreserve#104.
Hwesta pushed a commit to Hwesta/fido that referenced this issue Mar 24, 2017
jhsimpson pushed a commit that referenced this issue Jun 16, 2017
jhsimpson added this to the 1.3.6 milestone Jun 29, 2017