Commit 171b5df

fix: set resolve_entities=False in partition_xml (#3088)
### Summary

Closes #3078. Sets `resolve_entities=False` for parsing XML with `lxml` in `partition_xml` to avoid text being dynamically injected into the document.

### Testing

`pytest test_unstructured/partition/test_xml.py` continues to pass with the update.
1 parent 9b83330 commit 171b5df

3 files changed (+5 -3 lines changed)

CHANGELOG.md

+3 -1
@@ -1,4 +1,4 @@
-## 0.14.3-dev1
+## 0.14.3-dev2

 ### Enhancements

@@ -8,6 +8,8 @@

 ### Fixes

+**Turn off XML resolve entities** Sets `resolve_entities=False` for XML parsing with `lxml`
+to avoid text being dynamically injected into the XML document.
 * Add the missing `form_extraction_skip_tables` argument to the `partition_pdf_or_image` call.

 ## 0.14.2

unstructured/__version__.py

+1 -1
@@ -1 +1 @@
-__version__ = "0.14.3-dev1" # pragma: no cover
+__version__ = "0.14.3-dev2" # pragma: no cover

unstructured/partition/xml.py

+1 -1
@@ -51,7 +51,7 @@ def _get_leaf_elements(
     """Parse the XML tree in a memory efficient manner if possible."""
     element_stack = []

-    element_iterator = etree.iterparse(file, events=("start", "end"))
+    element_iterator = etree.iterparse(file, events=("start", "end"), resolve_entities=False)
     # NOTE(alan) If xml_path is used for filtering, I've yet to find a good way to stream
     # elements through in a memory efficient way, so we bite the bullet and load it all into
     # memory.
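
The effect of this one-line change can be illustrated with a minimal sketch outside of `unstructured`. The sample document, the `inject` entity, and the `collect_text` helper below are hypothetical and exist only for illustration; the only call taken from the diff is `etree.iterparse(..., resolve_entities=...)`. Both values are passed explicitly so the comparison does not depend on the default of the installed `lxml` version.

```python
from io import BytesIO

from lxml import etree

# Hypothetical sample document: the internal DTD subset declares an entity
# whose replacement text would be spliced into <p> if entities are resolved.
XML = b"""<?xml version="1.0"?>
<!DOCTYPE doc [
  <!ENTITY inject "INJECTED TEXT">
]>
<doc><p>&inject;</p></doc>
"""


def collect_text(xml_bytes, resolve_entities):
    """Return the non-empty .text of every element seen on an "end" event."""
    texts = []
    element_iterator = etree.iterparse(
        BytesIO(xml_bytes),
        events=("start", "end"),
        resolve_entities=resolve_entities,
    )
    for event, element in element_iterator:
        if event == "end" and element.text:
            texts.append(element.text)
    return texts


# Entities resolved: the replacement text becomes part of the element text.
expanded = collect_text(XML, resolve_entities=True)

# Entities not resolved: the reference stays an entity node in the tree,
# so the replacement text is not part of any element's .text.
unexpanded = collect_text(XML, resolve_entities=False)

print("resolve_entities=True: ", expanded)
print("resolve_entities=False:", unexpanded)
```

With `resolve_entities=False`, the replacement text declared in the DTD should not show up in the collected element text, which is the behavior the changelog entry describes as avoiding dynamically injected text.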
