Commit 32df4ee

fix: disable table_as_cells output by default (#3093)
This PR changes the output of table elements: by default, a table element's `metadata.table_as_cells` is now `None`. The data is populated only when the environment variable `EXTRACT_TABLE_AS_CELLS` is set to `true`.

`table_as_cells` was originally designed for evaluating table extraction performance. The format is not as readable as the `text_as_html` metadata for human or RAG consumption, so this data is not needed by default.

Since this output is meant for evaluation use, this PR uses an environment variable to control whether it is present in the partitioned results. This approach avoids adding parameters to the `partition` function call. Adding a new parameter to the `partition` interface would increase its complexity and its maintenance cost, since the parameter would have to be passed down a long chain of function calls to where it is needed.

## test

Run the following code snippet on main vs. this PR:

```python
from unstructured.partition.auto import partition

elements = partition(
    "example-docs/layout-parser-paper-with-table.pdf",
    strategy="hi_res",
    skip_infer_table_types=[],
)
# Collect the cell-level table data; None when it was not populated.
table_cells = [
    getattr(element.metadata, "table_as_cells", None)
    for element in elements
    if element.category == "Table"
]
```

On the main branch `table_cells` contains cell-structured data, but on this branch it is a list of `None`.

However, if we first set the variable in the terminal:

```bash
export EXTRACT_TABLE_AS_CELLS=true
```

and then run the same code again with this PR, `table_cells` contains actual data, the same as on the main branch.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
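The gating described in the commit message can be illustrated with a minimal sketch. This is not the code from the PR: the helper names `table_as_cells_enabled` and `build_table_metadata` are hypothetical, and treating the value case-insensitively is an assumption. Only the variable name `EXTRACT_TABLE_AS_CELLS` and the `None` default come from the description above.

```python
import os
from typing import Mapping, Optional


def table_as_cells_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Hypothetical helper: True only if EXTRACT_TABLE_AS_CELLS is 'true'.

    Case-insensitive matching is an assumption made for this sketch.
    """
    env = os.environ if environ is None else environ
    return env.get("EXTRACT_TABLE_AS_CELLS", "false").strip().lower() == "true"


def build_table_metadata(
    html: str,
    cells: list,
    environ: Optional[Mapping[str, str]] = None,
) -> dict:
    """Hypothetical helper: attach cell data only when the flag is on."""
    return {
        "text_as_html": html,
        # Default behaviour: no cell-level output unless explicitly requested.
        "table_as_cells": cells if table_as_cells_enabled(environ) else None,
    }
```

Reading the flag where the metadata is built, rather than threading a new argument through every caller, is the trade-off the description argues for: the public `partition` signature stays unchanged.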
1 parent 809c7e5 commit 32df4ee

5 files changed: +16 -559 lines changed

CHANGELOG.md

+3 -1

@@ -9,11 +9,13 @@

 ### Fixes

+* **Turn off XML resolve entities** Sets `resolve_entities=False` for XML parsing with `lxml`
+  to avoid text being dynamically injected into the XML document.
 * **Add backward compatibility for the deprecated pdf_infer_table_structure parameter**.
 * **Add the missing `form_extraction_skip_tables` argument to the `partition_pdf_or_image` call**.
-* **Turn off XML resolve entities** Sets `resolve_entities=False` for XML parsing with `lxml`
   to avoid text being dynamically injected into the XML document.
 * **Chromadb change from Add to Upsert using element_id to make idempotent**
+* **Disable `table_as_cells` output by default** to reduce overhead in partition; now `table_as_cells` is only produced when the env `EXTRACT_TABLE_AS_CELLS` is `true`
 * **Reduce excessive logging** Change per page ocr info level logging into detail level trace logging
 * **Replace try block in `document_to_element_list` for handling HTMLDocument** Use `getattr(element, "type", "")` to get the `type` attribute of an element when it exists. This is more explicit way to handle the special case for HTML documents and prevents other types of attribute error from being silenced by the try block


test_unstructured_ingest/expected-structured-output/local-single-file-with-pdf-infer-table-structure/layout-parser-paper-with-table.json

-170
@@ -49,176 +49,6 @@
     "text": "Dataset | Base Model\" Large Model | Notes PubLayNet [38] P/M M Layouts of modern scientific documents PRImA [3) M - Layouts of scanned modern magazines and scientific reports Newspaper [17] P - Layouts of scanned US newspapers from the 20th century \u2018TableBank (18) P P Table region on modern scientific and business document HJDataset (31) | F/M - Layouts of history Japanese documents",
     "metadata": {
       "text_as_html": "<table><thead><th>Dataset</th><th>| Base Model!|</th><th>Large Model</th><th>| Notes</th></thead><tr><td>PubLayNet [33]</td><td>P/M</td><td>M</td><td>Layouts of modern scientific documents</td></tr><tr><td>PRImA [3]</td><td>M</td><td></td><td>Layouts of scanned modern magazines and scientific reports</td></tr><tr><td>Newspaper [17]</td><td>P</td><td></td><td>Layouts of scanned US newspapers from the 20th century</td></tr><tr><td>TableBank [18]</td><td>P</td><td></td><td>Table region on modern scientific and business document</td></tr><tr><td>HIDataset [31]</td><td>P/M</td><td></td><td>Layouts of history Japanese documents</td></tr></table>",
-      "table_as_cells": [
-        {
-          "x": 0,
-          "y": 0,
-          "w": 1,
-          "h": 1,
-          "content": "Dataset"
-        },
-        {
-          "x": 0,
-          "y": 1,
-          "w": 1,
-          "h": 1,
-          "content": "PubLayNet [33]"
-        },
-        {
-          "x": 0,
-          "y": 2,
-          "w": 1,
-          "h": 1,
-          "content": "PRImA [3]"
-        },
-        {
-          "x": 0,
-          "y": 3,
-          "w": 1,
-          "h": 1,
-          "content": "Newspaper [17]"
-        },
-        {
-          "x": 0,
-          "y": 4,
-          "w": 1,
-          "h": 1,
-          "content": "TableBank [18]"
-        },
-        {
-          "x": 0,
-          "y": 5,
-          "w": 1,
-          "h": 1,
-          "content": "HIDataset [31]"
-        },
-        {
-          "x": 1,
-          "y": 0,
-          "w": 1,
-          "h": 1,
-          "content": "| Base Model!|"
-        },
-        {
-          "x": 1,
-          "y": 1,
-          "w": 1,
-          "h": 1,
-          "content": "P/M"
-        },
-        {
-          "x": 1,
-          "y": 2,
-          "w": 1,
-          "h": 1,
-          "content": "M"
-        },
-        {
-          "x": 1,
-          "y": 3,
-          "w": 1,
-          "h": 1,
-          "content": "P"
-        },
-        {
-          "x": 1,
-          "y": 4,
-          "w": 1,
-          "h": 1,
-          "content": "P"
-        },
-        {
-          "x": 1,
-          "y": 5,
-          "w": 1,
-          "h": 1,
-          "content": "P/M"
-        },
-        {
-          "x": 2,
-          "y": 0,
-          "w": 1,
-          "h": 1,
-          "content": "Large Model"
-        },
-        {
-          "x": 2,
-          "y": 1,
-          "w": 1,
-          "h": 1,
-          "content": "M"
-        },
-        {
-          "x": 2,
-          "y": 2,
-          "w": 1,
-          "h": 1,
-          "content": ""
-        },
-        {
-          "x": 2,
-          "y": 3,
-          "w": 1,
-          "h": 1,
-          "content": ""
-        },
-        {
-          "x": 2,
-          "y": 4,
-          "w": 1,
-          "h": 1,
-          "content": ""
-        },
-        {
-          "x": 2,
-          "y": 5,
-          "w": 1,
-          "h": 1,
-          "content": ""
-        },
-        {
-          "x": 3,
-          "y": 0,
-          "w": 1,
-          "h": 1,
-          "content": "| Notes"
-        },
-        {
-          "x": 3,
-          "y": 1,
-          "w": 1,
-          "h": 1,
-          "content": "Layouts of modern scientific documents"
-        },
-        {
-          "x": 3,
-          "y": 2,
-          "w": 1,
-          "h": 1,
-          "content": "Layouts of scanned modern magazines and scientific reports"
-        },
-        {
-          "x": 3,
-          "y": 3,
-          "w": 1,
-          "h": 1,
-          "content": "Layouts of scanned US newspapers from the 20th century"
-        },
-        {
-          "x": 3,
-          "y": 4,
-          "w": 1,
-          "h": 1,
-          "content": "Table region on modern scientific and business document"
-        },
-        {
-          "x": 3,
-          "y": 5,
-          "w": 1,
-          "h": 1,
-          "content": "Layouts of history Japanese documents"
-        }
-      ],
       "filetype": "image/jpeg",
       "languages": [
         "eng"
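
For readers who still need the cell-level output, the removed fixture data suggests how it can be read. The sketch below is an illustration, not part of the commit: `cells_to_grid` is a hypothetical helper, and the interpretation of `x` as column index, `y` as row index, and `w`/`h` as column and row spans is inferred from the values in this fixture rather than from documentation.

```python
from typing import Dict, List


def cells_to_grid(cells: List[Dict]) -> List[List[str]]:
    """Hypothetical helper: expand table_as_cells entries into rows of text.

    Assumes x = column, y = row, w = column span, h = row span.
    A spanning cell has its content repeated in every slot it covers.
    """
    if not cells:
        return []
    n_rows = max(c["y"] + c["h"] for c in cells)
    n_cols = max(c["x"] + c["w"] for c in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        for row in range(c["y"], c["y"] + c["h"]):
            for col in range(c["x"], c["x"] + c["w"]):
                grid[row][col] = c["content"]
    return grid


# A few cells taken from the fixture shown in the diff.
sample_cells = [
    {"x": 0, "y": 0, "w": 1, "h": 1, "content": "Dataset"},
    {"x": 0, "y": 1, "w": 1, "h": 1, "content": "PubLayNet [33]"},
    {"x": 1, "y": 0, "w": 1, "h": 1, "content": "| Base Model!|"},
    {"x": 1, "y": 1, "w": 1, "h": 1, "content": "P/M"},
]
sample_grid = cells_to_grid(sample_cells)
```

With `EXTRACT_TABLE_AS_CELLS` unset, `table_as_cells` is `None`, so callers of a helper like this would need to check for `None` before passing the value in.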
