fix: parsing pdf error - new_cells as str has no "copy" #3130

Merged 5 commits on Jun 3, 2024
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -1,4 +1,4 @@
-## 0.14.4-dev6
+## 0.14.4

### Enhancements

@@ -12,6 +12,7 @@

### Fixes

+* **Address the issue of unrecognized tables in `UnstructuredTableTransformerModel`** When a table is not recognized, the `element.metadata.text_as_html` attribute is set to an empty string.
* **Remove root handlers in ingest logger**. Removes root handlers in ingest loggers to ensure secrets aren't accidentally exposed in Colab notebooks.
* **Fix V2 S3 Destination Connector authentication** Fixes bugs with S3 Destination Connector where the connection config was neither registered nor properly deserialized.
* **Clarified dependence on particular version of `python-docx`** Pinned `python-docx` version to ensure a particular method `unstructured` uses is included.
2 changes: 1 addition & 1 deletion unstructured/__version__.py
@@ -1 +1 @@
-__version__ = "0.14.4-dev6"  # pragma: no cover
+__version__ = "0.14.4"  # pragma: no cover
3 changes: 2 additions & 1 deletion unstructured/partition/pdf_image/ocr.py
@@ -280,7 +280,8 @@ def supplement_element_with_table_extraction(
cropped_image, ocr_tokens=table_tokens, result_format="cells"
)

-text_as_html = cells_to_html(tatr_cells)
+# NOTE(christine): `tatr_cells == ""` means that the table was not recognized
+text_as_html = "" if tatr_cells == "" else cells_to_html(tatr_cells)
element.text_as_html = text_as_html

if env_config.EXTRACT_TABLE_AS_CELLS:
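
The guard in the hunk above can be illustrated with a minimal, self-contained sketch. It assumes, based only on the PR title, that the original failure was an `AttributeError` raised when a `str` sentinel reached code that calls `.copy()` on a list of cells. `cells_to_html_stub` and `table_html` are hypothetical stand-ins written for this example; they are not the library's `cells_to_html` or its actual implementation.

```python
from typing import Any, Dict, List, Union


def cells_to_html_stub(cells: List[Dict[str, Any]]) -> str:
    """Hypothetical stand-in for `cells_to_html`; expects a list of cell dicts."""
    # Assumption: the real helper copies its input. A str has no .copy(),
    # so passing "" here raises AttributeError.
    cells = cells.copy()
    rows: Dict[int, List[str]] = {}
    for cell in sorted(cells, key=lambda c: (c["row"], c["col"])):
        rows.setdefault(cell["row"], []).append(f"<td>{cell['text']}</td>")
    body = "".join(f"<tr>{''.join(tds)}</tr>" for _, tds in sorted(rows.items()))
    return f"<table>{body}</table>"


def table_html(tatr_cells: Union[str, List[Dict[str, Any]]]) -> str:
    """Mirror the guard from the diff: "" means the table was not recognized."""
    return "" if tatr_cells == "" else cells_to_html_stub(tatr_cells)


if __name__ == "__main__":
    recognized = [
        {"row": 0, "col": 0, "text": "a"},
        {"row": 0, "col": 1, "text": "b"},
    ]
    print(repr(table_html(recognized)))
    print(repr(table_html("")))
```

Checking for the sentinel before conversion keeps the unrecognized-table case on a path that never touches list-only methods, which is consistent with the changelog entry stating that `text_as_html` is set to an empty string in that case.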