Fix: partition pdf overflow error (#2054)
Closes #2050.
### Summary
- set zoom to `1` if zoom is less than or equal to `0` when parsing Tesseract OCR
data
- update `_determine_pdf_auto_strategy` to return the `hi_res` strategy
if either `infer_table_structure` or `extract_images_in_pdf` is true
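
The two changes above can be sketched in isolation. This is a minimal illustration rather than the library's code: `scale_bbox` and `pick_auto_strategy` are hypothetical helper names, dividing coordinates by `zoom` is an assumption about how the OCR boxes are rescaled, and the `fast`/`ocr_only` fallback is an assumption about the branches that follow the `hi_res` check.

```python
from typing import Tuple


def scale_bbox(
    left: float, top: float, width: float, height: float, zoom: float = 1
) -> Tuple[float, float, float, float]:
    """Hypothetical helper: map an OCR box back to page coordinates."""
    # Guard added by this commit: a non-positive zoom falls back to 1.
    if zoom <= 0:
        zoom = 1
    # Assumption: coordinates are rescaled by dividing by the zoom factor.
    x1, y1 = left / zoom, top / zoom
    x2, y2 = (left + width) / zoom, (top + height) / zoom
    return (x1, y1, x2, y2)


def pick_auto_strategy(
    pdf_text_extractable: bool = True,
    infer_table_structure: bool = False,
    extract_images_in_pdf: bool = False,
) -> str:
    """Hypothetical stand-in for the "auto" strategy decision for PDFs."""
    # After this commit, either option selects "hi_res".
    if infer_table_structure or extract_images_in_pdf:
        return "hi_res"
    # Assumed fallback: "fast" for extractable text, otherwise "ocr_only".
    return "fast" if pdf_text_extractable else "ocr_only"
```

With the guard in place, `scale_bbox(10, 20, 30, 40, zoom=0)` behaves like `zoom=1` instead of dividing by zero, and `pick_auto_strategy(extract_images_in_pdf=True)` selects `hi_res` even when `infer_table_structure` is left at its default.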
### Testing
PDF:
[getty_62-62.pdf](https://github.com/Unstructured-IO/unstructured/files/13322169/getty_62-62.pdf)

Run the following code in both the `main` branch and the `current`
branch.

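```python
from unstructured.partition.pdf import partition_pdf

# Assumption: `path` was not defined in the original snippet; any writable
# directory for the extracted images should work here.
path = "extracted_images"

elements = partition_pdf(
    filename="getty_62-62.pdf",
    extract_images_in_pdf=True,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path=path,
)
```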
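
To compare the two runs, it can help to tally the returned elements by category. A minimal sketch, assuming each element exposes a `category` attribute; `summarize_categories` is a hypothetical helper, not part of `unstructured`, and it falls back to the class name when the attribute is missing.

```python
from collections import Counter
from typing import Iterable


def summarize_categories(elements: Iterable[object]) -> Counter:
    """Count elements by `category`, falling back to the class name."""
    return Counter(
        getattr(element, "category", type(element).__name__)
        for element in elements
    )
```

Calling `summarize_categories(elements)` on each branch gives two counters that can be compared directly.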
christinestraub authored Nov 10, 2023
1 parent f8c180a commit b11c546
Showing 5 changed files with 13 additions and 3 deletions.
4 changes: 3 additions & 1 deletion CHANGELOG.md
@@ -1,4 +1,4 @@
-## 0.10.30-dev5
+## 0.10.30

### Enhancements

@@ -12,6 +12,8 @@

### Fixes

+* **Fix logic that determines pdf auto strategy.** Previously, `_determine_pdf_auto_strategy` returned `hi_res` strategy only if `infer_table_structure` was true. It now returns the `hi_res` strategy if either `infer_table_structure` or `extract_images_in_pdf` is true.
+* **Fix invalid coordinates when parsing tesseract ocr data.** Previously, when parsing tesseract ocr data, the ocr data had invalid bboxes if zoom was set to `0`. A logical check is now added to avoid such error.
* **Fix ingest partition parameters not being passed to the api.** When using the --partition-by-api flag via unstructured-ingest, none of the partition arguments are forwarded, meaning that these options are disregarded. With this change, we now pass through all of the relevant partition arguments to the api. This allows a user to specify all of the same partition arguments they would locally and have them respected when specifying --partition-by-api.
* **Support tables in section-less DOCX.** Generalize solution for MS Chat Transcripts exported as DOCX by including tables in the partitioned output when present.
* **Support tables that contain only numbers when partitioning via `ocr_only`** Tables that contain only numbers are returned as floats in a pandas.DataFrame when the image is converted from `.image_to_data()`. An AttributeError was raised downstream when trying to `.strip()` the floats.
2 changes: 1 addition & 1 deletion unstructured/__version__.py
@@ -1 +1 @@
-__version__ = "0.10.30-dev5" # pragma: no cover
+__version__ = "0.10.30" # pragma: no cover
3 changes: 3 additions & 0 deletions unstructured/partition/ocr.py
@@ -528,6 +528,9 @@ def parse_ocr_data_tesseract(ocr_data: pd.DataFrame, zoom: float = 1) -> List[Te
     data frame will result in its associated bounding box being ignored.
     """

+    if zoom <= 0:
+        zoom = 1
+
     text_regions = []
     for idtx in ocr_data.itertuples():
         text = idtx.text
2 changes: 2 additions & 0 deletions unstructured/partition/pdf.py
@@ -281,6 +281,7 @@ def partition_pdf_or_image(
file=file,
is_image=is_image,
infer_table_structure=infer_table_structure,
+extract_images_in_pdf=extract_images_in_pdf,
)
!= "ocr_only"
):
@@ -304,6 +305,7 @@ def partition_pdf_or_image(
is_image=is_image,
infer_table_structure=infer_table_structure,
pdf_text_extractable=pdf_text_extractable,
+extract_images_in_pdf=extract_images_in_pdf,
)

if strategy == "hi_res":
5 changes: 4 additions & 1 deletion unstructured/partition/strategies.py
@@ -39,6 +39,7 @@ def determine_pdf_or_image_strategy(
     is_image: bool = False,
     infer_table_structure: bool = False,
     pdf_text_extractable: bool = True,
+    extract_images_in_pdf: bool = False,
 ):
     """Determines what strategy to use for processing PDFs or images, accounting for fallback
     logic if some dependencies are not available."""
@@ -62,6 +63,7 @@ def determine_pdf_or_image_strategy(
strategy = _determine_pdf_auto_strategy(
pdf_text_extractable=pdf_text_extractable,
infer_table_structure=infer_table_structure,
+extract_images_in_pdf=extract_images_in_pdf,
)

if file is not None:
@@ -124,12 +126,13 @@ def _determine_image_auto_strategy():
 def _determine_pdf_auto_strategy(
     pdf_text_extractable: bool = True,
     infer_table_structure: bool = False,
+    extract_images_in_pdf: bool = False,
 ):
     """If "auto" is passed in as the strategy, determines what strategy to use
     for PDFs."""
     # NOTE(robinson) - Currrently "hi_res" is the only stategy where
     # infer_table_structure is used.
-    if infer_table_structure:
+    if infer_table_structure or extract_images_in_pdf:
         return "hi_res"

     if pdf_text_extractable:
