Release 0.10.24 · Unstructured-IO/unstructured

Improve natural reading order: Some OCR elements whose text consists only of spaces have a bounding box that spans the full page width, which prevents the xycut sorting from working as expected. The logic that parses OCR results now removes any element whose text contains only spaces (more than one space).
Ingest compression utilities and fsspec connector support: Adds generic utility code that handles tar- or zip-compressed files pulled from a source connector by uncompressing them locally. The extracted files are then processed using a local source connector. This functionality is currently incorporated into the fsspec connector and all connectors inheriting from it (currently: Azure Blob Storage, Google Cloud Storage, S3, Box, and Dropbox).
Ingest destination connectors support for writing raw list of elements: Along with the default write method used in the ingest pipeline to write the JSON content associated with the ingest docs, each destination connector can now also write a raw list of elements to the desired downstream location without an associated ingest doc.
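
To illustrate the reading-order change listed above, here is a minimal sketch of filtering out OCR regions whose text is only whitespace before sorting. The `OcrRegion` class and `drop_blank_regions` function are hypothetical names for illustration, not the library's API, and the sketch assumes that dropping every whitespace-only region (including empty strings and single spaces) is acceptable.

```python
from dataclasses import dataclass


@dataclass
class OcrRegion:
    """Hypothetical stand-in for one OCR result: text plus a bounding box."""
    text: str
    bbox: tuple  # (x1, y1, x2, y2)


def drop_blank_regions(regions):
    """Keep only regions whose text has at least one non-whitespace character."""
    return [region for region in regions if region.text.strip()]


regions = [
    OcrRegion("Title", (50, 40, 300, 70)),
    OcrRegion("   ", (0, 80, 612, 90)),  # spaces only, spans the full page width
    OcrRegion("Body text", (50, 100, 560, 400)),
]
kept = drop_blank_regions(regions)
```

Removing the blank full-width region means a layout sort no longer sees a box that overlaps every column on the page.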
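
The compression handling described above can be sketched with the Python standard library alone. This is not the utility code shipped in the release; `uncompress_locally` is a hypothetical helper showing one way to detect a zip or tar archive and extract it to a local directory for later processing.

```python
import tarfile
import zipfile
from pathlib import Path


def uncompress_locally(archive_path, output_dir):
    """Extract a zip or tar archive into output_dir and return the extracted files."""
    archive_path = Path(archive_path)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Note: extracting untrusted archives can write outside output_dir;
    # validate member paths first in real use.
    if zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path) as zf:
            zf.extractall(output_dir)
    elif tarfile.is_tarfile(archive_path):
        with tarfile.open(archive_path) as tf:
            tf.extractall(output_dir)
    else:
        raise ValueError(f"Not a zip or tar archive: {archive_path}")

    # Return files only, in a stable order, ready for a local processing step.
    return sorted(p for p in output_dir.rglob("*") if p.is_file())
```

The returned paths could then be handed to whatever step reads local files, which mirrors the "uncompress, then process locally" flow in the note.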

Adds element type percent match function: To evaluate the extracted element types, we add a function that calculates the matched percentage between two frequency dictionaries.
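
The release note does not spell out the formula, so the following is only one plausible definition, not the library's implementation. The hypothetical `percent_match` function treats each dictionary as a mapping from element type to count and reports the share of source counts that the output also contains.

```python
def percent_match(output, source):
    """Fraction of source element counts that are matched by the output counts."""
    total_source = sum(source.values())
    if total_source == 0:
        return 0.0
    # For each element type, at most the source count can be matched.
    matched = sum(min(count, output.get(key, 0)) for key, count in source.items())
    return matched / total_source


output_freq = {"Title": 2, "NarrativeText": 3}
source_freq = {"Title": 2, "NarrativeText": 4, "ListItem": 2}
score = percent_match(output_freq, source_freq)  # 5 matched out of 8 -> 0.625
```

Capping each type at the source count keeps extra elements in the output from inflating the score. Other definitions, such as weighting by category depth, are equally reasonable.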

Fix paddle model file not discoverable: Fixes an issue where the ocr_models/paddle_ocr.py file is not discoverable on PyPI by adding an __init__.py file under the folder.
Chipper v2 Fixes: Includes a fix for a memory leak and a fix for a rare last-element bbox issue. (unstructured-inference==0.7.7)
Fix image resizing issue: Includes a fix related to resizing images in the tables pipeline. (unstructured-inference==0.7.6)
