You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This commit was created on GitHub.com and signed with GitHub’s verified signature.
The key has expired.
Enhancements
Improve natural reading order Some OCR elements with only spaces in the text have full-page width in the bounding box, which causes the xycut sorting to not work as expected. Now the logic to parse OCR results removes any elements with only spaces (more than one space).
Ingest compression utilities and fsspec connector support Generic utility code added to handle files that get pulled from a source connector that are either tar or zip compressed and uncompress them locally. This is then processed using a local source connector. Currently this functionality has been incorporated into the fsspec connector and all those inheriting from it (currently: Azure Blob Storage, Google Cloud Storage, S3, Box, and Dropbox).
Ingest destination connectors support for writing raw list of elements Along with the default write method used in the ingest pipeline to write the json content associated with the ingest docs, each destination connector can now also write a raw list of elements to the desired downstream location without having an ingest doc associated with it.
Features
Adds element type percent match function In order to evaluate the element type extracted, we add a function that calculates the matched percentage between two frequency dictionary.
Fixes
Fix paddle model file not discoverable Fixes issue where ocr_models/paddle_ocr.py file is not discoverable on PyPI by adding
an __init__.py file under the folder.
Chipper v2 Fixes Includes fix for a memory leak and rare last-element bbox fix. (unstructured-inference==0.7.7)
Fix image resizing issue Includes fix related to resizing images in the tables pipeline. (unstructured-inference==0.7.6)