Enhancements
Unique and deterministic hash IDs for elements. Element IDs produced by any partitioning
function are now deterministic and unique at the document level by default. Previously, hashes
were based only on text; they now also take into account the element's sequence number on a
page, the page's number in the document, and the document's file name.
Enable remote chunking via unstructured-ingest. Chunking with unstructured-ingest was
previously limited to local chunking using the basic and by_title strategies. Remote chunking
options via the API are now available as well.
Save table in cells format. UnstructuredTableTransformerModel can now return the predicted table in cells format.
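
To illustrate the hash ID change above, here is a minimal sketch of how an ID could be derived from an element's text, its sequence number on the page, the page number, and the file name. The function name `element_id`, the use of SHA-256, the JSON serialization, and the 32-character truncation are all assumptions made for illustration; this is not the library's actual implementation.

```python
import hashlib
import json


def element_id(text: str, sequence_number: int, page_number: int, filename: str) -> str:
    """Hypothetical helper: derive a deterministic ID from the four inputs.

    Assumption: the inputs are serialized as a JSON list so that field
    boundaries stay unambiguous, then hashed with SHA-256.
    """
    payload = json.dumps([filename, page_number, sequence_number, text])
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    # Truncation to 32 hex characters is an arbitrary choice for this sketch.
    return digest[:32]


# Same inputs always give the same ID (deterministic).
first = element_id("Hello", 0, 1, "report.pdf")
again = element_id("Hello", 0, 1, "report.pdf")

# Identical text at a different position gives a different ID,
# which a text-only hash could not guarantee.
second = element_id("Hello", 1, 1, "report.pdf")
```

With a text-only hash, two elements containing the same text (for example a repeated page header) would collide; adding position and file name to the hashed payload avoids that.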
Features
Add a PDF_ANNOTATION_THRESHOLD environment variable to control the capture of embedded links in partition_pdf() for the fast strategy.
Add integration with the Google Cloud Vision API. Adds a third OCR provider, alongside Tesseract and Paddle: the Google Cloud Vision API.
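
As a rough sketch of how an environment-variable-controlled threshold such as PDF_ANNOTATION_THRESHOLD is typically consumed, the snippet below reads the variable and falls back to a default when it is unset or malformed. The helper `read_threshold`, the assumption that the value is a float, and the placeholder default of 0.9 are hypothetical and not taken from the library.

```python
import os


def read_threshold(name: str = "PDF_ANNOTATION_THRESHOLD", default: float = 0.9) -> float:
    """Hypothetical helper: read a float threshold from the environment.

    Assumption: the variable holds a float; the default here is a placeholder.
    """
    raw = os.environ.get(name)
    if raw is None:
        return default
    try:
        return float(raw)
    except ValueError:
        # A malformed value falls back to the default in this sketch.
        return default


# Set the variable before the code that reads it runs.
os.environ["PDF_ANNOTATION_THRESHOLD"] = "0.5"
threshold = read_threshold()
```

In a shell, the equivalent would be to export the variable before launching the Python process that calls partition_pdf().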
Fixes
Remove ElementMetadata.section field. This field was unused and was not populated by any partitioner.