Releases: Unstructured-IO/unstructured
0.13.4
Enhancements
- Unique and deterministic hash IDs for elements. Element IDs produced by any partitioning function are now deterministic and unique at the document level by default. Previously, hashes were based only on text; they now also take into account the element's sequence number on a page, the page's number in the document, and the document's file name.
- Enable remote chunking via unstructured-ingest. Chunking using unstructured-ingest was previously limited to local chunking using the strategies `basic` and `by_title`. Remote chunking options via the API are now accessible.
- Save table in cells format. `UnstructuredTableTransformerModel` is able to return the predicted table in cells format.
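
The deterministic, document-unique ID scheme described in the first enhancement above can be illustrated with a minimal sketch. The function name `element_id`, the field order, the separator, and the use of SHA-256 truncated to 32 hex characters are all assumptions made for illustration; the library's actual hashing recipe may differ.

```python
import hashlib


def element_id(text: str, sequence_number: int, page_number: int, filename: str) -> str:
    """Hypothetical helper: derive an ID from text plus positional context."""
    # Join the fields with a separator that is unlikely to appear in the data,
    # so that neighbouring fields cannot blur into each other.
    payload = "\x1f".join([filename, str(page_number), str(sequence_number), text])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:32]


# The same text at two different positions gets two different IDs ...
first = element_id("Introduction", 0, 1, "report.pdf")
second = element_id("Introduction", 3, 2, "report.pdf")
# ... while re-running on the same input reproduces the same ID.
again = element_id("Introduction", 0, 1, "report.pdf")
```

A text-only hash would give `first` and `second` the same value, which is the collision the new default is meant to avoid.
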
Features
- Add a `PDF_ANNOTATION_THRESHOLD` environment variable to control the capture of embedded links in `partition_pdf()` for the `fast` strategy.
- Add integration with the Google Cloud Vision API. Adds a third OCR provider, alongside Tesseract and Paddle: the Google Cloud Vision API.
Fixes
- Remove `ElementMetadata.section` field. This field was unused and not populated by any partitioners.
0.13.3
Enhancements
- Remove duplicate image elements. Remove image elements identified by PDFMiner that have similar bounding boxes and the same text.
- Add support for `start_index` in `html` links extraction.
- Add `strategy` arg value to `_PptxPartitionerOptions`. This makes this partitioning option available for sub-partitioners to come that may optionally use inference or other expensive operations to improve the partitioning.
- Support pluggable sub-partitioner for PPTX Picture shapes. Use a distinct sub-partitioner for partitioning PPTX Picture (image) shapes and allow the default picture sub-partitioner to be replaced at run-time by one of the user's choosing.
- Introduce `starting_page_number` parameter to partitioning functions. It applies to those partitioners which support `page_number` in the element's metadata: PDF, TIFF, XLSX, DOC, DOCX, PPT, PPTX.
- Redesign the internal mechanism of assigning element IDs. This allows for further enhancements related to element IDs such as deterministic and document-unique hashes. The way partitioning functions operate hasn't changed, which means `unique_element_ids` continues to be `False` by default, utilizing text hashes.
Features
Fixes
- Add support for extracting text from tag tails in HTML. This fix adds the ability to generate separate elements using tag tails.
- Add support for extracting text from `<b>` tags in HTML. Now `partition_html()` can extract text from `<b>` tags inside container tags (like `<div>`, `<pre>`).
- Fix pip-compile make target. Adds the `base.in` dependency that was missing from the requirements make file.
0.13.2
0.13.1
Enhancements
- Drop constraint on pydantic, supporting later versions. All dependencies had pydantic pinned at an old version. This explicit pin was removed, allowing the latest version to be pulled in when requirements are compiled.
Features
- Add a set of new `ElementType`s to extend future element types.
Fixes
- Fix `partition_html()` swallowing some paragraphs. `partition_html()` only considers elements with limited depth to avoid becoming the text representation of a giant div. This fix increases the limit value.
- Fix SFTP. Adds flag options to the SFTP connector on whether to use ssh keys / agent, with flag values defaulting to False. This is to prevent looking for ssh files when using username and password. Currently, username and password are required, making that always the case.
0.13.0
Enhancements
- Add `.metadata.is_continuation` to text-split chunks. `.metadata.is_continuation=True` is added to second-and-later chunks formed by text-splitting an oversized `Table` element but not to their counterpart `Text` element splits. Add this indicator for `CompositeElement` to allow text-split continuation chunks to be identified for downstream processes that may wish to skip intentionally redundant metadata values in continuation chunks.
- Add `compound_structure_acc` metric to table eval. Add a new property to `unstructured.metrics.table_eval.TableEvaluation`: `composite_structure_acc`, which is computed from the element level row and column index and content accuracy scores.
- Add `.metadata.orig_elements` to chunks. `.metadata.orig_elements: list[Element]` is added to chunks during the chunking process (when requested) to allow access to information from the elements each chunk was formed from. This is useful for example to recover metadata fields that cannot be consolidated to a single value for a chunk, like `page_number`, `coordinates`, and `image_base64`.
- Add `--include_orig_elements` option to Ingest CLI. By default, when chunking, the original elements used to form each chunk are added to `chunk.metadata.orig_elements` for each chunk. The `include_orig_elements` parameter allows the user to turn off this behavior to produce a smaller payload when they don't need this metadata.
- Add Google VertexAI embedder. Adds VertexAI embeddings to support embedding via Google Vertex AI.
Features
- Chunking populates `.metadata.orig_elements` for each chunk. This behavior allows the text and metadata of the elements combined to make each chunk to be accessed. This can be important for example to recover metadata such as `.coordinates` that cannot be consolidated across elements and so is dropped from chunks. This option is controlled by the `include_orig_elements` parameter to `partition_*()` or to the chunking functions. This option defaults to `True` so original-elements are preserved by default. This behavior is not yet supported via the REST APIs or SDKs but will be in a closely subsequent PR to other `unstructured` repositories. The original elements will also not serialize or deserialize yet; this will also be added in a closely subsequent PR.
- Add Clarifai destination connector. Adds support for writing partitioned and chunked documents into Clarifai.
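
The idea behind `orig_elements` can be sketched in a few lines. The `Element` and `Chunk` classes and the `make_chunk()` and `pages_covered()` helpers below are hypothetical stand-ins written for this illustration, not the library's own classes; only the `include_orig_elements` name and its `True` default come from the notes above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Element:  # hypothetical stand-in, not the library class
    text: str
    page_number: Optional[int] = None


@dataclass
class Chunk:  # hypothetical stand-in for a chunk element
    text: str
    orig_elements: List[Element] = field(default_factory=list)


def make_chunk(elements: List[Element], include_orig_elements: bool = True) -> Chunk:
    """Combine elements into one chunk, optionally keeping the originals."""
    text = "\n\n".join(e.text for e in elements)
    kept = list(elements) if include_orig_elements else []
    return Chunk(text=text, orig_elements=kept)


def pages_covered(chunk: Chunk) -> List[int]:
    """Recover per-element page numbers that a single chunk value cannot hold."""
    return sorted({e.page_number for e in chunk.orig_elements if e.page_number is not None})


elements = [Element("Title", 1), Element("Body text", 1), Element("More text", 2)]
chunk = make_chunk(elements)
small_chunk = make_chunk(elements, include_orig_elements=False)
```

Here `pages_covered(chunk)` can report every page the chunk spans, while `small_chunk` trades that information for a smaller payload.
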
Fixes
- Fix `clean_pdfminer_inner_elements()` to remove only pdfminer (embedded) elements merged with inferred elements. Previously, some embedded elements were removed even if they were not merged with inferred elements. Now, only embedded elements that are already merged with inferred elements are removed.
- Clarify IAM Role Requirement for GCS Platform Connectors. The GCS Source Connector requires Storage Object Viewer and GCS Destination Connector requires Storage Object Creator IAM roles.
- Change table extraction defaults. Change table extraction defaults in favor of using the `skip_infer_table_types` parameter and reflect these changes in documentation.
- Fix OneDrive dates with inconsistent formatting. Adds logic to conditionally support dates returned by office365 that may vary in date formatting or may be a datetime rather than a string. See previous fix for SharePoint.
- Adds tracking for AstraDB. Adds tracking info so AstraDB can see what source called their API.
- Support AWS Bedrock Embeddings in ingest CLI. The configs required to instantiate the bedrock embedding class are now exposed in the API and the version of boto being used meets the minimum requirement to introduce the bedrock runtime required to hit the service.
- Change MongoDB redacting. The original redact secrets solution was causing issues in platform. This fix uses our standard logging redact solution.
0.12.6
Enhancements
- Improve ability to capture embedded links in `partition_pdf()` for `fast` strategy. Previously, a threshold value that affects the capture of embedded links was set to a fixed value by default. This allows users to specify the threshold value for better capturing.
- Refactor `add_chunking_strategy` decorator to dispatch by name. Add `chunk()` function to be used by the `add_chunking_strategy` decorator to dispatch chunking call based on a chunking-strategy name (that can be dynamic at runtime). This decouples chunking dispatch from only those chunkers known at "compile" time and enables runtime registration of custom chunkers.
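
Dispatching by name, as described for `chunk()` above, is commonly implemented with a registry. The sketch below shows that general pattern only; the registry, `register_chunker()`, the signature of `chunk()`, and the `all_in_one` strategy are hypothetical and do not reproduce the library's internals.

```python
from typing import Callable, Dict, List

Chunker = Callable[[List[str]], List[str]]

# Registry mapping a chunking-strategy name to the callable implementing it.
_CHUNKERS: Dict[str, Chunker] = {}


def register_chunker(name: str, chunker: Chunker) -> None:
    """Hypothetical registration hook; it can be called at runtime."""
    _CHUNKERS[name] = chunker


def chunk(elements: List[str], chunking_strategy: str) -> List[str]:
    """Dispatch to whichever chunker is registered under the given name."""
    try:
        chunker = _CHUNKERS[chunking_strategy]
    except KeyError:
        raise ValueError(f"unknown chunking strategy: {chunking_strategy!r}") from None
    return chunker(elements)


def chunk_everything_together(elements: List[str]) -> List[str]:
    """Toy custom chunker: merge all elements into a single chunk."""
    return [" ".join(elements)] if elements else []


register_chunker("all_in_one", chunk_everything_together)
result = chunk(["a", "b", "c"], "all_in_one")
```

Because the lookup happens when `chunk()` is called, a strategy registered after import time is dispatched exactly like a built-in one.
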
Features
- Added Unstructured Platform Documentation. The Unstructured Platform is currently in beta. The documentation provides how-to guides for setting up workflow automation, job scheduling, and configuring source and destination connectors.
Fixes
- Partitioning raises on file-like object with `.name` not a local file path. When partitioning a file using the `file=` argument, and `file` is a file-like object (e.g. `io.BytesIO`) having a `.name` attribute, and the value of `file.name` is not a valid path to a file present on the local filesystem, `FileNotFoundError` is raised. This prevents use of the `file.name` attribute for downstream purposes to, for example, describe the source of a document retrieved from a network location via HTTP.
- Fix SharePoint dates with inconsistent formatting. Adds logic to conditionally support dates returned by office365 that may vary in date formatting or may be a datetime rather than a string.
- Include warnings about the potential risk of installing a version of `pandoc` which does not support RTF files, plus instructions that will help resolve that issue.
- Incorporate the `install-pandoc` Makefile recipe into relevant stages of CI workflow, ensuring it is a version that supports RTF input files.
- Fix Google Drive source key. Allow passing string for source connector key.
- Fix table structure evaluations calculations. Replaced special value `-1.0` with `np.nan` and corrected row filtering of file metrics based on that.
- Fix Sharepoint-with-permissions test. Ignore permissions metadata, update test.
- Fix table structure evaluations for edge case. Fixes the issue when the prediction does not contain any table; it no longer errors in such a case.
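
The switch from `-1.0` to `np.nan` in the table structure evaluation fix above matters because a numeric sentinel silently takes part in arithmetic, whereas NaN can be filtered out explicitly. The following sketch uses the standard library's `math.nan` instead of NumPy to stay dependency-free; the helper name and the sample scores are made up for illustration.

```python
import math
from typing import List


def mean_ignoring_missing(scores: List[float]) -> float:
    """Average the scores, skipping entries marked as missing with NaN."""
    valid = [s for s in scores if not math.isnan(s)]
    return sum(valid) / len(valid) if valid else math.nan


# Three files were evaluated, and one of them had no table to score.
with_sentinel = [1.0, 0.5, -1.0]
with_nan = [1.0, 0.5, math.nan]

# The -1.0 sentinel is averaged in like a real score and drags the mean down.
naive_mean = sum(with_sentinel) / len(with_sentinel)
# With NaN as the marker, the missing entry is excluded before averaging.
clean_mean = mean_ignoring_missing(with_nan)
```
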
0.12.5
Features
- Header and footer detection for fast strategy. `partition_pdf` with `fast` strategy now detects elements that are in the top or bottom 5 percent of the page as headers and footers.
- Add parent_element to overlapping case output. Adds parent_element to the output for `identify_overlapping_or_nesting_case` and `catch_overlapping_and_nested_bboxes` functions.
- Add table structure evaluation. Adds a new function to evaluate the structure of a table and return a metric that represents the quality of the table structure. This function is used to evaluate the quality of the table structure and the table contents.
- Add AstraDB destination connector. Adds support for writing embedded documents into an AstraDB vector database.
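
The position-based header and footer rule in the first feature above can be sketched as follows. This is an illustration under stated assumptions: the function name is hypothetical, y coordinates are assumed to grow downward from the top of the page, and an element is assumed to count only when it lies entirely inside the 5 percent band. The library's actual rule may treat coordinates and partial overlap differently.

```python
from typing import Literal

Region = Literal["header", "footer", "body"]


def classify_by_position(
    top: float, bottom: float, page_height: float, margin_fraction: float = 0.05
) -> Region:
    """Hypothetical classifier; y grows downward from the top of the page."""
    margin = page_height * margin_fraction
    if bottom <= margin:
        return "header"  # element lies entirely within the top band
    if top >= page_height - margin:
        return "footer"  # element lies entirely within the bottom band
    return "body"
```

For a page 1000 units tall the bands are 50 units deep, so an element spanning y = 10 to 40 falls in the header band and one spanning y = 960 to 990 falls in the footer band.
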
Fixes
- Fix passing list type parameters when calling unstructured API via `partition_via_api()`. Update `partition_via_api()` to convert all list type parameters to JSON formatted strings before calling the unstructured client SDK. This will support image block extraction via `partition_via_api()`.
- Add OctoAI embedder. Adds support for embeddings via OctoAI.
- Fix `check_connection` in opensearch, databricks, postgres, azure connectors.
- Fix don't treat plain text files with double quotes as JSON. If a file can be deserialized as JSON but it deserializes as a string, treat it as plain text even though it's valid JSON.
- Fix cluster of bugs in `partition_xlsx()` that dropped content. Algorithm for detecting "subtables" within a worksheet dropped table elements for certain patterns of populated cells such as when a trailing single-cell row appeared in a contiguous block of populated cells.
- Improved documentation. Fixed broken links and improved readability on `Key Concepts` page.
- Rename `OpenAiEmbeddingConfig` to `OpenAIEmbeddingConfig`.
- Fix partition_json() doesn't chunk. The `@add_chunking_strategy` decorator was missing from `partition_json()` such that pre-partitioned documents serialized to JSON did not chunk when a chunking-strategy was specified.
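
The plain-text-versus-JSON fix above rests on the fact that a quoted sentence is valid JSON that deserializes to a string. A minimal sketch of such a check follows; the function name is hypothetical and the library's file-type detection involves more than this.

```python
import json


def looks_like_json_document(text: str) -> bool:
    """Hypothetical check: valid JSON counts only if it is not a bare string."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return False
    # '"hello"' is valid JSON but deserializes to a str, so treat it as plain text.
    return not isinstance(parsed, str)
```

With this rule a file containing only `"just a quoted sentence"` is handled as plain text, while `[{"text": "hi"}]` is still recognized as JSON.
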
0.12.4
Enhancements
- Apply New Version of `black` formatting. The `black` library recently introduced a new major version that introduces new formatting conventions. This change brings code in the `unstructured` repo into compliance with the new conventions.
- Move ingest imports to local scopes. Moved ingest dependencies into local scopes to be able to import ingest connector classes without the need of installing imported external dependencies. This allows lightweight use of the classes (not the instances; to use the instances as intended you'll still need the dependencies).
- Add support for `.p7s` files. `partition_email` can now process `.p7s` files. The signature for the signed message is extracted and added to metadata.
- Fallback to valid content types for emails. If the user selected content type does not exist on the email message, `partition_email` now falls back to another valid content type if it's available.
Features
- Add .heic file partitioning. .heic image files were previously unsupported and are now supported through `partition_image()`.
- Add the ability to specify an alternate OCR implementation by implementing an `OCRAgent` interface and specifying it using the `OCR_AGENT` environment variable.
- Add Vectara destination connector. Adds support for writing partitioned documents into a Vectara index.
Fixes
- Fix `partition_pdf()` not working when using chipper model with `file`.
- Handle common incorrect arguments for `languages` and `ocr_languages`. Users are regularly receiving errors on the API because they are defining `ocr_languages` or `languages` with additional quotation marks, brackets, and similar mistakes. This update handles common incorrect arguments and raises an appropriate warning.
- Default `hi_res_model_name` now relies on `unstructured-inference`. When no explicit `hi_res_model_name` is passed into `partition` or `partition_pdf_or_image` the default model is picked by `unstructured-inference`'s settings or os env variable `UNSTRUCTURED_HI_RES_MODEL_NAME`; it now returns the same model name regardless of `infer_table_structure`'s value; this function will be deprecated in the future and the default model name will simply rely on `unstructured-inference` and will not consider os env in a future release.
- Fix remove Vectara requirements from setup.py - there are no dependencies.
- Add missing dependency files to package manifest. Updates the file path for the ingest dependencies and adds missing extra dependencies.
- Add title to Vectara upload - was not separated out from initial connector.
- Fix change OpenSearch port to fix potential conflict with Elasticsearch in ingest test.
0.12.3
Enhancements
- Driver for MongoDB connector. Adds a driver with `unstructured` version information to the MongoDB connector.
Features
- Add Databricks Volumes destination connector. Databricks Volumes connector added to ingest CLI. Users may now use `unstructured-ingest` to write partitioned data to a Databricks Volumes storage service.
Fixes
- Fix support for different Chipper versions and prevent running PDFMiner with Chipper
- Treat YAML files as text. Adds YAML MIME types to the file detection code and treats those files as text.
- Fix FSSpec destination connectors check_connection. FSSpec destination connectors did not use `check_connection`. There was an error when trying to `ls` the destination directory - it may not exist at the moment of connector creation. Now `check_connection` calls `ls` on bucket root and this method is called on `initialize` of destination connector.
- Fix databricks-volumes extra location. `setup.py` is currently pointing to the wrong location for the databricks-volumes extra requirements. This results in errors when trying to build the wheel for unstructured. This change updates to point to the correct path.
- Fix uploading None values to Chroma and Pinecone. Removes keys with None values with Pinecone and Chroma destinations. Pins Pinecone dependency.
- Update documentation. (i) best practice for table extraction by using 'skip_infer_table_types' param, instead of 'pdf_infer_table_structure', and (ii) fixed CSS, RST issues and typo in the documentation.
- Fix postgres storage of link_texts. Formatting of link_texts was breaking metadata storage.
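
Removing keys with `None` values before upload, as in the Chroma and Pinecone fix above, amounts to a simple filter. The helper name and the sample metadata below are hypothetical; the connectors' actual code may handle nested structures as well.

```python
from typing import Any, Dict


def drop_none_values(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the record without keys whose value is None."""
    return {key: value for key, value in record.items() if value is not None}


metadata = {"filename": "report.pdf", "page_number": 3, "url": None}
cleaned = drop_none_values(metadata)
```

The filter tests `is not None` rather than truthiness, so legitimate falsy values such as `0` or an empty string are kept.
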
0.12.2
Enhancements
Features
Fixes
- Fix index error in table processing. Bumps the `unstructured-inference` version to address an index error that occurs on some tables in the table transformer object.