Releases: Unstructured-IO/unstructured

0.12.2

20 Jan 22:42
149f894

Enhancements

Features

Fixes

  • Fix index error in table processing. Bumps the unstructured-inference version to address an
    index error that occurs on some tables in the table transformer object.

0.12.1

20 Jan 00:13
c81d4e3

Enhancements

  • Allow setting image block crop padding parameter. In certain circumstances, adjusting the image block crop padding can improve image block extraction by preventing extracted image blocks from being clipped.
  • Add support for bitmap images in partition_image. Adds support for .bmp files in
    partition, partition_image, and detect_filetype.
  • Keep all image elements when using "hi_res" strategy. Previously, Image elements with small chunks of text were ignored unless the image block extraction parameters (extract_images_in_pdf or extract_image_block_types) were specified. Now, all image elements are kept regardless of whether the image block extraction parameters are specified.
  • Add filetype detection for .wav files. Adds filetype detection for .wav files.
  • Add "basic" chunking strategy. Add baseline chunking strategy that includes all shared chunking behaviors without breaking chunks on section or page boundaries.
  • Add overlap option for chunking. Add option to overlap chunks. Intra-chunk and inter-chunk overlap are requested separately. Intra-chunk overlap is applied only to the second and later chunks formed by text-splitting an oversized chunk. Inter-chunk overlap may also be specified; this applies overlap between "normal" (not-oversized) chunks.
  • Salesforce connector accepts private key path or value. Salesforce parameter private-key-file has been renamed to private-key. Private key can be provided as path to file or file contents.
  • Update documentation: (i) added verbiage about the free API cap limit, (ii) added deprecation warning on Staging bricks in favor of Destination Connectors, (iii) added warning and code examples to use the SaaS API Endpoints using CLI-vs-SDKs, (iv) fixed example pages formatting, (v) added deprecation on model_name in favor of hi_res_model_name, (vi) added extract_images_in_pdf usage in partition_pdf section, (vii) reorganized and improved the documentation introduction section, and (viii) added PDF table extraction best practices.
  • Add "basic" chunking to ingest CLI. Add options to ingest CLI allowing access to the new "basic" chunking strategy and overlap options.
  • Make Elasticsearch Destination connector arguments optional. Elasticsearch Destination connector write settings are made optional and will rely on default values when not specified.
  • Normalize Salesforce artifact names. Introduced file naming pattern present in other connectors to Salesforce connector.
  • Install Kapa AI chatbot. Added Kapa.ai website widget on the documentation.
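
The intra-chunk overlap described above can be illustrated with a minimal sketch. It assumes plain character-based splitting; `split_with_overlap` and its parameters are hypothetical names used only for illustration and are not the library's API, which may split on different boundaries.

```python
def split_with_overlap(text: str, max_characters: int, overlap: int) -> list[str]:
    """Split oversized text into windows of at most max_characters.

    Every chunk after the first starts with the last `overlap` characters
    of the chunk before it. Illustrative only, not the library's implementation.
    """
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be non-negative and smaller than max_characters")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_characters, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Step back so the next chunk repeats the tail of this one.
        start = end - overlap
    return chunks


chunks = split_with_overlap("abcdefghij", max_characters=4, overlap=1)
print(chunks)
```

Only the second and later chunks carry repeated text, which matches the description of intra-chunk overlap; text that already fits in one window is returned unchanged.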

Features

  • MongoDB Source Connector. New source connector added to all CLI ingest commands to support downloading/partitioning files from MongoDB.
  • Add OpenSearch source and destination connectors. OpenSearch, a fork of Elasticsearch, is a popular storage solution for various functionality such as search, or providing intermediary caches within data pipelines. Feature: Added OpenSearch source connector to support downloading/partitioning files. Added OpenSearch destination connector to be able to ingest documents from any supported source, embed them and write the embeddings / documents into OpenSearch.

Fixes

  • Fix GCS connector converting JSON to string with single quotes. FSSpec serialization caused conversion of JSON token to string with single quotes. GCS requires token in form of dict so this format is now assured.
  • Pin version of unstructured-client. Set minimum version of unstructured-client to avoid raising a TypeError when passing api_key_auth to UnstructuredClient.
  • Fix the serialization of the Pinecone destination connector. Presence of the PineconeIndex object breaks serialization due to TypeError: cannot pickle '_thread.lock' object. This removes that object before serialization.
  • Fix the serialization of the Elasticsearch destination connector. Presence of the _client object breaks serialization due to TypeError: cannot pickle '_thread.lock' object. This removes that object before serialization.
  • Fix the serialization of the Postgres destination connector. Presence of the _client object breaks serialization due to TypeError: cannot pickle '_thread.lock' object. This removes that object before serialization.
  • Fix documentation and sample code for Chroma. Was pointing to wrong examples.
  • Fix flatten_dict to be able to flatten tuples inside dicts. Update flatten_dict function to support flattening tuples inside dicts. This is necessary for objects like Coordinates, when the object is not written to the disk, therefore not being converted to a list before getting flattened (still being a tuple).
  • Fix the serialization of the Chroma destination connector. Presence of the ChromaCollection object breaks serialization due to TypeError: cannot pickle 'module' object. This removes that object before serialization.
  • Fix fsspec connectors returning version as integer. Connector data source versions should always be string values, however we were using the integer checksum value for the version for fsspec connectors. This casts that value to a string.
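
The serialization fixes above share one idea: a live client that holds a thread lock cannot be pickled, so it is dropped before serialization. The following is a rough sketch of that idea only. `FakeClient`, `FakeDestinationConnector`, and `serializable_state` are hypothetical stand-ins, not the connectors' actual classes or methods.

```python
import pickle
import threading


class FakeClient:
    """Hypothetical stand-in for a database client that owns a thread lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()


class FakeDestinationConnector:
    """Hypothetical connector that caches its client on the instance."""

    def __init__(self, index_name: str) -> None:
        self.index_name = index_name
        self._client = None

    def connect(self) -> None:
        self._client = FakeClient()

    def serializable_state(self) -> dict:
        # Copy the attributes and drop the live client before serializing.
        state = dict(vars(self))
        state.pop("_client", None)
        return state


def can_pickle(obj) -> bool:
    """Return True if obj survives pickle.dumps."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, AttributeError, pickle.PicklingError):
        return False


connector = FakeDestinationConnector("docs")
connector.connect()
raw_ok = can_pickle(vars(connector))                    # state still holds the lock
clean_ok = can_pickle(connector.serializable_state())   # client removed first
print(raw_ok, clean_ok)
```

The client can be recreated after deserialization, so removing it loses no configuration.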

0.12.0

10 Jan 14:48
b37b468

  • Drop support for python3.8. All dependencies are now built off of the minimum version of python being 3.9.

0.11.8

03 Jan 22:44
8e2bfca

Enhancements

  • Add SaaS API User Guide. This documentation serves as a guide for Unstructured SaaS API users to register, receive an API key and URL, and manage their account and billing information.

0.11.7

03 Jan 20:59
91b892c

Enhancements

  • Add intra-chunk overlap capability. Implement overlap for split-chunks where text-splitting is used to divide an oversized chunk into two or more chunks that fit in the chunking window. Note this capability is not yet available from the API but will shortly be made accessible using a new overlap kwarg on partition functions.
  • Update encoders to leverage dataclasses. All encoders now follow a class approach and are annotated with the dataclass decorator. Similar to the connectors, each uses a nested dataclass for the configs required to configure a client as well as a field/property approach to cache the client. This makes sure any variable associated with the class exists as a dataclass field.

Features

  • Add Qdrant destination connector. Adds support for writing documents and embeddings into a Qdrant collection.
  • Store base64 encoded image data in metadata fields. Rather than saving to file, stores base64 encoded data of the image bytes and the mimetype for the image in metadata fields: image_base64 and image_mime_type (if that is what the user specifies by some other param like pdf_extract_to_payload). This would allow the API to have parity with the library.
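
As a rough illustration of the base64 metadata feature above, the sketch below builds the two metadata fields named in the note (`image_base64` and `image_mime_type`). The helper `image_payload_metadata` is a hypothetical name for illustration, and the byte string is a placeholder rather than a real image.

```python
import base64


def image_payload_metadata(image_bytes: bytes, mime_type: str) -> dict:
    """Build metadata fields carrying the image inline instead of a file path."""
    return {
        "image_base64": base64.b64encode(image_bytes).decode("ascii"),
        "image_mime_type": mime_type,
    }


# Placeholder bytes; a real caller would pass the encoded image file contents.
metadata = image_payload_metadata(b"\x89PNG\r\n\x1a\n", "image/png")

# A consumer can recover the original bytes from the metadata alone.
restored = base64.b64decode(metadata["image_base64"])
```

Carrying the image in the payload means a remote caller does not need access to the filesystem where extraction ran.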

Fixes

  • Fix table structure metric script. Update the call to table agent to now provide OCR tokens as required.
  • Fix element extraction not working when using "auto" strategy for pdf and image. If element extraction is specified, the "auto" strategy falls back to the "hi_res" strategy.
  • Fix a bug passing a custom url to partition_via_api. Users that self host the api were not able to pass their custom url to partition_via_api.

0.11.6

20 Dec 21:37
4533bda

Enhancements

  • Update the layout analysis script. The previous script only supported annotating final elements. The updated script also supports annotating inferred and extracted elements.
  • AWS Marketplace API documentation: Added the user guide, including setting up VPC and CloudFormation, to deploy Unstructured API on AWS platform.
  • Azure Marketplace API documentation: Improved the user guide to deploy Azure Marketplace API by adding references to Azure documentation.
  • Integration documentation: Updated URLs for the staging_for bricks

Features

  • Partition emails with base64-encoded text. Automatically handles and decodes base64 encoded text in emails with content type text/plain and text/html.
  • Add Chroma destination connector. Chroma database connector added to ingest CLI. Users may now use unstructured-ingest to write partitioned/embedded data to a Chroma vector database.
  • Add Elasticsearch destination connector. Problem: After ingesting data from a source, users might want to move their data into a destination. Elasticsearch is a popular storage solution for various functionality such as search, or providing intermediary caches within data pipelines. Feature: Added Elasticsearch destination connector to be able to ingest documents from any supported source, embed them and write the embeddings / documents into Elasticsearch.
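
The base64 email handling mentioned above can be sketched with Python's standard `email` package. This is an illustration under stated assumptions, not the library's implementation: `decoded_text` is a hypothetical helper and the message is a made-up single-part example.

```python
import base64
from email import message_from_string

encoded_body = base64.b64encode("Hello from a base64 email".encode("utf-8")).decode("ascii")
raw_email = (
    "MIME-Version: 1.0\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n"
    f"{encoded_body}\n"
)


def decoded_text(raw: str) -> str:
    """Return the body of a single-part message as readable text."""
    message = message_from_string(raw)
    # decode=True reverses the Content-Transfer-Encoding (base64 here).
    payload = message.get_payload(decode=True)
    charset = message.get_content_charset() or "utf-8"
    return payload.decode(charset)


text = decoded_text(raw_email)
print(text)
```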

Fixes

  • Enable --fields argument omission for elasticsearch connector. Solves two bugs: removing the optional parameter --fields broke the connector due to an integer processing error, and using an elasticsearch config for a destination connector resulted in a serialization issue when the optional parameter --fields was not provided.

0.11.5

17 Dec 02:27
9efc22c

Enhancements

Features

Fixes

  • Fix partition_pdf() and partition_image() importation issue. Reorganize pdf.py and image.py modules to be consistent with other types of document import code.

0.11.4

15 Dec 01:12
8ba1bed

Enhancements

  • Refactor image extraction code. The image extraction code is moved from unstructured-inference to unstructured.
  • Refactor pdfminer code. The pdfminer code is moved from unstructured-inference to unstructured.
  • Improve handling of auth data for fsspec connectors. Leverage an extension of the dataclass paradigm to support a sensitive annotation for fields related to auth (i.e. passwords, tokens). Refactor all fsspec connectors to use explicit access configs rather than a generic dictionary.
  • Add glob support for fsspec connectors. Similar to the glob support in the ingest local source connector, glob filters are now enabled on all fsspec-based source connectors to limit the files being partitioned.
  • Define a constant for the splitter "+" used in tesseract ocr languages.
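
To make the glob filtering above concrete, here is a small sketch using the standard library's `fnmatch` module. The helper `filter_by_globs` and the example paths are hypothetical, and the connectors' actual matching rules may differ.

```python
from fnmatch import fnmatchcase


def filter_by_globs(paths: list[str], patterns: list[str]) -> list[str]:
    """Keep only the paths that match at least one glob pattern."""
    if not patterns:
        # No patterns means no filtering.
        return list(paths)
    return [p for p in paths if any(fnmatchcase(p, pat) for pat in patterns)]


remote_paths = ["bucket/reports/q1.pdf", "bucket/notes.txt", "bucket/data/table.csv"]
selected = filter_by_globs(remote_paths, ["*.pdf", "*.csv"])
print(selected)
```

Note that in `fnmatch` the `*` wildcard also matches path separators, so `*.pdf` selects PDF files at any depth.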

Features

  • Save tables in PDFs separately as images. The "table" elements are saved as table-<pageN>-<tableN>.jpg. This filename is presented in the image_path metadata field for the Table element. The default would be to not do this.
  • Add Weaviate destination connector. Weaviate connector added to ingest CLI. Users may now use unstructured-ingest to write partitioned data from over 20 data sources (so far) to a Weaviate object collection.
  • SFTP Source Connector. New source connector added to support downloading/partitioning files from SFTP.

Fixes

  • Fix pdf hi_res partitioning failure when pdfminer fails. Implemented logic to fall back to the "inferred_layout + OCR" if pdfminer fails in the hi_res strategy.
  • Fix a bug where image can be scaled too large for tesseract. Adds a limit to prevent auto-scaling an image beyond the maximum size tesseract can handle for ocr layout detection.
  • Update partition_csv to handle different delimiters. CSV files containing both non-comma delimiters and commas in the data were throwing an error in Pandas. partition_csv now identifies the correct delimiter before the file is processed.
  • Fix partition returning cid code in hi_res. Occasionally pdfminer can fail to decode the text in a PDF file and return cid code as text. Now when this happens the text from OCR is used.
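
The delimiter fix above can be illustrated with the standard library's `csv.Sniffer`. This is a sketch under the assumption that sniffing a text sample is an acceptable way to identify the delimiter; it is not presented as how partition_csv does it, and `detect_delimiter` and `read_rows` are hypothetical helpers.

```python
import csv
import io


def detect_delimiter(sample: str, candidates: str = ",;|\t") -> str:
    """Guess the delimiter from a text sample, restricted to the candidates."""
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    return dialect.delimiter


def read_rows(text: str) -> list[list[str]]:
    """Parse CSV text after identifying its delimiter."""
    delimiter = detect_delimiter(text)
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


# Semicolon-delimited data whose quoted fields also contain commas.
sample = 'name;note\nalice;"a, b"\nbob;"c, d"\n'
rows = read_rows(sample)
print(rows)
```

Identifying the delimiter first keeps the commas inside quoted fields from being treated as column separators.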

0.11.2

30 Nov 04:40
039ae17

Enhancements

  • Updated Documentation: (i) Added examples, and (ii) API Documentation, including Usage, SDKs, Azure Marketplace, and parameters and validation errors.

Features

  • Add Pinecone destination connector. Problem: After ingesting data from a source, users might want to produce embeddings for their data and write these into a vector DB. Pinecone is an option among these vector databases. Feature: Added Pinecone destination connector to be able to ingest documents from any supported source, embed them and write the embeddings / documents into Pinecone.

Fixes

  • Process chunking parameter names in ingest correctly. Solves a bug where chunking parameters weren't being processed and used by the ingest CLI by renaming faulty parameter names and prepends; adds relevant parameters to the ingest Pinecone test to verify that the parameters are functional.

0.11.1

29 Nov 21:48
341f0f4

Enhancements

  • Use pikepdf to repair invalid PDF structure for PDFminer. The repair is attempted when a PSSyntaxError is raised while PDFminer opens the document and creates the PDFminer pages object or processes a single PDF page.

  • Batch Source Connector support. For instances where it is more optimal to read content from a source connector in batches, a new batch ingest doc is added which creates multiple ingest docs after reading them in batches per process.

Features

  • Staging Brick for Coco Format. Staging brick which converts a list of Elements into Coco Format.
  • Adds HubSpot connector. Adds connector to retrieve calls, communications, emails, notes, products and tickets from HubSpot.

Fixes

  • Do not extract text of <style> tags in HTML. <style> tags containing CSS in invalid positions previously contributed to element text. Do not consider text node of a <style> element as textual content.
  • Fix DOCX merged table cell repeats cell text. Only include text for a merged cell, not for each underlying cell spanned by the merge.
  • Fix tables not extracted from DOCX header/footers. Headers and footers in DOCX documents skip tables defined in the header and commonly used for layout/alignment purposes. Extract text from tables as a string and include in the Header and Footer document elements.
  • Fix output filepath for fsspec-based source connectors. Previously the base directory was being included in the output filepath unnecessarily.
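
The `<style>` fix above amounts to ignoring text nodes that sit inside a `<style>` element. The sketch below shows that idea with the standard library's `html.parser`, chosen here only to keep the example self-contained; `VisibleTextParser` and `visible_text` are hypothetical names, not the library's HTML partitioning code.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping anything inside <style> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.texts: list[str] = []
        self._style_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self._style_depth += 1

    def handle_endtag(self, tag):
        if tag == "style" and self._style_depth:
            self._style_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # CSS inside <style> is not textual content, so it is dropped.
        if text and not self._style_depth:
            self.texts.append(text)


def visible_text(html: str) -> list[str]:
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.texts


texts = visible_text("<p>Intro</p><style>p { color: red; }</style><p>Body</p>")
print(texts)
```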