Releases: Unstructured-IO/unstructured
0.11.0
Enhancements
- Add a class for the strategy constants. Add a `PartitionStrategy` class for the strategy constants and use the constants to replace strategy strings.
- Temporary support for paddle `language` parameter. Users can specify a default language code for paddle with the ENV `DEFAULT_PADDLE_LANG` until we have the language mapping for paddle.
- Improve DOCX page-break fidelity. Improve page-break fidelity such that a paragraph containing a page-break is split into two elements, one containing the text before the page-break and the other the text after. Emit the `PageBreak` element between these two and assign the correct page-number (n and n+1 respectively) to the two textual elements.
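The page-break split described in the last enhancement can be sketched in plain Python. `split_at_page_break` is a hypothetical helper for illustration, not the library's actual implementation, and the form-feed marker is a stand-in for the DOCX page-break position:

```python
# Hypothetical sketch: split a paragraph's text at a page-break marker,
# emitting a PageBreak token between the halves and assigning page
# numbers n and n+1 to the text before and after the break.
def split_at_page_break(text: str, marker: str, page_number: int):
    before, _, after = text.partition(marker)
    return [
        (before, page_number),      # text before the break, page n
        ("PageBreak", None),        # break element between the halves
        (after, page_number + 1),   # text after the break, page n+1
    ]

elements = split_at_page_break("end of page.\fstart of next.", "\f", 4)
```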
Features
- Add ad-hoc fields to `ElementMetadata` instance. End-users can now add their own metadata fields simply by assigning to an element-metadata attribute-name of their choice, like `element.metadata.coefficient = 0.58`. These fields round-trip through JSON and can be accessed with dotted notation.
- MongoDB destination connector. New destination connector added to all CLI ingest commands to support writing partitioned JSON output to MongoDB.
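The ad-hoc field mechanism can be illustrated with a minimal stand-in class (a sketch, not the real `ElementMetadata`): dynamically assigned attributes land in the instance `__dict__`, so they survive a JSON round-trip:

```python
import json

class AdHocMetadata:
    """Minimal stand-in for an ElementMetadata-like class that accepts
    arbitrary ad-hoc fields via plain attribute assignment."""

    def to_json(self) -> str:
        # Ad-hoc attributes live in __dict__, so serializing it
        # captures user-defined fields automatically.
        return json.dumps(self.__dict__)

    @classmethod
    def from_json(cls, payload: str) -> "AdHocMetadata":
        meta = cls()
        meta.__dict__.update(json.loads(payload))
        return meta

meta = AdHocMetadata()
meta.coefficient = 0.58          # ad-hoc field, chosen by the end-user
restored = AdHocMetadata.from_json(meta.to_json())
```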
Fixes
- Fix `TYPE_TO_TEXT_ELEMENT_MAP`. Updated the `Figure` mapping from `FigureCaption` to `Image`.
- Handle errors when extracting PDF text. Certain PDFs throw unexpected errors when opened by `pdfminer`, causing `partition_pdf()` to fail. We expect to be able to partition smoothly using an alternative strategy if text extraction doesn't work. Added exception handling to handle unexpected errors when extracting PDF text and to help determine the PDF strategy.
- Fix `fast` strategy falling back to `ocr_only`. The `fast` strategy should not fall back to a more expensive strategy.
- Remove default user `.ssh` folder. The default notebook user during image build would create the `known_hosts` file with incorrect ownership; this is legacy and no longer needed, so it was removed.
- Include `languages` in metadata when partitioning with `strategy=hi_res` or `fast`. User-defined `languages` was previously used for text detection but not included in the resulting element metadata for some strategies. `languages` will now be included in the metadata regardless of partition strategy for PDFs and images.
- Handle a case where paddle returns a `None` list item in `ocr_data`. In partition, while parsing PaddleOCR data, it was assumed that PaddleOCR does not return `None` for any list item in `ocr_data`. Removed the assumption by skipping the text region whenever this happens.
- Fix some PDFs returning `KeyError: 'N'`. Certain PDFs threw this error when opened by `pdfminer`. Added a wrapper function for `pdfminer` that allows these documents to be partitioned.
- Fix mis-splits on `Table` chunks. Remedies repeated appearance of the full `.text_as_html` on the metadata of each `TableChunk` split from a `Table` element too large to fit in the chunking window.
- Import `tables_agent` from inference so that we don't have to initialize a global table agent in unstructured OCR again.
- Fix empty table identified as bulleted-table. A table with no text content was mistakenly identified as a bulleted-table and processed by the wrong branch of the initial HTML partitioner.
- Fix `partition_html()` emitting empty (no text) tables. A table with cells nested below a `<thead>` or `<tfoot>` element was emitted as a table element having no text and unparseable HTML in `element.metadata.text_as_html`. Do not emit empty tables to the element stream.
- Fix HTML `element.metadata.text_as_html` containing spurious `<br>` elements in invalid locations. The HTML generated for the `text_as_html` metadata for HTML tables contained `<br>` elements in invalid locations, like between `<table>` and `<tr>`. Change the HTML generator such that these do not appear.
- Fix HTML table cells enclosed in `<thead>` and `<tfoot>` elements being dropped. HTML table cells nested in a `<thead>` or `<tfoot>` element were not detected, and the text in those cells was omitted from the table element text and `.text_as_html`. Detect table rows regardless of the semantic tag they may be nested in.
- Remove whitespace padding from `.text_as_html`. `tabulate` inserts padding spaces to achieve visual alignment of columns in the HTML tables it generates. Add our own HTML generator to do this simple job and omit that padding, as well as the newlines ("\n") used for human readability.
- Fix local connector with absolute input path. When passed an absolute filepath for the input document path, the local connector incorrectly wrote the output file to the input file directory. The output in this case is now written to `output-dir/input-filename.json`.
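The whitespace-padding fix above amounts to emitting minimal table markup with no alignment spaces or newlines. This sketch shows the idea; it is not the library's actual generator:

```python
def rows_to_html(rows):
    """Render a list of rows (lists of cell strings) as compact HTML
    with no visual-alignment padding and no newlines."""
    body = "".join(
        "<tr>" + "".join(f"<td>{cell}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table>{body}</table>"

html = rows_to_html([["a", "bbb"], ["cc", "d"]])
```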
0.10.30
Enhancements
- Support nested DOCX tables. In DOCX, like HTML, a table cell can itself contain a table. In this case, create nested HTML tables to reflect that structure, and create a plain-text table which captures all the text in nested tables, formatting it as a reasonable facsimile of a table.
- Add connection check to ingest connectors. Each source and destination connector now supports a `check_connection()` method which makes sure a valid connection can be established with the source/destination, given any authentication credentials, in a lightweight request.
Features
- Add functionality to do a second OCR on cropped table images. Changes to the values of the scaling ENVs affect entire-page OCR output (OCR regression), so we now do a second OCR for tables.
- Add ability to pass a timeout for a request when partitioning via a `url`. `partition` now accepts a new optional parameter `request_timeout` which, if set, will prevent any `requests.get` call from hanging indefinitely and instead raise a timeout error. This is useful when partitioning a URL that may be slow to respond or may not respond at all.
Fixes
- Fix logic that determines the PDF auto strategy. Previously, `_determine_pdf_auto_strategy` returned the `hi_res` strategy only if `infer_table_structure` was true. It now returns the `hi_res` strategy if either `infer_table_structure` or `extract_images_in_pdf` is true.
- Fix invalid coordinates when parsing tesseract OCR data. Previously, when parsing tesseract OCR data, the OCR data had invalid bboxes if zoom was set to `0`. A logical check is now added to avoid such errors.
- Fix ingest partition parameters not being passed to the API. When using the `--partition-by-api` flag via unstructured-ingest, none of the partition arguments were forwarded, meaning these options were disregarded. With this change, we now pass through all of the relevant partition arguments to the API. This allows a user to specify all of the same partition arguments they would locally and have them respected when specifying `--partition-by-api`.
- Support tables in section-less DOCX. Generalize the solution for MS Chat Transcripts exported as DOCX by including tables in the partitioned output when present.
- Support tables that contain only numbers when partitioning via `ocr_only`. Tables that contain only numbers are returned as floats in a pandas DataFrame when the image is converted via `.image_to_data()`. An `AttributeError` was raised downstream when trying to `.strip()` the floats.
- Improve DOCX page-break detection. DOCX page breaks are reliably indicated by `w:lastRenderedPageBreak` elements present in the document XML. Page breaks are NOT reliably indicated by "hard" page-breaks inserted by the author and, when present, are redundant to a `w:lastRenderedPageBreak` element, so they cause over-counting if used. Use rendered page-breaks only.
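The numbers-only table fix above comes down to coercing cell values to `str` before calling string methods. A minimal sketch (the real code operates on a pandas DataFrame produced by `image_to_data()`):

```python
def clean_cell(value) -> str:
    """Coerce an OCR cell value to text before stripping; floats from a
    numbers-only table would otherwise raise AttributeError on .strip()."""
    return str(value).strip()

cells = [3.14, " text ", 42]
cleaned = [clean_cell(c) for c in cells]
```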
0.10.29
Enhancements
- Add `include_header` argument for `partition_csv` and `partition_tsv`. Now supports retaining header rows when partitioning CSV and TSV documents into elements.
- Add retry logic for all source connectors. All HTTP calls made by the ingest source connectors have been isolated and wrapped in the `SourceConnectionNetworkError` custom error, which triggers the retry logic, if enabled, in the ingest pipeline.
- Google Drive source connector supports credentials from memory. Originally, the connector expected a filepath to pull the credentials from when creating the client. This was expanded to support passing that information from memory as a dict in case access to the file system is not available.
- Add support for generic partition configs in the ingest CLI. Along with the explicit partition options supported by the CLI, an `additional_partition_args` arg was added to allow users to pass in any other arguments that should be added when calling `partition()`. This helps keep any changes to the input parameters of `partition()` exposed in the CLI.
- Map full output schema for table-based destination connectors. A full schema was introduced to map the type of all output content from the JSON partition output to a flattened table structure, in order to leverage table-based destination connectors. The delta table destination connector was updated to take advantage of this.
- Incorporate multiple embedding model options into ingest; add diff test for embeddings. Problem: the ingest pipeline already supported embedding functionality, but users might want to use different types of embedding providers. Enhancement: extend the ingest pipeline so that users can specify and embed via a particular embedding provider from a range of options. Also adds a diff test to compare the output from an embedding module with the expected output.
Features
- Allow setting table crop parameter. In certain circumstances, adjusting the table crop padding may improve table recognition.
Fixes
- Fixes `partition_text` to prevent empty elements. Adds a check to filter out empty bullets.
- Handle empty string for `ocr_languages` with values for `languages`. Some API users ran into an issue with sending `languages` params because the API defaulted to also using an empty string for `ocr_languages`. This update handles situations where `languages` is defined and `ocr_languages` is an empty string.
- Fix PDF trying to loop through `None`. Previously the PDF annotation extraction tried to loop through `annots` that resolved to `None`. A logical check was added to avoid such errors.
- Fix ingest session handler not being shared correctly. All ingest docs that leverage the session handler should only need to set it once per process. It was being recreated each time because the right values weren't being set nor available, given how dataclasses work in Python.
- Ingest download-only fix. Previously the download-only flag was checked after the doc-factory pipeline step, which occurs before the files are actually downloaded by the source node. This check was moved after the source node to allow the files to be downloaded before exiting the pipeline.
- Fix flaky chunk metadata. The prior implementation was sensitive to element order in the section, resulting in metadata values sometimes being dropped. Also, not all metadata items can be consolidated across multiple elements (e.g. coordinates), so those are now dropped from the consolidated metadata.
- Fix tesseract error `Estimating resolution as X` caused by invalid language parameter input. Proceed with the default language `eng` when `lang.py` fails to find a valid language code for tesseract, so that we don't pass an empty string to the tesseract CLI and raise an exception downstream.
0.10.28
Enhancements
- Add element type CI evaluation workflow. Adds element type frequency evaluation metrics to the current ingest workflow to measure the performance of each file extracted, as well as aggregate-level performance.
- Add table structure evaluation helpers. Adds functions to evaluate the similarity between predicted table structure and actual table structure.
- Use `yolox` by default for table extraction when partitioning pdf/image. The `yolox` model provides higher recall of table regions than the quantized version, and it is now the default element detection model when `infer_table_structure=True` for partitioning pdf/image files.
- Remove pdfminer elements from inside tables. Previously, when using `hi_res`, some elements were extracted using pdfminer too, so we removed pdfminer from the tables pipeline to avoid duplicated elements.
- Fsspec downstream connectors. New destination connectors added to the ingest CLI; users may now use `unstructured-ingest` to write to any of the following:
  - Azure
  - Box
  - Dropbox
  - Google Cloud Service
Features
- Update `ocr_only` strategy in `partition_pdf()`. Adds the functionality to get accurate coordinate data when partitioning PDFs and images with the `ocr_only` strategy.
Fixes
- Fixes issue where tables from markdown documents were being treated as text. Problem: tables from markdown documents were being treated as text and not extracted as tables. Solution: enable the `tables` extension when instantiating the `python-markdown` object. Importance: this allows users to extract structured data from tables in markdown documents.
- Fix wrong logger for paddle info. Replace the logger from unstructured-inference with the logger from unstructured for the `paddle_ocr.py` module.
- Fix ingest pipeline to be able to use chunking and embedding together. Problem: when the ingest pipeline used chunking and embedding together, embedding outputs were empty and the outputs of chunking couldn't be re-read into memory and forwarded to embeddings. Fix: added the `CompositeElement` type to `TYPE_TO_TEXT_ELEMENT_MAP` to be able to process `CompositeElement`s with `unstructured.staging.base.isd_to_elements`.
- Fix unnecessary mid-text chunk-splitting. The "pre-chunker" did not consider the separator blank-line ("\n\n") length when grouping elements for a single chunk. As a result, sections were frequently over-populated, producing an over-sized chunk that required mid-text splitting.
- Fix frequent dissociation of title from chunk. The sectioning algorithm included the title of the next section with the prior section whenever it would fit, frequently associating a section title with the prior section and dissociating it from its actual section. Fix this by combining whole sections only.
- Fix PDF attempt to get dict value from string. Fixes a rare edge case that prevented some PDFs from being partitioned. The `get_uris_from_annots` function tried to access the dictionary value of a string instance variable. Assign `None` to the annotation variable if the instance type is not a dictionary to avoid the erroneous attempt.
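The mid-text chunk-splitting fix above means the pre-chunker must count the "\n\n" separators it will insert when sizing a group. A simplified sketch of that idea (hypothetical helper, not the library's chunker):

```python
SEPARATOR = "\n\n"

def group_texts(texts, max_characters):
    """Greedily group texts into chunks, counting the separator length
    that joining will add so chunks don't exceed max_characters."""
    groups, current, length = [], [], 0
    for text in texts:
        # Joining adds a separator only when the group is non-empty.
        extra = len(text) + (len(SEPARATOR) if current else 0)
        if current and length + extra > max_characters:
            groups.append(SEPARATOR.join(current))
            current, length = [], 0
            extra = len(text)
        current.append(text)
        length += extra
    if current:
        groups.append(SEPARATOR.join(current))
    return groups

chunks = group_texts(["aaaa", "bbbb", "cc"], max_characters=10)
```

Ignoring the 2-character separator would have squeezed all three texts (4 + 4 + 2 characters) into one "10-character" group that actually joins to 14 characters, forcing a mid-text split.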
0.10.27
Enhancements
- Leverage dict to share content across ingest pipeline. To share the ingest doc content across steps in the ingest pipeline, this was updated to use a multiprocessing-safe dictionary so changes get persisted and each step has the option to modify the ingest docs in place.
Features
Fixes
- Removed `ebooklib` as a dependency. `ebooklib` is licensed under AGPL3, which is incompatible with the Apache 2.0 license; thus it is being removed.
- Caching fixes in ingest pipeline. Previously, steps like the source node were not leveraging parameters such as `re_download` to dictate whether files should be forced to redownload rather than using what might already exist locally.
0.10.26
Enhancements
- Add CI evaluation workflow. Adds evaluation metrics to the current ingest workflow to measure the performance of each file extracted, as well as aggregate-level performance.
Features
- Functionality to catch and classify overlapping/nested elements. Method to identify overlapping-bbox cases within detected elements in a document. It returns two values: a boolean defining whether overlapping elements are present, and a list reporting them with relevant metadata. The output includes information about the `overlapping_elements`, `overlapping_case`, `overlapping_percentage`, `largest_ngram_percentage`, `overlap_percentage_total`, `max_area`, `min_area`, and `total_area`.
- Add local connector source metadata. Python's `os` module is used to pull stats from the local file when processing via the local connector, populating fields such as last-modified time and created time.
Fixes
- Fixes elements partitioned from an image file missing certain metadata. Metadata for image files, like file type, was handled differently from other file types. This caused a bug where other metadata, like the file name, was missed. This change brings metadata handling for image files in line with the handling for other file types so that file name and other metadata fields are captured.
- Adds `typing-extensions` as an explicit dependency. This package is an implicit dependency, but the module is imported directly in `unstructured.documents.elements`, so the dependency should be explicit in case changes in other dependencies lead to `typing-extensions` being dropped as a dependency.
- Stop passing `extract_tables` to `unstructured-inference` since it is now supported in `unstructured` instead. Table extraction previously occurred in `unstructured-inference`, but that logic, except for the table model itself, is now part of the `unstructured` library. Thus the parameter triggering table extraction is no longer passed to the `unstructured-inference` package. Also noted the table output regression for PDF files.
- Fix a bug in table partitioning. Previously the `skip_infer_table_types` variable used in `partition` was not passed down to the specific file partitioners. Now you can use the `skip_infer_table_types` list variable when calling `partition` to specify the filetypes for which you want to skip table extraction, or the `infer_table_structure` boolean variable on the file-specific partitioning function.
- Fix partition of DOCX without sections. Some DOCX files, like those from Teams output, do not contain sections, and the code assumed all components are in sections, so it produced no results. Now, if no sections are detected in a document, we iterate through the paragraphs and return the contents found there.
- Fix out-of-order sequencing of split chunks. Fixes behavior where "split" chunks were inserted at the beginning of the chunk sequence. This would produce a chunk sequence like [5a, 5b, 3a, 3b, 1, 2, 4] when sections 3 and 5 exceeded `max_characters`.
- Fix deserialization of ingest docs. When ingest docs were deserialized as part of the ingest pipeline process (CLI), certain fields weren't persisted (metadata and date processed). The `from_dict` method was updated to take these into account, and a unit test was added to check.
- Map source CLI command configs when destination set. Due to how the source connector is dynamically called when the destination connector is set via the CLI, the configs were being set incorrectly, causing the source connector to break. The configs were fixed and updated to take into account fsspec-specific connectors.
0.10.25
Enhancements
- Duplicate CLI param check. Given that many of the options associated with the `Click`-based CLI ingest commands are added dynamically from a number of configs, a check was incorporated to make sure there are no duplicate entries, preventing new configs from overwriting already-added options.
Features
- Table OCR refactor. Support table OCR with pre-computed OCR data to ensure we only do one OCR pass for the entire document. Users can specify the OCR agent (tesseract/paddle) in the environment variable `OCR_AGENT` for OCRing the entire document.
- Adds accuracy function. The accuracy scoring was originally an option under `calculate_edit_distance`. For easy function call, it is now a wrapper around the original function that calls edit distance and returns the result as "score".
- Adds `HuggingFaceEmbeddingEncoder`. The HuggingFace embedding encoder uses a local embedding model, as opposed to using an API.
- Add AWS Bedrock embedding connector. `unstructured.embed.bedrock` now provides a connector to use AWS Bedrock's `titan-embed-text` model to generate embeddings for elements. This feature requires a valid AWS Bedrock setup and an internet connection to run.
Fixes
- Import `PDFResourceManager` more directly. We were importing `PDFResourceManager` from `pdfminer.converter`, which caused an error for some users. We changed to import from the actual location of `PDFResourceManager`, which is `pdfminer.pdfinterp`.
- Fix language detection of elements with empty strings. This resolves a warning message raised by `langdetect` when language detection was attempted on an empty string. Language detection is now skipped for empty strings.
- Fix chunks breaking on regex-metadata matches. Fixes "over-chunking" when `regex_metadata` was used, where every element that contained a regex-match would start a new chunk.
- Fix regex-metadata match offsets not adjusted within chunk. Fixes incorrect regex-metadata match start/stop offsets in chunks where multiple elements are combined.
- Map source CLI command configs when destination set. Due to how the source connector is dynamically called when the destination connector is set via the CLI, the configs were being set incorrectly, causing the source connector to break. The configs were fixed and updated to take into account fsspec-specific connectors.
- Fix metrics folder not discoverable. Fixes issue where the `unstructured/metrics` folder is not discoverable on PyPI by adding an `__init__.py` file under the folder.
- Fix a bug when `partition_pdf` gets `model_name=None`. In API usage the `model_name` value is `None`, and the `cast` function in `partition_pdf` would return `None`, leading to an attribute error. Now we use the `str` function to explicitly convert the content to a string so it is guaranteed to have `startswith` and other string functions as attributes.
- Fix HTML partition failing on tables without a `tbody` tag. HTML tables may sometimes contain only headers without a body (`tbody` tag).
- Fix out-of-order sequencing of split chunks. Fixes behavior where "split" chunks were inserted at the beginning of the chunk sequence. This would produce a chunk sequence like [5a, 5b, 3a, 3b, 1, 2, 4] when sections 3 and 5 exceeded `max_characters`.
0.10.24
Enhancements
- Improve natural reading order. Some OCR elements with only spaces in the text have full-page width in the bounding box, which causes the `xycut` sorting to not work as expected. The logic to parse OCR results now removes any elements containing only spaces (more than one space).
- Ingest compression utilities and fsspec connector support. Generic utility code added to handle files pulled from a source connector that are either tar- or zip-compressed, uncompressing them locally; the result is then processed using a local source connector. Currently this functionality has been incorporated into the fsspec connector and all those inheriting from it (currently: Azure Blob Storage, Google Cloud Storage, S3, Box, and Dropbox).
- Ingest destination connectors support writing a raw list of elements. Along with the default write method used in the ingest pipeline to write the JSON content associated with the ingest docs, each destination connector can now also write a raw list of elements to the desired downstream location without having an ingest doc associated with it.
Features
- Adds element type percent match function. In order to evaluate the element types extracted, we add a function that calculates the matched percentage between two frequency dictionaries.
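A frequency-dictionary match percentage like the one described might look like this sketch; the actual metric's exact definition may differ:

```python
def percent_match(expected: dict, predicted: dict) -> float:
    """Fraction of expected element-type occurrences also present in
    predicted, counting per-type overlap (a sketch of the idea)."""
    total = sum(expected.values())
    if total == 0:
        return 0.0
    matched = sum(
        min(count, predicted.get(kind, 0)) for kind, count in expected.items()
    )
    return matched / total

score = percent_match({"Title": 2, "Text": 8}, {"Title": 2, "Text": 6})
```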
Fixes
- Fix paddle model file not discoverable. Fixes issue where the `ocr_models/paddle_ocr.py` file is not discoverable on PyPI by adding an `__init__.py` file under the folder.
- Chipper v2 fixes. Includes a fix for a memory leak and a rare last-element bbox fix. (unstructured-inference==0.7.7)
- Fix image resizing issue. Includes a fix related to resizing images in the tables pipeline. (unstructured-inference==0.7.6)
0.10.23
Enhancements
- Add functionality to limit precision when serializing to JSON. Precision for `points` is limited to 1 decimal place if `coordinates["system"] == "PixelSpace"` (otherwise 2 decimal places). Precision for `detection_class_prob` is limited to 5 decimal places.
- Fix CSV file detection logic when mime-type is text/plain. Previously the logic to detect the CSV file type compared only the first row's comma count with the header row's comma count, and since both were the same line, the result was always true. The logic now considers the comma count for all lines except the first and compares them with the header row's comma count.
- Improved inference speed for Chipper V2. API requests with `hi_res_model_name=chipper` now have ~2-3x faster responses.
Features
Fixes
- Cleans up temporary files after conversion. Previously a file conversion utility was leaving temporary files behind on the filesystem without removing them when no longer needed. This fix helps prevent an accumulation of temporary files taking up excessive disk space.
- Fixes `under_non_alpha_ratio` dividing by zero. Although this function guarded against a specific cause of division by zero, edge cases such as whitespace-only strings were slipping through. This update more generally prevents the function from performing a division by zero.
- Fix languages default. Previously the default language was set to English when elements didn't have text or when langdetect could not detect the language. It now defaults to None so there is no misleading information about the detected language.
- Fixes recursion limit error raised when partitioning Excel documents of a certain size. Previously we used a recursive method to find subtables within an Excel sheet. However, this would run afoul of Python's recursion depth limit when there was a contiguous block of more than 1000 cells within a sheet. This function has been updated to use the NetworkX library, which avoids Python recursion issues.
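The division-by-zero guard in the second fix can be sketched as follows; this is a simplified stand-in for `under_non_alpha_ratio`, whose actual definition may differ in detail:

```python
def non_alpha_ratio(text: str) -> float:
    """Share of non-whitespace characters that are not alphabetic.
    Returns 0.0 instead of dividing by zero for empty or
    whitespace-only input."""
    chars = [c for c in text if not c.isspace()]
    if not chars:              # guards "" and "   " alike
        return 0.0
    non_alpha = sum(1 for c in chars if not c.isalpha())
    return non_alpha / len(chars)

ratio = non_alpha_ratio("   ")
```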
0.10.22
Enhancements
- Bump `unstructured-inference` to `0.7.3`. The updated version of `unstructured-inference` supports a new version of the Chipper model, as well as a cleaner schema for its output classes. Support is included for new inference features such as hierarchy and ordering.
- Expose `skip_infer_table_types` in the ingest CLI. For each connector a new `--skip-infer-table-types` parameter was added, mapping to the `skip_infer_table_types` partition argument. This gives more granular control to unstructured-ingest users, allowing them to specify the file types for which we should attempt table extraction.
- Add flag to ingest CLI to raise an error if any single doc fails in the pipeline. Currently if a single doc fails in the pipeline, the whole thing halts due to the error. This flag defaults to logging an error and continuing with the docs it can process.
- Emit hyperlink metadata for the DOCX file type. The DOCX partitioner now adds `metadata.links`, `metadata.link_texts`, and `metadata.link_urls` for elements that contain a hyperlink pointing to an external resource. So-called "jump" links pointing to document-internal locations (such as those found in a table of contents "jumping" to a chapter or section) are excluded.
Features
- Add `elements_to_text` as a staging helper function. In order to get a single clean text output from unstructured for metric calculations, automate the process of extracting text from elements using this function.
- Adds permissions (RBAC) data ingestion functionality for the SharePoint connector. Problem: role-based access control is an important component of many data storage systems, and users may need to pass permissions (RBAC) data to downstream systems when ingesting data. Feature: added permissions data ingestion functionality to the SharePoint connector.
Fixes
- Fixes PDF list parsing creating duplicate list items. Previously a bug in PDF list-item parsing caused removal of other elements and duplication of the list item.
- Fixes duplicated elements. Fixes issue where elements are duplicated when embeddings are generated. This allows users to generate embeddings for their list of elements without duplicating/breaking the original content.
- Fixes failure when flagging for embeddings through unstructured-ingest. Previously, adding the embedding parameter to any connector resulted in a failure at the copy stage. This is resolved by adding the `IngestDoc` to the context map in the embedding node's `run` method, allowing users to specify that connectors fetch embeddings without failure.
- Fix ingest pipeline reformat nodes not discoverable. Fixes issue where reformat nodes raise `ModuleNotFoundError` on import. The directory was missing an `__init__.py` file needed to make it discoverable.
- Fix default language in ingest CLI. Previously the default was set to English, which injected potentially incorrect information into downstream language-detection libraries. Setting the default to None allows those libraries to better detect the language of the doc being processed.