Releases: Unstructured-IO/unstructured
0.4.6
- Loosen the default cap threshold to 0.5.
- Add a UNSTRUCTURED_NARRATIVE_TEXT_CAP_THRESHOLD environment variable for controlling the cap ratio threshold.
- Unknown text elements are identified as Text for HTML and plain text documents.
- Body Text styles no longer default to NarrativeText for Word documents. The style information is insufficient to determine that the text is narrative.
- Upper cased text is lower cased before checking for verbs. This helps avoid some missed verbs.
- Adds an Address element for capturing elements that only contain an address.
- Suppress the UserWarning when detectron is called.
- Checks that titles and narrative text have at least one English word.
- Checks that titles and narrative text are at least 50% alpha characters.
- Restricts titles to a maximum word length. Adds a UNSTRUCTURED_TITLE_MAX_WORD_LENGTH environment variable for controlling the max number of words in a title.
- Updated partition_pptx to order the elements on the page.
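
The threshold checks in this release can be illustrated with a minimal sketch. This is not the library's implementation: the function names cap_ratio_exceeded, title_too_long, and alpha_ratio are hypothetical, the word-splitting rules are simplified, and the fallback of 12 words for the title limit is an assumption. Only the two environment variable names and the 0.5 default come from the notes above.

```python
import os
from typing import Optional


def cap_ratio_exceeded(text: str, threshold: Optional[float] = None) -> bool:
    """Return True if the share of capitalized words is above the threshold."""
    if threshold is None:
        # Variable name and 0.5 default are taken from the release notes.
        threshold = float(
            os.environ.get("UNSTRUCTURED_NARRATIVE_TEXT_CAP_THRESHOLD", "0.5")
        )
    words = [w for w in text.split() if w.isalpha()]
    if not words:
        # Empty or non-alphabetic text never exceeds the cap.
        return False
    capitalized = sum(1 for w in words if w[0].isupper())
    return capitalized / len(words) > threshold


def title_too_long(text: str, max_words: Optional[int] = None) -> bool:
    """Return True if the text has more words than the configured maximum."""
    if max_words is None:
        # The fallback value of 12 is an assumption made for this sketch.
        max_words = int(os.environ.get("UNSTRUCTURED_TITLE_MAX_WORD_LENGTH", "12"))
    return len(text.split()) > max_words


def alpha_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are alphabetic."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if c.isalpha()) / len(chars)
```

Under these assumptions, a heading-like string such as "The Quick Brown Fox" exceeds a 0.5 cap threshold and would not be treated as narrative text, while a string whose alpha_ratio falls below 0.5 would be rejected as a title or narrative text.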
0.4.4
- Updated partition_pdf and partition_image to return unstructured Element objects
- Fixed the healthcheck url path when partitioning images and PDFs via API
- Adds an optional coordinates attribute to document objects
- Adds FigureCaption and CheckBox document elements
- Added ability to split lists detected in LayoutElement objects
- Adds partition_pptx for partitioning PowerPoint documents
- LayoutParser models now download from Hugging Face Hub instead of DropBox
- Fixed file type detection for XML and HTML files on Amazon Linux
0.4.3
- Adds requests as a base dependency
- Fix in exceeds_cap_ratio so the function doesn't break with empty text
- Fix bug in _parse_received_data.
- Update detect_filetype to properly handle .doc, .xls, and .ppt.
0.4.2
- Added partition_image to process documents in an image format.
- Fixed utf-8 encoding error in partition_email with attachments for text/html
0.4.1
- Added support for text files in the partition function
- Pinned opencv-python for easier installation on Linux
0.4.0
- Added generic partition brick that detects the file type and routes a file to the appropriate partitioning brick.
- Added a file type detection module.
- Updated partition_html and partition_eml to support file-like objects in 'rb' mode.
- Cleaning brick for removing ordered bullets: clean_ordered_bullets.
- Extract brick method for ordered bullets: extract_ordered_bullets.
- Test for clean_ordered_bullets.
- Test for extract_ordered_bullets.
- Added partition_docx for pre-processing Word Documents.
- Added new REGEX patterns to extract email header information
- Added new functions to extract header information: parse_received_data and partition_header
- Added new function to parse plain text files: partition_text
- Added new cleaners functions: extract_ip_address, extract_ip_address_name, extract_mapi_id, extract_datetimetz
- Add new Image element and function to find embedded images: find_embedded_images
- Added get_directory_file_info for summarizing information about source documents
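
The routing idea behind the generic partition brick can be sketched as a dispatch table. The sketch below is illustrative only and does not reproduce the library's code: every name in it (partition_sketch, partition_text_sketch, partition_html_sketch, ROUTES) is hypothetical, and it picks a handler from the file extension, which is a simplifying assumption and may differ from how the library's file type detection module works.

```python
import re
from pathlib import Path
from typing import Callable, Dict, List


def partition_text_sketch(text: str) -> List[str]:
    """Split plain text into non-empty blocks separated by blank lines."""
    return [block.strip() for block in text.split("\n\n") if block.strip()]


def partition_html_sketch(text: str) -> List[str]:
    """Very rough HTML handling: replace tags with block breaks, then split."""
    without_tags = re.sub(r"<[^>]+>", "\n\n", text)
    return partition_text_sketch(without_tags)


# Hypothetical routing table: file extension -> partitioning function.
ROUTES: Dict[str, Callable[[str], List[str]]] = {
    ".txt": partition_text_sketch,
    ".html": partition_html_sketch,
}


def partition_sketch(filename: str, text: str) -> List[str]:
    """Detect the file type (here: by extension only) and route to a handler."""
    suffix = Path(filename).suffix.lower()
    handler = ROUTES.get(suffix)
    if handler is None:
        raise ValueError(f"Unsupported file type: {suffix!r}")
    return handler(text)
```

Keeping the detection step separate from the handlers means a new format only needs a new entry in the routing table, which is presumably the motivation for a single entry point.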
0.3.5
- Add support for local inference
- Add new pattern to recognize plain text dash bullets
- Add test for bullet patterns
- Fix for partition_html that allows for processing div tags that have both text and child elements
- Add ability to extract document metadata from .docx, .xlsx, and .jpg files.
- Helper functions for identifying and extracting phone numbers
- Add new function extract_attachment_info that extracts and decodes the attachments of an email.
- Staging brick to convert a list of Elements to a pandas dataframe.
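
As a rough illustration of recognizing plain text dash bullets, the sketch below uses a single regular expression. The pattern and the helper names is_dash_bullet and strip_dash_bullet are assumptions made for this example, not the pattern added to the library.

```python
import re

# Hypothetical pattern: optional indentation, a hyphen, then at least one space.
DASH_BULLET = re.compile(r"^\s*-\s+")


def is_dash_bullet(line: str) -> bool:
    """Return True if the line starts with a dash bullet marker."""
    return bool(DASH_BULLET.match(line))


def strip_dash_bullet(line: str) -> str:
    """Remove a leading dash bullet marker, leaving other lines unchanged."""
    return DASH_BULLET.sub("", line, count=1)
```

Requiring whitespace after the hyphen keeps strings such as "-5 degrees" or "well-known" from being treated as bullets.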
0.3.4
- Python 3.7 compatibility
0.3.3
- Removes BasicConfig from logger configuration
- Adds the partition_email partitioning brick
- Adds the replace_mime_encodings cleaning brick
- Small fix to HTML parsing related to processing list items with sub-tags
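
A cleaner for MIME encodings in email text might look like the following sketch, which decodes quoted-printable sequences with Python's standard quopri module. The function name replace_mime_encodings_sketch and the UTF-8 default are assumptions for this example; the library's brick may behave differently.

```python
import quopri


def replace_mime_encodings_sketch(text: str, encoding: str = "utf-8") -> str:
    """Decode quoted-printable escapes such as =E2=80=99 into characters."""
    raw = text.encode(encoding)
    # quopri.decodestring takes bytes and returns the decoded bytes.
    return quopri.decodestring(raw).decode(encoding)
```

For example, the input "5 w=E2=80=99s" would come back with the three escaped bytes replaced by a single right single quotation mark.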
0.3.2
- Added translate_text brick for translating text between languages
- Add an apply method to make it easier to apply cleaners to elements
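
One way an apply method can chain cleaners over an element's text is sketched below. The class TextElementSketch is hypothetical and stands in for the library's element classes; the assumption is that each cleaner is a function from string to string and that the element's text is updated in place.

```python
from typing import Callable


class TextElementSketch:
    """Hypothetical stand-in for a document element that holds text."""

    def __init__(self, text: str) -> None:
        self.text = text

    def apply(self, *cleaners: Callable[[str], str]) -> None:
        """Run each cleaner in order and store the final result on the element."""
        cleaned = self.text
        for cleaner in cleaners:
            cleaned = cleaner(cleaned)
        if not isinstance(cleaned, str):
            raise ValueError("Cleaner produced a non-string output.")
        self.text = cleaned


element = TextElementSketch("  Hello World  ")
element.apply(str.strip, str.lower)  # cleaners run left to right
```

Passing cleaners as plain callables keeps them reusable outside of elements and lets callers compose them in any order.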