Releases: Unstructured-IO/unstructured

0.4.6

03 Feb 22:15
014585e

  • Loosen the default cap threshold to 0.5.
  • Add a UNSTRUCTURED_NARRATIVE_TEXT_CAP_THRESHOLD environment variable for controlling
    the cap ratio threshold.
  • Unknown text elements are identified as Text for HTML and plain text documents.
  • Body Text styles no longer default to NarrativeText for Word documents. The style information
    is insufficient to determine that the text is narrative.
  • Uppercase text is lowercased before checking for verbs. This helps avoid some missed verbs.
  • Adds an Address element for capturing elements that only contain an address.
  • Suppress the UserWarning when detectron is called.
  • Checks that titles and narrative text have at least one English word.
  • Checks that titles and narrative text are at least 50% alpha characters.
  • Restricts titles to a maximum word length. Adds a UNSTRUCTURED_TITLE_MAX_WORD_LENGTH
    environment variable for controlling the max number of words in a title.
  • Updated partition_pptx to order the elements on the page
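
The text checks in this release can be illustrated with a minimal sketch. It is not the library's implementation: the helper names, the way capitalized words are counted, and the fallback of 12 words for titles are assumptions made for illustration. Only the two environment variable names, the 0.5 cap threshold, and the 50% alpha requirement come from the notes above.

```python
import os


def _env_float(name: str, default: float) -> float:
    """Read a float from the environment, falling back to a default."""
    return float(os.environ.get(name, default))


def exceeds_cap_ratio_sketch(text: str) -> bool:
    """True when the share of capitalized words is above the threshold."""
    threshold = _env_float("UNSTRUCTURED_NARRATIVE_TEXT_CAP_THRESHOLD", 0.5)
    # Assumption: only words that start with a letter are counted.
    words = [w for w in text.split() if w[0].isalpha()]
    if not words:  # empty or non-alphabetic text never exceeds the cap
        return False
    capitalized = sum(1 for w in words if w[0].isupper())
    return capitalized / len(words) > threshold


def alpha_ratio_ok(text: str, minimum: float = 0.5) -> bool:
    """True when at least `minimum` of the non-space characters are letters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    return sum(c.isalpha() for c in chars) / len(chars) >= minimum


def title_length_ok(text: str) -> bool:
    """True when the text has no more words than the configured maximum."""
    # 12 is a placeholder default chosen for this sketch.
    max_words = int(os.environ.get("UNSTRUCTURED_TITLE_MAX_WORD_LENGTH", 12))
    return len(text.split()) <= max_words
```

Setting either environment variable before calling the helpers changes the cutoff without touching the code, which is the point of exposing the thresholds this way.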

0.4.4

25 Jan 17:01
1ce8447

  • Updated partition_pdf and partition_image to return unstructured Element objects
  • Fixed the healthcheck url path when partitioning images and PDFs via API
  • Adds an optional coordinates attribute to document objects
  • Adds FigureCaption and CheckBox document elements
  • Added ability to split lists detected in LayoutElement objects
  • Adds partition_pptx for partitioning PowerPoint documents
  • LayoutParser models now download from Hugging Face Hub instead of Dropbox
  • Fixed file type detection for XML and HTML files on Amazon Linux

0.4.3

18 Jan 17:31
59f972d

  • Adds requests as a base dependency
  • Fix in exceeds_cap_ratio so the function doesn't break with empty text
  • Fix bug in _parse_received_data.
  • Update detect_filetype to properly handle .doc, .xls, and .ppt.

0.4.2

17 Jan 16:36
9c3c14e

  • Added partition_image to process documents in an image format.
  • Fixed utf-8 encoding error in partition_email with attachments for text/html

0.4.1

13 Jan 22:23
419c086

  • Added support for text files in the partition function
  • Pinned opencv-python for easier installation on Linux

0.4.0

11 Jan 18:05
eba4c80

  • Added generic partition brick that detects the file type and routes a file to the appropriate
    partitioning brick.
  • Added a file type detection module.
  • Updated partition_html and partition_eml to support file-like objects in 'rb' mode.
  • Added clean_ordered_bullets, a cleaning brick for removing ordered bullets.
  • Added extract_ordered_bullets, an extract brick method for ordered bullets.
  • Added a test for clean_ordered_bullets.
  • Added a test for extract_ordered_bullets.
  • Added partition_docx for pre-processing Word Documents.
  • Added new regex patterns to extract email header information
  • Added new functions to extract header information: parse_received_data and partition_header
  • Added new function partition_text to parse plain text files
  • Added new cleaner functions: extract_ip_address, extract_ip_address_name, extract_mapi_id, extract_datetimetz
  • Added a new Image element and a function, find_embedded_images, to find embedded images
  • Added get_directory_file_info for summarizing information about source documents

0.3.5

05 Jan 00:50
a75499d

  • Add support for local inference
  • Add new pattern to recognize plain text dash bullets
  • Add test for bullet patterns
  • Fix for partition_html that allows for processing div tags that have both text and child elements
  • Add ability to extract document metadata from .docx, .xlsx, and .jpg files.
  • Helper functions for identifying and extracting phone numbers
  • Add new function extract_attachment_info that extracts and decodes the attachments of an email.
  • Staging brick to convert a list of Elements to a pandas dataframe.
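
As a rough illustration of the phone number helpers, the sketch below identifies and extracts a North American number with a regular expression. The pattern and both function names are assumptions for this example; the library's own pattern and helper names may differ.

```python
import re
from typing import Optional

# Assumed pattern for numbers such as "215-867-5309" or
# "+1 (215) 867-5309"; other national formats are not covered.
US_PHONE_PATTERN = re.compile(
    r"(?:\+?1[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}"
)


def contains_phone_number_sketch(text: str) -> bool:
    """True when the text contains something shaped like a phone number."""
    return US_PHONE_PATTERN.search(text) is not None


def extract_phone_number_sketch(text: str) -> Optional[str]:
    """Return the first phone-number-shaped substring, or None."""
    match = US_PHONE_PATTERN.search(text)
    return match.group(0) if match else None
```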

0.3.4

21 Dec 15:29
962c9dc

  • Python 3.7 compatibility

0.3.3

20 Dec 20:03
de4d0d4

  • Removes BasicConfig from logger configuration
  • Adds the partition_email partitioning brick
  • Adds the replace_mime_encodings cleaning brick
  • Small fix to HTML parsing related to processing list items with sub-tags

0.3.2

15 Dec 22:20
1d68bb2

  • Added translate_text brick for translating text between languages
  • Add an apply method to make it easier to apply cleaners to elements
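
The convenience of an apply method can be shown with a small stand-in class. `TextSketch` is hypothetical and only mirrors the idea described above: pass one or more cleaner functions, and the element's text is updated in place. The type check at the end is an assumption about sensible behavior, not a documented guarantee.

```python
from typing import Callable


class TextSketch:
    """Hypothetical stand-in for a text element with an apply method."""

    def __init__(self, text: str) -> None:
        self.text = text

    def apply(self, *cleaners: Callable[[str], str]) -> None:
        """Run each cleaner in order and store the result on the element."""
        cleaned = self.text
        for cleaner in cleaners:
            cleaned = cleaner(cleaned)
        if not isinstance(cleaned, str):
            raise ValueError("Cleaners must return a string.")
        self.text = cleaned


element = TextSketch("  RISK FACTORS  ")
element.apply(str.strip, str.title)
print(element.text)  # Risk Factors
```

Because cleaners are plain functions from string to string, they can be chained in any order without the caller reassigning the text by hand.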