Releases: Unstructured-IO/unstructured
0.4.16
Enhancements
- Fallback to using file extensions for filetype detection if `libmagic` is not present
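The fallback described above can be pictured roughly as follows. This is a minimal sketch, not the library's implementation: the helper names `filetype_from_extension` and `guess_filetype` are hypothetical, and it assumes the `python-magic` package (imported as `magic`) is the libmagic binding in use.

```python
import mimetypes
from typing import Optional

try:
    # Assumption: python-magic is the binding; its import raises ImportError
    # when the package or the underlying libmagic library is missing.
    import magic
    HAS_LIBMAGIC = True
except ImportError:
    HAS_LIBMAGIC = False


def filetype_from_extension(filename: str) -> Optional[str]:
    """Hypothetical helper: guess a MIME type from the file extension only."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type


def guess_filetype(filename: str) -> Optional[str]:
    """Hypothetical helper: prefer libmagic, fall back to the extension."""
    if HAS_LIBMAGIC:
        # Content-based detection: inspects the bytes of the file on disk.
        return magic.from_file(filename, mime=True)
    # Extension-based fallback: never opens the file.
    return filetype_from_extension(filename)
```

The extension path is less reliable than content sniffing (a misnamed file is misdetected), which is presumably why it is only a fallback.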
Features
- Added setup script for Ubuntu
- Added GitHub connector for ingest cli.
- Added `partition_md` partitioner.
- Added Reddit connector for ingest cli.
Fixes
- Initializes connector properly in ingest.main::MainProcess
- Restricts version of unstructured-inference to avoid multithreading issue
0.4.15
Enhancements
- Added `elements_to_json` and `elements_from_json` for easier serialization/deserialization
- `convert_to_dict`, `dict_to_elements` and `convert_to_csv` are now aliases for functions that use the ISD terminology.
Fixes
- Update to ensure all elements are preserved during serialization/deserialization
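To illustrate what "all elements are preserved" means for a serialization round trip, here is a self-contained sketch. It does not call the library: `SketchElement`, `to_json`, and `from_json` are hypothetical stand-ins, and the real `elements_to_json` / `elements_from_json` signatures are not given in these notes.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class SketchElement:
    """Hypothetical stand-in for a document element."""
    type: str
    text: str


def to_json(elements: List[SketchElement]) -> str:
    """Serialize every element to a JSON array of dictionaries."""
    return json.dumps([asdict(element) for element in elements], indent=2)


def from_json(payload: str) -> List[SketchElement]:
    """Rebuild the element list from the JSON produced by to_json."""
    return [SketchElement(**item) for item in json.loads(payload)]


elements = [
    SketchElement(type="Title", text="Release notes"),
    SketchElement(type="NarrativeText", text="Serialization now round-trips."),
]
# A lossless round trip returns a list equal to the original.
restored = from_json(to_json(elements))
```

Comparing `restored` with `elements` is the kind of check that would catch elements being dropped during serialization.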
0.4.14
- Automatically install `nltk` models in the `tokenize` module.
0.4.13
0.4.12
0.4.11
- Adds `partition_doc` for partitioning Word documents in `.doc` format. Requires `libreoffice`.
- Adds `partition_ppt` for partitioning PowerPoint documents in `.ppt` format. Requires `libreoffice`.
0.4.10
- Fixes `ElementMetadata` so that it's JSON serializable when the filename is a `Path` object.
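The underlying problem is that the standard `json` module cannot encode `pathlib.Path` objects. A minimal sketch of the usual remedy follows, assuming the fix converts the path to a string before serialization; `metadata_to_dict` is a hypothetical helper, not the library's code.

```python
import json
from pathlib import Path
from typing import Optional, Union


def metadata_to_dict(filename: Optional[Union[str, Path]]) -> dict:
    """Hypothetical helper: build a JSON-safe metadata dictionary."""
    # json.dumps raises TypeError on Path objects, so store a plain string.
    if isinstance(filename, Path):
        filename = str(filename)
    return {"filename": filename}


serialized = json.dumps(metadata_to_dict(Path("report.pdf")))
print(serialized)
```

Passing the `Path` straight to `json.dumps` inside a dictionary would raise `TypeError` instead.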
0.4.9
- Added ingest modules and s3 connector
- Default to `url=None` for `partition_pdf` and `partition_image`
- Add ability to skip English specific check by setting the `UNSTRUCTURED_LANGUAGE` env var to `""`.
- Document `Element` objects now track metadata
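For the `UNSTRUCTURED_LANGUAGE` item above, the variable can be set from Python before any partitioning code runs. The sketch below only shows the environment handling; `english_check_enabled` is a hypothetical helper, and treating an unset variable as "check enabled" is an assumption, not something these notes state.

```python
import os

# Per the note above, an empty string skips the English-specific check.
os.environ["UNSTRUCTURED_LANGUAGE"] = ""


def english_check_enabled() -> bool:
    """Hypothetical helper: report whether the English-specific check would run.

    Assumption: an unset variable leaves the check enabled.
    """
    return os.environ.get("UNSTRUCTURED_LANGUAGE") != ""
```

Setting the variable in the shell before launching the process has the same effect as setting it via `os.environ`.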
0.4.8
- Modified XML and HTML parsers not to load comments.
0.4.7
- Added the ability to pull an HTML document from a url in `partition_html`.
- Added the ability to get file summary info from lists of filenames and lists of file contents.
- Added optional page break to `partition` for `.pptx`, `.pdf`, images, and `.html` files.
- Added `to_dict` method to document elements.
- Include more unicode quotes in `replace_unicode_quotes`.
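As a generic illustration of the quote handling mentioned in the last item, the sketch below maps common curly quotes to their ASCII equivalents. It does not reproduce `replace_unicode_quotes`, whose exact mapping is not listed in these notes; `normalize_quotes` and `QUOTE_MAP` are hypothetical names.

```python
# Translation table from curly quotes to ASCII quotes (illustrative subset).
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def normalize_quotes(text: str) -> str:
    """Hypothetical helper: replace curly quotes with straight ASCII quotes."""
    return text.translate(QUOTE_MAP)
```

Covering "more" quotes in a scheme like this amounts to adding entries to the table.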