0.13.3

scanny released this 21 Apr 04:01
· 376 commits to main since this release
305247b

Enhancements

  • Remove duplicate image elements. Remove image elements identified by PDFMiner that have similar bounding boxes and the same text.
  • Add support for start_index in HTML link extraction.
  • Add strategy arg value to _PptxPartitionerOptions. This makes the partitioning strategy available to future sub-partitioners that may optionally use inference or other expensive operations to improve partitioning.
  • Support pluggable sub-partitioner for PPTX Picture shapes. Use a distinct sub-partitioner for partitioning PPTX Picture (image) shapes and allow the default picture sub-partitioner to be replaced at run-time by one of the user's choosing.
  • Introduce starting_page_number parameter to partitioning functions. It applies to partitioners that support page_number in element metadata: PDF, TIFF, XLSX, DOC, DOCX, PPT, PPTX.
  • Redesign the internal mechanism of assigning element IDs. This allows for further enhancements related to element IDs, such as deterministic and document-unique hashes. The way partitioning functions operate hasn't changed, which means unique_element_ids continues to be False by default, utilizing text hashes.
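
As a rough illustration of the duplicate-image rule in the first enhancement above, the sketch below keeps an image element only if no already-kept element has the same text and a bounding box within a small tolerance. The `ImageElement` class, the function names, and the 2.0-point tolerance are all hypothetical and chosen for illustration; this is not the library's actual implementation or threshold.

```python
from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass(frozen=True)
class ImageElement:
    """Hypothetical stand-in for an image element found by PDFMiner."""
    text: str
    bbox: BBox


def boxes_similar(a: BBox, b: BBox, tolerance: float = 2.0) -> bool:
    """True when every coordinate differs by at most `tolerance` (assumed value)."""
    return all(abs(p - q) <= tolerance for p, q in zip(a, b))


def drop_duplicate_images(
    elements: List[ImageElement], tolerance: float = 2.0
) -> List[ImageElement]:
    """Keep the first element of each group sharing text and a similar bbox."""
    kept: List[ImageElement] = []
    for element in elements:
        is_duplicate = any(
            element.text == other.text
            and boxes_similar(element.bbox, other.bbox, tolerance)
            for other in kept
        )
        if not is_duplicate:
            kept.append(element)
    return kept


elements = [
    ImageElement("logo", (0.0, 0.0, 10.0, 10.0)),
    ImageElement("logo", (0.5, 0.0, 10.0, 10.5)),      # near-identical box, same text
    ImageElement("chart", (0.0, 0.0, 10.0, 10.0)),     # same box, different text
    ImageElement("logo", (100.0, 100.0, 110.0, 110.0)),  # same text, distant box
]
deduplicated = drop_duplicate_images(elements)
```

Requiring both conditions means a repeated logo at a different position on the page, or two different images that happen to overlap, are each kept.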

Features

Fixes

  • Add support for extracting text from tag tails in HTML. This fix adds the ability to generate separate elements using tag tails.
  • Add support for extracting text from <b> tags in HTML. Now partition_html() can extract text from <b> tags inside container tags (like <div>, <pre>).
  • Fix pip-compile make target. Add the base.in dependency that was missing from the requirements make file.
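
For readers unfamiliar with the term, a "tag tail" is the text that follows an element's closing tag but still sits inside its parent. The sketch below shows the idea using the standard library's `xml.etree.ElementTree`, which uses the same `text`/`tail` model as lxml. The `text_fragments` helper is hypothetical and only illustrates why tails need separate handling; it is not how partition_html() is implemented.

```python
import xml.etree.ElementTree as ET
from typing import List


def text_fragments(node: ET.Element) -> List[str]:
    """Collect non-empty text pieces in document order, including tag tails."""
    fragments: List[str] = []
    # Text between this node's opening tag and its first child.
    if node.text and node.text.strip():
        fragments.append(node.text.strip())
    for child in node:
        fragments.extend(text_fragments(child))
        # The tail is the text after the child's closing tag; it belongs to
        # the parent's content and is lost if only `.text` is read.
        if child.tail and child.tail.strip():
            fragments.append(child.tail.strip())
    return fragments


root = ET.fromstring("<div>Intro <b>bold text</b> trailing tail</div>")
fragments = text_fragments(root)
# fragments == ["Intro", "bold text", "trailing tail"]
```

Reading only `.text` on each node would yield "Intro" and "bold text" and silently drop "trailing tail", which is the kind of loss the first fix above addresses.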