Enhancements
Remove duplicate image elements. Remove image elements identified by PDFMiner that have similar bounding boxes and the same text.
Add support for start_index in HTML link extraction.
Add strategy arg value to _PptxPartitionerOptions. This makes this partitioning option available to future sub-partitioners, which may optionally use inference or other expensive operations to improve the partitioning.
Support pluggable sub-partitioner for PPTX Picture shapes. Use a distinct sub-partitioner for partitioning PPTX Picture (image) shapes and allow the default picture sub-partitioner to be replaced at run-time by one of the user's choosing.
Introduce starting_page_number parameter to partitioning functions. It applies to those partitioners which support page_number in an element's metadata: PDF, TIFF, XLSX, DOC, DOCX, PPT, PPTX.
Redesign the internal mechanism of assigning element IDs. This allows for further enhancements related to element IDs, such as deterministic and document-unique hashes. The way partitioning functions operate hasn't changed, which means unique_element_ids continues to be False by default, utilizing text hashes.
Features
Fixes
Add support for extracting text from tag tails in HTML. This fix adds the ability to generate separate elements using tag tails.
Add support for extracting text from <b> tags in HTML. Now partition_html() can extract text from <b> tags inside container tags (like <div>, <pre>).
Fix pip-compile make target. Add the missing base.in dependency to the requirements make file.
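For readers unfamiliar with the term, a "tag tail" is the text that follows an element's closing tag but still sits inside its parent. The sketch below illustrates the concept with the standard library's xml.etree.ElementTree, which exposes it as the tail attribute. It is not how partition_html() is implemented; the snippet and the collect_text helper are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Well-formed markup so the standard-library XML parser can read it.
snippet = (
    "<div>Intro text <b>bold text</b> tail after bold"
    "<p>A paragraph</p> tail after paragraph</div>"
)
root = ET.fromstring(snippet)


def collect_text(node):
    """Yield non-empty text fragments in document order, including tag tails."""
    if node.text and node.text.strip():
        yield node.text.strip()
    for child in node:
        yield from collect_text(child)
        # The tail is stored on the child but appears after its closing tag.
        if child.tail and child.tail.strip():
            yield child.tail.strip()


fragments = list(collect_text(root))
print(fragments)
# ['Intro text', 'bold text', 'tail after bold', 'A paragraph', 'tail after paragraph']
```

A walker that reads only each node's text would miss "tail after bold" and "tail after paragraph", which is the kind of content the first fix above recovers.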