#5265 - Extract paragraph structure from PDF files #5266

reckart · 2025-01-27T20:25:53Z

What's in the PR

Remove some unused legacy classes
Set up a basic HTML structure in the CASes extracted from PDF files, based on the paragraph detection provided by PDFBox
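
To illustrate the second item, here is a minimal sketch of deriving a basic HTML structure from detected paragraphs. It is not the code from this PR: the class `ParagraphHtmlSketch`, the record `Paragraph`, and the methods `toHtml` and `escape` are hypothetical names. The sketch assumes the paragraph boundaries are already available as character offsets into the extracted text (for example, from PDFBox's paragraph detection). It builds an HTML string, whereas the PR records the structure in the CAS, which is not shown here.

```java
import java.util.List;

// Hypothetical sketch: names and types are illustrative, not taken from the PR.
final class ParagraphHtmlSketch {

    /** A paragraph as a half-open [begin, end) character range in the extracted text. */
    record Paragraph(int begin, int end) {}

    /** Wraps each paragraph range in a <p> element inside a minimal html/body skeleton. */
    static String toHtml(String text, List<Paragraph> paragraphs) {
        StringBuilder html = new StringBuilder("<html><body>");
        for (Paragraph p : paragraphs) {
            html.append("<p>")
                .append(escape(text.substring(p.begin(), p.end())))
                .append("</p>");
        }
        return html.append("</body></html>").toString();
    }

    /** Escapes the characters that are significant in HTML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        // Assumed input: text plus paragraph offsets as a PDF extractor might report them.
        String text = "First paragraph.\nSecond paragraph.";
        List<Paragraph> paragraphs = List.of(new Paragraph(0, 16), new Paragraph(17, 34));
        System.out.println(toHtml(text, paragraphs));
    }
}
```

Keeping paragraphs as offset ranges rather than copied strings mirrors how span-based annotations reference the document text, so the same ranges could back both the HTML view and the annotations.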

How to test manually

Import PDF
Switch to Apache Annotator in the annotation editor

Automatic testing

PR includes unit tests

Documentation

PR updates documentation

reckart added the ⭐️ Enhancement (New feature or request), Module: PDF editor, and Module: Apache Annotator labels Jan 27, 2025

reckart added this to the 36.0 milestone Jan 27, 2025

reckart self-assigned this Jan 27, 2025

#5265 - Extract paragraph structure from PDF files

483c81f

- Remove some unused legacy classes
- Set up a basic HTML structure in the CASes extracted from PDF files based on the paragraph detection from pdfbox

reckart force-pushed the feature/5265-Extract-paragraph-structure-from-PDF-files branch from 7a3ad3e to 483c81f January 27, 2025 21:33

reckart added 2 commits January 27, 2025 22:37

No issue: Setup java already caches maven, so we don't need to explicitly cache it again (I think)

79a1778

#5265 - Extract paragraph structure from PDF files

1a2638d

- Added missing dependency

reckart merged commit 3d779a5 into main Jan 28, 2025
3 checks passed

reckart deleted the feature/5265-Extract-paragraph-structure-from-PDF-files branch January 28, 2025 05:29
