fix: `PDFMinerToDocument` convert function - adding double new lines between each `container_text` so that passages can be detected. #8729

davidsbatista · 2025-01-16T11:57:40Z

Related Issues

fixes Document Splitter always returns 1 document for split_type="passage" in pdfs #8491

Proposed Changes:

Add double newlines between each `container_text` so that passages can be detected.
Add tests for this change using PDF files from our test_files.
Add one more test for page detection to make sure this update doesn't change its behavior.
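
To illustrate why the separator matters, here is a minimal, standard-library-only sketch of the idea, not the actual `PDFMinerToDocument` code. The helper names (`join_containers`, `join_pages`, `split_passages`) are hypothetical, and the form-feed page delimiter is an assumption made for the example. It shows that a passage split on a blank line finds nothing to split on when container texts are concatenated directly, and that adding the separator leaves the page count unchanged.

```python
from typing import List

PASSAGE_SEP = "\n\n"  # blank line that a passage splitter looks for
PAGE_SEP = "\f"       # assumed page delimiter (form feed) for this sketch


def join_containers(container_texts: List[str], separator: str) -> str:
    """Join the non-empty text containers of one page (hypothetical helper)."""
    return separator.join(text for text in container_texts if text)


def join_pages(pages: List[List[str]], separator: str) -> str:
    """Join pages with the page delimiter (hypothetical helper)."""
    return PAGE_SEP.join(join_containers(page, separator) for page in pages)


def split_passages(text: str) -> List[str]:
    """Naive passage split on blank lines, dropping empty chunks."""
    return [chunk for chunk in text.split(PASSAGE_SEP) if chunk.strip()]


# Two containers on page 1, one container on page 2.
pages = [["First paragraph.\n", "Second paragraph.\n"], ["Third paragraph.\n"]]

before = join_pages(pages, separator="")          # containers concatenated directly
after = join_pages(pages, separator=PASSAGE_SEP)  # double newline between containers

print(len(split_passages(before)))  # number of passages found without the separator
print(len(split_passages(after)))   # number of passages found with the separator
print(len(before.split(PAGE_SEP)), len(after.split(PAGE_SEP)))  # page counts
```

In this sketch the separator is passed in explicitly so the two behaviours can be compared side by side; the real converter does not necessarily expose such a parameter.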

How did you test it?

manual verification and unit tests
CI tests

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
I documented my code
I ran pre-commit hooks and fixed any issue


coveralls · 2025-01-16T12:03:20Z

Pull Request Test Coverage Report for Build 12829355379

Details

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.004%) to 91.306%

Totals
Change from base Build 12829186498:	0.004%
Covered Lines:	8853
Relevant Lines:	9696

💛 - Coveralls

mpangrazzi

LGTM

davidsbatista · 2025-01-16T15:23:21Z

@anakin87 @julian-risch do you also want to have a quick look at this one?

anakin87

Looks good.

I would take the opportunity to also change the following line:

haystack/haystack/components/converters/pdfminer.py

Line 159 in 62ac27c

    
           pdf_reader = extract_pages(io.BytesIO(bytestream.data), laparams=self.layout_params)

`pdf_reader` is a misleading name here: `extract_pages` yields pages, not a reader.

davidsbatista added 2 commits January 16, 2025 11:44

initial import

099f71d

adding double new lines between container_texts so that passages can …

310def3

…be detected

github-actions bot added the topic:tests and type:documentation labels Jan 16, 2025

reducing type specification to avoid import error

8e96deb

adding release notes

8466b8c

davidsbatista changed the title ~~fix: pdfminer passage fix, adding double new lines between container_text so that passages can be detected~~ fix: PDFMinerToDocument convert function - adding double new lines between each container_text so that passages can be detected. Jan 16, 2025

davidsbatista mentioned this pull request Jan 16, 2025

Document Splitter always returns 1 document for split_type="passage" in pdfs #8491

Closed

davidsbatista marked this pull request as ready for review January 16, 2025 13:52

davidsbatista requested review from a team as code owners January 16, 2025 13:52

davidsbatista requested review from dfokina and mpangrazzi and removed request for a team January 16, 2025 13:52

mpangrazzi approved these changes Jan 16, 2025


anakin87 reviewed Jan 16, 2025


davidsbatista added 2 commits January 17, 2025 12:15

renaming variable

3915ff3

Merge branch 'main' into pdfminer-passage-fix

8d345f0

davidsbatista enabled auto-merge (squash) January 17, 2025 11:16

Merge branch 'main' into pdfminer-passage-fix

c125e81

davidsbatista merged commit 5af2888 into main Jan 17, 2025
18 checks passed

davidsbatista deleted the pdfminer-passage-fix branch January 17, 2025 13:01
