fix: PDFMinerToDocument convert function - adding double new lines between each container_text so that passages can be detected. #8729

Merged
davidsbatista merged 7 commits into main from pdfminer-passage-fix
Jan 17, 2025

Conversation

davidsbatista
Contributor

@davidsbatista davidsbatista commented Jan 16, 2025

Related Issues

Proposed Changes:

  • Add a double newline between consecutive container_text values so that passages can be detected
  • Add tests for this change using PDF files from our test_files
  • Add one more test for page detection to make sure this update doesn't change its behavior
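
A minimal sketch of the idea in the list above, not the actual PDFMinerToDocument code: container texts within a page are joined with a double newline, and pages are still joined with a form feed, so page detection stays unchanged. FakeContainer and join_pages are hypothetical names used only for illustration; in the real converter the containers would come from pdfminer.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class FakeContainer:
    """Hypothetical stand-in for a pdfminer text container."""

    text: str

    def get_text(self) -> str:
        return self.text


def join_pages(pages: Iterable[Iterable[FakeContainer]]) -> str:
    """Join container texts with a blank line and pages with a form feed."""
    page_texts: List[str] = []
    for page in pages:
        # Skip empty containers so they do not produce stray separators.
        container_texts = [c.get_text() for c in page if c.get_text()]
        page_texts.append("\n\n".join(container_texts))
    # Pages remain separated by a form feed, as before the change.
    return "\f".join(page_texts)


pages = [
    [FakeContainer("First paragraph."), FakeContainer("Second paragraph.")],
    [FakeContainer("Third paragraph.")],
]
text = join_pages(pages)
```

Here the two containers on the first page end up separated by a blank line, while the single form feed between the two pages is preserved.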

How did you test it?

  • manual verification and unit tests
  • CI tests

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

@github-actions github-actions bot added the topic:tests and type:documentation (Improvements on the docs) labels Jan 16, 2025
@coveralls
Collaborator

coveralls commented Jan 16, 2025

Pull Request Test Coverage Report for Build 12829355379

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.004%) to 91.306%

Totals Coverage Status
Change from base Build 12829186498: 0.004%
Covered Lines: 8853
Relevant Lines: 9696

💛 - Coveralls

@davidsbatista davidsbatista changed the title from "fix: pdfminer passage fix, adding double new lines between container_text so that passages can be detected" to "fix: PDFMinerToDocument convert function - adding double new lines between each container_text so that passages can be detected." Jan 16, 2025
@davidsbatista davidsbatista marked this pull request as ready for review January 16, 2025 13:52
@davidsbatista davidsbatista requested review from a team as code owners January 16, 2025 13:52
@davidsbatista davidsbatista requested review from dfokina and mpangrazzi and removed request for a team January 16, 2025 13:52
Contributor

@mpangrazzi mpangrazzi left a comment

LGTM

@davidsbatista
Contributor Author

@anakin87 @julian-risch do you also want to have a quick look at this one?

Member

@anakin87 anakin87 left a comment

Looks good.

I would take the opportunity to also change the following line:

pdf_reader = extract_pages(io.BytesIO(bytestream.data), laparams=self.layout_params)

pdf_reader is misleading here: these are pages

@davidsbatista davidsbatista enabled auto-merge (squash) January 17, 2025 11:16
@davidsbatista davidsbatista merged commit 5af2888 into main Jan 17, 2025
18 checks passed
@davidsbatista davidsbatista deleted the pdfminer-passage-fix branch January 17, 2025 13:01
Labels
topic:tests type:documentation Improvements on the docs
Development

Successfully merging this pull request may close these issues.

Document Splitter always returns 1 document for split_type="passage" in pdfs
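
The linked issue can be illustrated with a small sketch, assuming a passage splitter that breaks text on double newlines. split_passages is a hypothetical helper written for this example, not the library's splitter: when the converted text contains only single newlines, the whole text comes back as one passage.

```python
from typing import List


def split_passages(text: str) -> List[str]:
    """Split text on blank lines, dropping empty chunks (illustrative only)."""
    return [passage for passage in text.split("\n\n") if passage.strip()]


# Text as it might look without separators between containers (assumed input).
before = "First paragraph.\nSecond paragraph.\nThird paragraph."
# Text with a double newline between containers.
after = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."

before_count = len(split_passages(before))
after_count = len(split_passages(after))
print(before_count, after_count)  # prints "1 3"
```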
4 participants