Implementing Missing R2R Parsers, Image Ingestion #1702

Merged 5 commits on Dec 17, 2024

@NolanTrem (Collaborator) commented Dec 16, 2024

Important

Implement missing parsers for various file types, enhance image ingestion, and update ingestion tests to cover all supported file types and modes.

  • Parsers:
    • Add BMPParser, DOCParser, ODTParser, PPTParser, RTFParser for media files.
    • Add EMLParser, EPUBParser, MSGParser, ORGParser, P7SParser, RSTParser, TIFFParser, TSVParser, XLSParser for structured files.
  • Ingestion:
    • Enhance image ingestion with HEIC support in ImageParser.
    • Improve error handling for PDF parsing with PDFParsingError and PopplerNotFoundError.
  • Tests:
    • Add integration tests for ingestion covering all supported file types and modes in test_ingestion.py.
    • Add tests for chunk operations in test_chunks.py.
  • Dependencies:
    • Update pyproject.toml with new dependencies for added parsers.

This description was created by Ellipsis for 7e5394d. It will automatically update as commits are pushed.
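
For readers who have not seen the parser files, here is a rough sketch of what a tab-separated-values parser of this kind might look like. It is not the TSVParser added in this PR: the class name `SketchTSVParser`, the `ingest` method, its signature, and the comma-joined output format are all assumptions made for illustration.

```python
import asyncio
import csv
import io
from typing import AsyncGenerator, Union


class SketchTSVParser:
    """Hypothetical stand-in; NOT the TSVParser from this PR."""

    async def ingest(
        self, data: Union[str, bytes]
    ) -> AsyncGenerator[str, None]:
        # Accept raw bytes or already-decoded text.
        if isinstance(data, bytes):
            data = data.decode("utf-8")
        # newline="" lets the csv module handle line endings itself.
        reader = csv.reader(io.StringIO(data, newline=""), delimiter="\t")
        for row in reader:
            # csv.reader yields an empty list for blank lines; skip those.
            if row:
                yield ", ".join(row)


async def collect(data: Union[str, bytes]) -> list:
    # Drain the async generator into a list of text chunks.
    return [chunk async for chunk in SketchTSVParser().ingest(data)]


rows = asyncio.run(collect(b"name\tage\nAda\t36\n"))
```

Yielding one string per row keeps memory use flat for large files; whether the real parsers stream per row or per file is not stated in this thread.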

@NolanTrem NolanTrem marked this pull request as ready for review December 17, 2024 01:05
@NolanTrem NolanTrem merged commit 0cee9e4 into main Dec 17, 2024
0 of 2 checks passed
@ellipsis-dev ellipsis-dev bot (Contributor) left a comment

👍 Looks good to me! Reviewed everything up to 7e5394d in 2 minutes and 28 seconds

More details
  • Looked at 5244 lines of code in 69 files.
  • Skipped 15 files when reviewing.
  • Skipped posting 1 drafted comment based on config settings.
1. py/core/parsers/media/audio_parser.py:1
  • Draft comment:
    The import statement for 'base64' is unnecessary and can be removed as it is not used in the code.
  • Reason this comment was not posted:
    Confidence changes required: 10%
    The import statement for 'base64' is unnecessary in the audio_parser.py file as it is not used anywhere in the code.
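
A claim like the one in this draft comment can be checked mechanically. The sketch below uses the standard-library `ast` module to list imported names that are never referenced in a module. The helper name `unused_imports` and the sample source are hypothetical, and the check is deliberately rough: it ignores `__all__`, names referenced only inside strings, and star imports.

```python
import ast


def unused_imports(source: str) -> list[str]:
    """Return imported names that never appear as a Name node in `source`."""
    tree = ast.parse(source)
    imported = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # `import a.b` binds the top-level name `a`.
                imported.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                if alias.name != "*":  # star imports cannot be resolved here
                    imported.add(alias.asname or alias.name)
    # Every bare identifier used anywhere in the module.
    used = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    return sorted(imported - used)


# Hypothetical sample, not the contents of audio_parser.py.
sample = "import base64\nimport os\n\nprint(os.getcwd())\n"
result = unused_imports(sample)
```

Linters such as pyflakes perform the same check more thoroughly; the sketch only shows the idea.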

Workflow ID: wflow_0qYjrGLH64VnPwJA


@NolanTrem NolanTrem deleted the Nolan/ImageIngestion branch December 17, 2024 16:54