fix: parsing pdf error - new_cells as str has no "copy" #3130

Merged
merged 5 commits into main from fix/3119-pdf-empty-table-cell
Jun 3, 2024

Conversation

Collaborator

@christinestraub christinestraub commented May 31, 2024

Closes #3119.

Testing

Parsing the provided PDF should be successful.

testing_brochure_2.pdf

from unstructured.partition.pdf import partition_pdf

filename = "testing_brochure_2.pdf"
with open(filename, "rb") as pdf_content:
    elements = partition_pdf(
        file=pdf_content,
        infer_table_structure=True,
        extract_image_block_types=["Image", "Table"],
        chunking_strategy="by_title",
        max_characters=1000,
        new_after_n_chars=3000,
        combine_text_under_n_chars=1000,
    )
print("\n\n".join([str(el) for el in elements]))
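
For context, the error named in the title is the kind of AttributeError Python raises when code calls .copy() on a str where it expected a list. The snippet below is only a minimal sketch of that failure mode and one possible guard. The helper name copy_cells and the choice to treat a str as "no cells" are assumptions made for illustration, not the code changed in this PR.

```python
def copy_cells(new_cells):
    """Return a shallow copy of a list of cells.

    Hypothetical helper for illustration only; not part of unstructured.
    """
    if isinstance(new_cells, str):
        # A str has no .copy() method, so calling it would raise AttributeError.
        # Assumption: a str here means there are no usable cells.
        return []
    return new_cells.copy()


# Reproduce the failure mode from the PR title in isolation.
try:
    "".copy()
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
print(copy_cells(""))
print(copy_cells([{"row": 0, "text": "a"}]))
```

Whether an empty list is the right fallback depends on how the caller uses the cells, so treat the guard as a starting point rather than the fix that was merged.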

@cragwolfe
Contributor

Please add file to CI or in an ingest CI test (which has the added benefit of the outputs being browseable).

Contributor

@MthwRobinson MthwRobinson left a comment

LGTM!

@christinestraub christinestraub added this pull request to the merge queue Jun 3, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 3, 2024
@cragwolfe
Contributor

Please add file to CI or in an ingest CI test (which has the added benefit of the outputs being browseable).

Per slack convo, this can be addressed in a separate PR.

@christinestraub christinestraub added this pull request to the merge queue Jun 3, 2024
Merged via the queue into main with commit 1dede50 Jun 3, 2024
46 checks passed
@christinestraub christinestraub deleted the fix/3119-pdf-empty-table-cell branch June 3, 2024 19:26
Development

Successfully merging this pull request may close these issues.

bug/parsing pdf error - new_cells as str has no "copy"
3 participants