rfctr: prep for pluggable partitioners #3806

Merged 16 commits on Dec 10, 2024

scanny (Collaborator) commented Dec 3, 2024

Summary
Prepare auto-partitioning for pluggable partitioners.

Move toward a uniform partitioner call signature in auto/partition() such that a custom or override partitioner can be registered without requiring code changes.

Additional Context
The central job of auto/partition() is to detect the file-type of the given file and use that to dispatch partitioning to the corresponding partitioner function e.g. partition_pdf() or partition_docx().

In the existing code, each partitioner function is called with parameters "hand-picked" from those passed to the partition() function. This is unnecessary and couples the partitioners tightly to the dispatch function. The desired state is that all available arguments are passed as kwargs and each partitioner function "self-selects" the arguments it is sensitive to, applies its own default when an argument is omitted, and simply ignores any arguments it doesn't use. Note that achieving this requires no changes to the partitioner functions because they already do precisely this.
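A minimal sketch of the "self-selecting" behavior described above. The two functions and their parameters (`encoding`, `infer_table_structure`) are hypothetical stand-ins chosen for illustration, not the library's actual partitioners, and the return type is simplified to strings:

```python
from __future__ import annotations

from typing import IO, Any


def partition_text_sketch(
    filename: str | None = None,
    file: IO[bytes] | None = None,
    *,
    encoding: str = "utf-8",  # this partitioner's own default
    **kwargs: Any,  # arguments it does not use are accepted and ignored
) -> list[str]:
    """Hypothetical partitioner that only cares about `encoding`."""
    return [f"text:{filename}:{encoding}"]


def partition_table_sketch(
    filename: str | None = None,
    file: IO[bytes] | None = None,
    *,
    infer_table_structure: bool = False,
    **kwargs: Any,
) -> list[str]:
    """Hypothetical partitioner that only cares about `infer_table_structure`."""
    return [f"table:{filename}:{infer_table_structure}"]


# The caller forwards the same keyword arguments to either function without
# knowing which ones each function actually consumes.
shared_kwargs = {"encoding": "latin-1", "infer_table_structure": True, "strategy": "fast"}
text_result = partition_text_sketch(filename="a.txt", **shared_kwargs)
table_result = partition_table_sketch(filename="b.xlsx", **shared_kwargs)
```

Because each function declares `**kwargs`, the caller never has to hand-pick arguments per file type.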

So the job is to pass all arguments (other than filename and file) to the partitioner as kwargs. This will allow additional or alternate partitioners to be registered at runtime and dispatched to, because any partitioner with the signature partition_x(filename, file, **kwargs) -> list[Element] can be dispatched to without customization.
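To illustrate where this is heading, here is a rough sketch of runtime registration built on that uniform signature. This PR only prepares for such a mechanism; `register_partitioner`, `dispatch`, the registry dict, and the `Element` stand-in below are all assumptions made for the example, not existing API:

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import IO, Any, Callable


@dataclass
class Element:
    """Stand-in for the library's element type, for illustration only."""

    text: str


# Any callable accepting (filename, file, **kwargs) and returning elements.
Partitioner = Callable[..., list[Element]]

# Hypothetical registry mapping a file-type key to its partitioner.
_PARTITIONERS: dict[str, Partitioner] = {}


def register_partitioner(file_type: str, partitioner: Partitioner) -> None:
    """Register (or override) the partitioner used for `file_type`."""
    _PARTITIONERS[file_type] = partitioner


def dispatch(
    file_type: str,
    filename: str | None = None,
    file: IO[bytes] | None = None,
    **kwargs: Any,
) -> list[Element]:
    """Look up the partitioner and forward every remaining argument to it."""
    try:
        partitioner = _PARTITIONERS[file_type]
    except KeyError:
        raise ValueError(f"no partitioner registered for {file_type!r}") from None
    return partitioner(filename=filename, file=file, **kwargs)


def partition_custom(filename=None, file=None, *, prefix: str = "", **kwargs: Any):
    """Hypothetical custom partitioner; ignores arguments it does not use."""
    return [Element(text=f"{prefix}{filename}")]


register_partitioner("custom", partition_custom)
elements = dispatch("custom", filename="report.xyz", prefix=">> ", strategy="fast")
```

The dispatcher needs no knowledge of `prefix` or `strategy`; it only relies on the shared call signature.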

@scanny scanny force-pushed the scanny/pluggable-partitioner-prep branch 2 times, most recently from 7255b54 to 1584d00 Compare December 9, 2024 19:44
@scanny scanny requested a review from Coniferish December 9, 2024 19:49
@scanny scanny force-pushed the scanny/pluggable-partitioner-prep branch from 1584d00 to 45e3fc6 Compare December 9, 2024 22:42
Legacy predecessor to `metadata_filename`. Deprecated for over a year.

Add the flexibility to have multiple return-points within `partition()`
by extracting the post-processing performed after partitioning to a
function that can be called from whatever exit-points are convenient.

These are going to be harder to unify the signature of, if we ever do,
so get these out of the way up front.
Also remove unused public property of `HtmlPartitionerOptions` noticed
while we were in there.

Now that the call signature for non-PDF/Image file-types is uniform,
dispatch can be reduced to a single call.

This sets the stage for calling a partitioner unknown at compile-time,
one registered at run-time.
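
Taken together, the commits above suggest a shape roughly like the following sketch: an early exit for the file types that keep special handling, a single uniform call for the rest, and a shared post-processing helper invoked at each exit point. All names here are hypothetical, elements are simplified to dicts, and what the real post-processing does is not stated in this PR; applying `metadata_filename` is only an assumed example:

```python
from __future__ import annotations

from typing import Any, Callable


def _post_process(elements: list[dict], *, metadata_filename: str | None = None) -> list[dict]:
    """Hypothetical post-processing shared by every exit point."""
    if metadata_filename is not None:
        for element in elements:
            element["filename"] = metadata_filename
    return elements


def _partition_pdf_sketch(filename, file, *, strategy: str = "auto", **kwargs: Any) -> list[dict]:
    """Stand-in for a partitioner that keeps its own special call."""
    return [{"text": f"pdf:{strategy}"}]


def _partition_generic_sketch(filename=None, file=None, **kwargs: Any) -> list[dict]:
    """Stand-in for any partitioner with the uniform signature."""
    return [{"text": "generic"}]


# Hypothetical table of partitioners that share the uniform signature.
_UNIFORM: dict[str, Callable[..., list[dict]]] = {
    "docx": _partition_generic_sketch,
    "txt": _partition_generic_sketch,
}


def partition_sketch(file_type, filename=None, file=None, *, metadata_filename=None, **kwargs):
    # First exit point: a file type that still needs hand-picked arguments.
    if file_type == "pdf":
        strategy = kwargs.pop("strategy", "auto")
        elements = _partition_pdf_sketch(filename, file, strategy=strategy, **kwargs)
        return _post_process(elements, metadata_filename=metadata_filename)

    # Second exit point: one call covers every uniform-signature partitioner.
    partitioner = _UNIFORM[file_type]
    elements = partitioner(filename=filename, file=file, **kwargs)
    return _post_process(elements, metadata_filename=metadata_filename)
```

Replacing the `_UNIFORM` lookup with a runtime registry would be the natural next step, since the call site would not change.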
@scanny scanny force-pushed the scanny/pluggable-partitioner-prep branch from 45e3fc6 to 21342da Compare December 10, 2024 19:30
@scanny scanny added this pull request to the merge queue Dec 10, 2024
Merged via the queue into main with commit 3b718ec Dec 10, 2024
41 checks passed
@scanny scanny deleted the scanny/pluggable-partitioner-prep branch December 10, 2024 21:23

sentry-io bot commented Dec 16, 2024

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ RuntimeError: Pandoc died with exitcode "64" during conversion: Unable to find element: QName {qName = "metadat... /general/v0/general
