Enhancements
Add intra-chunk overlap capability. Implements overlap for split-chunks, where text-splitting divides an oversized chunk into two or more chunks that fit in the chunking window. This capability is not yet available from the API but will shortly be made accessible through a new overlap kwarg on partition functions.
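The idea of intra-chunk overlap can be sketched as follows. This is a minimal illustration, not the library's implementation: the function name `split_with_overlap` and its parameters are hypothetical, and it splits on raw character offsets rather than on word or sentence boundaries.

```python
from typing import List


def split_with_overlap(text: str, max_characters: int, overlap: int) -> List[str]:
    """Split `text` into pieces of at most `max_characters` characters.

    Hypothetical helper: each piece after the first starts with the last
    `overlap` characters of the piece before it.
    """
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be >= 0 and smaller than max_characters")
    if len(text) <= max_characters:
        # Text already fits in the window, so no splitting is needed.
        return [text]

    chunks: List[str] = []
    step = max_characters - overlap  # how far the window advances each time
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break  # this window reached the end of the text
        start += step
    return chunks


pieces = split_with_overlap("abcdefghij", max_characters=4, overlap=1)
```

Here the ten-character string is divided into three windows of four characters, and each window repeats the final character of the previous one.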
Update encoders to leverage dataclasses. All encoders now follow a class-based approach and are annotated with the dataclass decorator. Similar to the connectors, each encoder uses a nested dataclass for the configs required to configure a client, as well as a field/property approach to cache the client. This ensures that any variable associated with the class exists as a dataclass field.
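A rough sketch of the pattern described above, under stated assumptions: `FakeClient`, `ExampleEncoderConfig`, and `ExampleEncoder` are hypothetical names invented for illustration and do not correspond to actual classes in the library.

```python
from dataclasses import dataclass, field
from typing import List, Optional


class FakeClient:
    """Hypothetical stand-in for a real embedding client."""

    def __init__(self, api_key: str, model_name: str) -> None:
        self.api_key = api_key
        self.model_name = model_name

    def embed(self, texts: List[str]) -> List[List[float]]:
        # Toy "embedding": a one-dimensional vector holding the text length.
        return [[float(len(t))] for t in texts]


@dataclass
class ExampleEncoderConfig:
    """Nested config holding everything needed to build the client."""

    api_key: str
    model_name: str = "example-model"


@dataclass
class ExampleEncoder:
    config: ExampleEncoderConfig
    # The cached client is itself a dataclass field, excluded from __init__.
    _client: Optional[FakeClient] = field(default=None, init=False, repr=False)

    @property
    def client(self) -> FakeClient:
        # Build the client on first access, then reuse the cached instance.
        if self._client is None:
            self._client = FakeClient(self.config.api_key, self.config.model_name)
        return self._client

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return self.client.embed(texts)


encoder = ExampleEncoder(config=ExampleEncoderConfig(api_key="not-a-real-key"))
vectors = encoder.embed_documents(["ab", "abc"])
```

Keeping the cached client as a declared field means all state on the encoder is visible through the dataclass machinery, which appears to be the motivation stated in the entry.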
Features
Add Qdrant destination connector. Adds support for writing documents and embeddings into a Qdrant collection.
Store base64-encoded image data in metadata fields. Rather than saving to file, stores the base64-encoded image bytes and the image's mimetype in the metadata fields image_base64 and image_mime_type (if the user requests this through another parameter such as pdf_extract_to_payload). This would allow the API to have parity with the library.
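As an illustration of what the two metadata fields might hold, the sketch below encodes image bytes with the standard library and decodes them again. The helper names `image_metadata` and `decode_image` are hypothetical; only the field names image_base64 and image_mime_type come from the entry above.

```python
import base64
from typing import Dict


def image_metadata(image_bytes: bytes, mime_type: str) -> Dict[str, str]:
    """Hypothetical helper: build the two metadata fields for an image."""
    return {
        # b64encode returns bytes; decode to str so the value is JSON-friendly.
        "image_base64": base64.b64encode(image_bytes).decode("ascii"),
        "image_mime_type": mime_type,
    }


def decode_image(metadata: Dict[str, str]) -> bytes:
    """Hypothetical helper: recover the original image bytes."""
    return base64.b64decode(metadata["image_base64"])


# The first eight bytes of any PNG file, used here as sample data.
png_signature = b"\x89PNG\r\n\x1a\n"
meta = image_metadata(png_signature, "image/png")
restored = decode_image(meta)
```

A consumer of the metadata can use image_mime_type to decide how to interpret the decoded bytes without needing access to a file on disk.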
Fixes
Fix table structure metric script. Updates the call to the table agent so that it provides OCR tokens as required.
Fix element extraction not working when using the "auto" strategy for PDFs and images. If element extraction is specified, the "auto" strategy falls back to the "hi_res" strategy.
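The fallback rule can be expressed as a small decision function. This is only a sketch of the stated behavior: `resolve_strategy` and its `element_extraction_requested` parameter are hypothetical names, not the library's internal API.

```python
def resolve_strategy(strategy: str, element_extraction_requested: bool) -> str:
    """Hypothetical helper mirroring the rule described above.

    When the caller asks for "auto" and also requests element extraction,
    choose "hi_res"; otherwise leave the requested strategy unchanged.
    """
    if strategy == "auto" and element_extraction_requested:
        return "hi_res"
    return strategy


chosen = resolve_strategy("auto", element_extraction_requested=True)
```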
Fix a bug when passing a custom URL to partition_via_api. Users who self-host the API were not able to pass their custom URL to partition_via_api.