How to process an image chunk #137

Open
mattlindsey opened this issue Nov 12, 2024 · 2 comments

Comments

@mattlindsey
Contributor

Now that a vision model can be specified in the settings and Archyve can ingest a jpg document into a single chunk, I think that I need some guidance on what to do with it next.

@oxaroky02 said I should "use Setting.vision_model during parsing to get a model and use that with the LLM client API helper to ask for a description via the #image method (See spec/lib/llm_clients/ollama/request_helper_spec.rb line 94)", which sounds good. But I am unclear on which field in the Chunks table should hold the description so that it gets embedded properly and the entity gets created properly (if knowledge graph is enabled).

I have started taking a stab at this, but could use a little help with the Chunks table. I am also wondering whether we need a field in the Chunks table to indicate the 'type' of chunk so that the jobs know how to process it. In this case we have an image chunk, which will be processed differently than text, so how do we indicate that?

@oxaroky02
Collaborator

Hola @mattlindsey. Also, @nickthecook, let me know if I'm on the right track here.

The current ingest flow takes either a web link or uploaded document and then runs it through "document chunking". This makes sense when the content from the web or document yields textual content in some format.

When the link/document is an image (or audio or video ...) then we need to introduce a flow that can track the media, transform it into textual content and then run that through the chunking.

See #136 where we just separated the current (Fetch | Upload) -> Chunk flow out of the document controller into Mediator#ingest under apps/services.

The next step we're planning is to convert that into a (Fetch | Upload) -> Convert? -> Chunk flow, where the optional conversion step will detect whether the content is not text and, for supported formats, convert it to text.
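
A minimal sketch of that (Fetch | Upload) -> Convert? -> Chunk shape might look like the following. Everything here is an assumption made for illustration: `TEXT_TYPES`, `CONVERTERS`, `convert_if_needed`, `chunk`, and `ingest` are hypothetical names, not Archyve's actual `Mediator#ingest` code, and the image converter is a placeholder lambda rather than a real vision model call.

```ruby
# Hypothetical sketch; none of these names come from Archyve itself.

# Content types that are already text and need no conversion.
TEXT_TYPES = %w[text/plain text/html text/markdown application/json].freeze

# Hypothetical converters keyed by content-type prefix.
# Each one takes raw bytes and returns text.
CONVERTERS = {
  "image/" => ->(bytes) { "[image description placeholder, #{bytes.bytesize} bytes]" }
}.freeze

# The optional "Convert?" step: pass text through, convert supported
# non-text formats, and reject everything else.
def convert_if_needed(content, content_type)
  return content if TEXT_TYPES.include?(content_type)

  _prefix, converter = CONVERTERS.find { |prefix, _| content_type.start_with?(prefix) }
  raise ArgumentError, "unsupported content type: #{content_type}" unless converter

  converter.call(content)
end

# Naive fixed-size chunking, standing in for the real document chunker.
def chunk(text, size: 40)
  text.scan(/.{1,#{size}}/m)
end

# Convert first (if needed), then chunk the resulting text.
def ingest(content, content_type)
  chunk(convert_if_needed(content, content_type))
end
```

With this shape the chunker only ever sees text, so supporting a new media type would mean adding a converter rather than touching the chunking code.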

Once this extra step is in place, the work you started for images would fit in here. Instead of going image -> chunking, the image handling would be a "converter" that uses the vision model to produce "text", which would then go through the chunking separately.
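
As a rough illustration of the "converter" idea, the sketch below wraps a vision client and returns plain text that could be handed to the chunker. The names are hypothetical: `ImageConverter`, `FakeVisionClient`, and `describe_image` are stand-ins invented for this example and do not reflect the signature of the `#image` method on Archyve's LLM client helper.

```ruby
# Hypothetical converter: turns image bytes into text by asking a vision model.
# `client` is any object responding to #describe_image(bytes, prompt). That
# method name is an assumption, not Archyve's actual LLM client helper API.
class ImageConverter
  PROMPT = "Describe this image in detail."

  def initialize(client)
    @client = client
  end

  # Returns the model's description as stripped text, or raises if it is empty.
  def convert(image_bytes)
    description = @client.describe_image(image_bytes, PROMPT).to_s.strip
    raise "vision model returned no description" if description.empty?

    description
  end
end

# Stub client used only for illustration; a real one would call the model
# selected via the vision model setting.
class FakeVisionClient
  def describe_image(_bytes, _prompt)
    "A red bicycle leaning against a brick wall."
  end
end

text = ImageConverter.new(FakeVisionClient.new).convert("fake-bytes")
puts text
```

Because the converter's output is ordinary text, it would not need a separate chunk "type" to be embedded; it could flow through the same chunking and embedding jobs as any other document.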

Let me know if I can clarify this; I may have missed some detail in my head that didn't get transcribed above. 😄

@mattlindsey
Contributor Author

Hi @oxaroky02. Sounds good to me. It seems like you should work on these next steps, but let me know if I can help! :)

2 participants