Provide new chunking strategies in localdocs #2635

manyoso · 2024-07-10T14:12:56Z

Currently we do a very simple character/word-based chunking. We should enhance our chunking strategies, possibly to include:

Recursive Character Chunking
Token Based Chunking
Document Specific Chunking (HTML, MD, Python, CPP, etc)
Semantic Chunking
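
For reference, a minimal sketch of what the first item, recursive character chunking, usually means: try the coarsest separator first (paragraph breaks), and only fall back to finer ones (lines, sentences, words, then a hard cut) for pieces that are still too long. This is an illustration only, not the localdocs implementation; the function name `recursive_chunk`, the `max_chars` default, and the separator list are all assumptions made for the example.

```python
from typing import List, Sequence


def recursive_chunk(
    text: str,
    max_chars: int = 200,
    separators: Sequence[str] = ("\n\n", "\n", ". ", " "),
) -> List[str]:
    """Split text into chunks of at most max_chars characters.

    Hypothetical helper: separators are tried from coarse to fine, and a
    separator that ends up on a chunk boundary is dropped from the output.
    """
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: fall back to a hard cut.
        cuts = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
        return [c for c in cuts if c.strip()]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= max_chars:
            # Keep merging neighbouring pieces while they still fit.
            current = candidate
            continue
        if current.strip():
            chunks.append(current)
        if len(piece) <= max_chars:
            current = piece
        else:
            # This piece alone is too long: recurse with finer separators.
            chunks.extend(recursive_chunk(piece, max_chars, finer))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "Short paragraph.\n\nA longer paragraph that will not fit. It has two sentences."
    for chunk in recursive_chunk(sample, max_chars=40):
        print(repr(chunk))
```

Token based chunking follows the same shape, except that lengths are measured in tokens from the embedding model's tokenizer rather than in characters.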

Here is some possible literature:

ThiloteE · 2024-07-10T20:51:53Z

Semantic Chunking in practice: https://boudhayan-dev.medium.com/semantic-chunking-in-practice-23a8bc33d56d
Basic RAG vs Advanced RAG: https://medium.com/llamaindex-blog/a-cheat-sheet-and-some-recipes-for-building-advanced-rag-803a9d94c41b

I also think a very natural long-term goal for GPT4All could be responses based on agents that use knowledge graphs fed in via RAG using Nomic Maps (but that goes beyond a simple "chunking strategy").

kalle07 · 2024-07-20T10:05:49Z

Is it at least possible to easily change the embedding model?
(I don't know for sure; English and Chinese seem OK, but maybe 5% of users are German.)
https://huggingface.co/aari1995/German_Semantic_V3


ThiloteE · 2024-08-17T11:09:29Z

Somebody went all in on RegEx lol

Jina AI
Based. Semantic chunking is overrated. Especially when you write a super regex that leverages all possible boundary cues and heuristics to segment text accurately without the need for complex language models. Just think about the speed and the hosting cost. This 50-line, 2490-character regex is as powerful as it can be within the limitations of regex.

Source: https://x.com/JinaAI_/status/1823756993108304135
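
To make the idea concrete, here is a much smaller sketch of regex-based chunking. It is not Jina AI's regex, which is not reproduced in this thread; the `BOUNDARY` pattern, the function name `regex_chunk`, and the `max_chars` default are assumptions for illustration. It splits on sentence-ending punctuation and blank lines, then packs the segments greedily into chunks.

```python
import re
from typing import List

# Toy boundary pattern (an assumption, far simpler than the regex quoted above):
# whitespace that follows ., ! or ? and precedes a capital letter, digit or
# quote, or a run of two or more newlines (a paragraph break).
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z0-9\"'])|\n{2,}")


def regex_chunk(text: str, max_chars: int = 120) -> List[str]:
    """Split on BOUNDARY, then greedily merge segments up to max_chars.

    A single segment longer than max_chars is kept whole, so chunks can
    exceed the limit when the text has few boundary cues.
    """
    segments = [s.strip() for s in BOUNDARY.split(text) if s.strip()]
    chunks: List[str] = []
    current = ""
    for seg in segments:
        candidate = f"{current} {seg}" if current else seg
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = seg
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "First sentence. Second one! Third?\n\nNew paragraph here."
    print(regex_chunk(sample, max_chars=30))
```

The appeal is speed and zero model cost; the weakness is that it only works as well as the boundary cues present in the extracted text.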

manyoso · 2024-08-17T12:27:54Z

> Somebody went all in on RegEx lol
>
> Jina AI
> Based. Semantic chunking is overrated. Especially when you write a super regex that leverages all possible boundary cues and heuristics to segment text accurately without the need for complex language models. Just think about the speed and the hosting cost. This 50-line, 2490-character regex is as powerful as it can be within the limitations of regex.
> Source: https://x.com/JinaAI_/status/1823756993108304135

This assumes the text is even structured properly. The PDF text we get right now doesn't really have much formatting for a regex to work with.

manyoso added the enhancement (New feature or request), local-docs, and chat (gpt4all-chat issues) labels Jul 10, 2024

manyoso added a commit that referenced this issue Aug 16, 2024

Implements recursive text based chunking.

d98f34f

Issue: #2635 Signed-off-by: Adam Treat <treat.adam@gmail.com>

manyoso mentioned this issue Aug 16, 2024

Implements recursive text based chunking. #2879

Draft

manyoso self-assigned this Aug 16, 2024
