Provide new chunking strategies in localdocs #2635

Open
manyoso opened this issue Jul 10, 2024 · 4 comments

@manyoso
Collaborator

manyoso commented Jul 10, 2024

Currently we use a very simple character/word-based chunking. We should enhance our chunking strategies to possibly include:

  • Recursive Character Chunking
  • Token Based Chunking
  • Document Specific Chunking (HTML, MD, Python, CPP, etc)
  • Semantic Chunking
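
To make the first item concrete, here is a minimal sketch of recursive character chunking in Python. It is an illustration only, not the localdocs implementation: the function name `recursive_chunk`, the `max_len` default, and the separator order are all assumptions. The idea is to split on the coarsest separator first (paragraphs), fall back to finer ones (lines, then words) only for pieces that are still too long, and hard-cut as a last resort.

```python
def recursive_chunk(text, max_len=200, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most max_len characters.

    Hypothetical helper for illustration: tries each separator in order,
    recursing with finer separators only for pieces that are still too long.
    """
    if len(text) <= max_len:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character offsets.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= max_len:
            # Keep merging neighbouring pieces while they fit.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= max_len:
            current = part
        else:
            # This piece alone is too long: retry with finer separators.
            chunks.extend(recursive_chunk(part, max_len, finer))
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for chunk in recursive_chunk("aaa bbb ccc\n\nddd eee", max_len=7):
        print(repr(chunk))
```

Token-based chunking would follow the same shape, with `len()` swapped for a token count from the embedding model's tokenizer.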

Here is some possible literature:

@manyoso manyoso added enhancement New feature or request local-docs chat gpt4all-chat issues labels Jul 10, 2024
@ThiloteE
Collaborator

ThiloteE commented Jul 10, 2024

Semantic Chunking in practice: https://boudhayan-dev.medium.com/semantic-chunking-in-practice-23a8bc33d56d
Basic RAG vs Advanced RAG: https://medium.com/llamaindex-blog/a-cheat-sheet-and-some-recipes-for-building-advanced-rag-803a9d94c41b

I also think a very natural long term goal for GPT4All could be having responses based on Agents using knowledge graphs fed in via RAG using Nomic Maps (but that goes beyond a simple "chunking strategy").

@kalle07

kalle07 commented Jul 20, 2024

Is it at least possible to easily change embedding models?
(I don't know, English and Chinese seem OK, but maybe 5% of users are German.)
https://huggingface.co/aari1995/German_Semantic_V3

manyoso added a commit that referenced this issue Aug 16, 2024
Issue: #2635

Signed-off-by: Adam Treat <treat.adam@gmail.com>
@manyoso manyoso self-assigned this Aug 16, 2024
@ThiloteE
Collaborator

Somebody went all in on RegEx lol

Jina AI
Based. Semantic chunking is overrated. Especially when you write a super regex that leverages all possible boundary cues and heuristics to segment text accurately without the need for complex language models. Just think about the speed and the hosting cost. This 50-line, 2490-character regex is as powerful as it can be within the limitations of regex.

Source: https://x.com/JinaAI_/status/1823756993108304135

@manyoso
Collaborator Author

manyoso commented Aug 17, 2024

Somebody went all in on RegEx lol

Jina AI
Based. Semantic chunking is overrated. Especially when you write a super regex that leverages all possible boundary cues and heuristics to segment text accurately without the need for complex language models. Just think about the speed and the hosting cost. This 50-line, 2490-character regex is as powerful as it can be within the limitations of regex.
Source: https://x.com/JinaAI_/status/1823756993108304135

This assumes the text is even structured properly. The PDF text we get right now does not really have much formatting to regex on.
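
A rough Python sketch of the trade-off, for illustration only. This is not the Jina AI regex and not what localdocs does; the `BOUNDARY` pattern and the `regex_chunk` name are assumptions. It splits on sentence-ending punctuation and blank lines, then packs the segments into chunks of up to `max_len` characters. When the input has no such cues, as with poorly extracted PDF text, the whole input comes back as one oversized chunk.

```python
import re

# Assumed boundary cues: whitespace after ., ! or ?, or a blank line.
BOUNDARY = re.compile(r"(?<=[.!?])\s+|\n{2,}")


def regex_chunk(text, max_len=200):
    """Split on regex boundaries, then greedily pack segments into chunks.

    Hypothetical helper: a segment longer than max_len is passed through
    unsplit, which is what happens when the text has no usable structure.
    """
    segments = [s.strip() for s in BOUNDARY.split(text) if s and s.strip()]
    chunks, current = [], ""
    for seg in segments:
        candidate = seg if not current else current + " " + seg
        if len(candidate) <= max_len or not current:
            current = candidate
        else:
            chunks.append(current)
            current = seg
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    structured = "First sentence. Second one! Third?\n\nNew paragraph without period"
    print(regex_chunk(structured, max_len=30))
    # No boundary cues at all: one chunk, longer than max_len.
    print(regex_chunk("no punctuation here at all", max_len=10))
```

A real pipeline would need a fallback, such as a fixed-size or recursive split, for segments that exceed the limit.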
