Proposition of alternative to RecursiveCharacterTextSplitter (RCTS) #27369
ArturOle started this conversation in Show and tell
TL;DR: While tinkering with my project I wrote a new text splitter, and I'm wondering whether LangChain would be interested in adding it to its codebase. In my tests so far it is about 3X faster than RCTS and scales linearly. I will prepare tests, refactor, and adapt everything to fit your "style" if needed. Tell me what you think and whether I should continue working towards a PR.
About the solution
My project had just one dependency on langchain, the RecursiveCharacterTextSplitter, and it had been bothering me for some time. I decided to replace the RCTS with something of my own and optimize it a little, so I created a non-recursive version. First I made a simple text splitter with static splits, then extended it with overlaps, and finally introduced a margin to shrink the search space for "good" split points. The algorithm operates on indexes most of the time and inspects characters only in specific sections of the text (just before the chunk size limit of each chunk, overlap included).
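To make the idea concrete, here is a minimal sketch of an iterative splitter with a fixed chunk size, an overlap, and a margin. It is not the TextSplitterV3 code from the notebook: the function name `split_text`, its parameter names, the default separators, and the choice to keep the separator at the end of a chunk are all assumptions made for illustration.

```python
def split_text(text, chunk_size=100, overlap=20, margin=30, separators=(" ", "\n")):
    """Split `text` iteratively into chunks of at most `chunk_size` characters.

    Hypothetical sketch: only the last `margin` characters before each
    chunk limit are searched for a separator; everything else is index math.
    """
    if overlap < 0 or margin < 0 or overlap + margin >= chunk_size:
        raise ValueError("need overlap >= 0, margin >= 0, overlap + margin < chunk_size")

    chunks = []
    start = 0
    n = len(text)
    while start < n:
        end = min(start + chunk_size, n)
        if end < n:
            # Search only the window [end - margin, end) for the last separator.
            window_start = max(end - margin, start + 1)
            cut = max(text.rfind(sep, window_start, end) for sep in separators)
            if cut != -1:
                end = cut + 1  # assumption: the separator stays with this chunk
            # If no separator is found, fall back to a static split at `end`.
        chunks.append(text[start:end])
        if end == n:
            break
        start = end - overlap  # the next chunk re-reads the last `overlap` characters
    return chunks
```

Because each chunk starts exactly `overlap` characters before the previous one ended, the original text can be rebuilt by dropping the first `overlap` characters of every chunk after the first. With `margin=0` the sketch degrades to purely static splits.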
Additional parameters are:
From my initial analysis, it looks about 3X faster than the RecursiveCharacterTextSplitter and produces chunk lengths distributed closer to the chunk size. Of course, more tests have to be done, perhaps at a larger scale and with different parameter values, but this is what I have at this point.
Comparisons, performance tests, and chunk size distributions are available in the notebook (see TextSplitterV3 for the implementation; the algorithm description there is outdated, as the output is no longer reversed):
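For anyone who wants to check the "closer to the chunk size" claim on their own data, a small helper along these lines could summarise the chunk lengths produced by any splitter. This is a sketch only: `length_stats` and the `fill_ratio` metric are names made up here, not something taken from the notebook.

```python
import statistics


def length_stats(chunks, chunk_size):
    """Summarise how close chunk lengths are to the target `chunk_size`."""
    lengths = [len(c) for c in chunks]
    mean_len = statistics.mean(lengths)
    return {
        "count": len(lengths),
        "mean": mean_len,
        "median": statistics.median(lengths),
        "max": max(lengths),
        # 1.0 would mean every chunk is exactly chunk_size characters long.
        "fill_ratio": mean_len / chunk_size,
    }
```

Running it on the output of two splitters with the same `chunk_size` gives a quick side-by-side view; a histogram of the lengths would show the full distribution.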
notebook
The question
Could such an algorithm be useful for your users' needs? I will take care of tests, integration with the current codebase, and documentation. I will also implement the other RCTS functionality, such as splitting code.
There are still minor problems, but they should not be hard to solve. For example, to check for a split position at the end of a chunk I currently reverse the string and run a regex on it (yuck). Working on it.
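One possible way to avoid reversing the string, sketched here as an assumption rather than a planned fix, is to let a compiled pattern scan only the margin window using the `pos` and `endpos` arguments of `Pattern.finditer`, and keep the last match. The helper name `last_split_point` is hypothetical.

```python
import re


def last_split_point(text, pattern, window_start, window_end):
    """Return the end index of the last `pattern` match inside the window, or -1."""
    last = None
    # pos/endpos restrict the scan to text[window_start:window_end] without copying.
    for match in pattern.finditer(text, window_start, window_end):
        last = match
    return last.end() if last is not None else -1


# Example: split on runs of whitespace.
whitespace = re.compile(r"\s+")
```

The cost is proportional to the window size rather than the whole text, so with a small margin the forward scan stays cheap. For plain single-character separators, `str.rfind` with start and end arguments does the same job without a regex.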
Feel free to ask questions if some parts are unclear.
Tell me what you think and if I should continue my work to prepare a PR.
First time contributing.