-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Text splitter implementations #77
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add some comments. at least a javadoc or simple comment for each public class.
sometimes, I was also skipping some comments, but let's aim for it.
@@ -0,0 +1,30 @@ | |||
package ai.knowly.langtorch.parser.textsplitter; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ai.knowly.langtorch.preprocessing.splitter.text?
Could you also move the parser into this preprocessing? I feel like we probably have image, csv, etc stuff. Like pdf for pdf chatbot very soon.
|
||
@Test | ||
public void testCharacterTextSplitter_splitByCharactersSplitsNotFoundEasily() { | ||
// Arrange |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: comma at the end.
i.e. // Arrange => // Arrange.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dito for other comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As i found out there's already implemented util methods from Apache commons, i didnt the RecursiveCahracterTextSplitter and the test. I think we probably can import many methods from that library which significantly reduce the code size
abstract public List<String> splitText(String text); | ||
|
||
public List<DomainDocument> createDocuments(List<String> texts, @Nullable List<Map<String, String>> metaDatas) { | ||
List<Map<String, String>> _metadatas; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
java variable name should be in lowerCamelCase
https://google.github.io/styleguide/javaguide.html#s5.2.7-local-variable-names
|
||
abstract public List<String> splitText(String text); | ||
|
||
public List<DomainDocument> createDocuments(List<String> texts, @Nullable List<Map<String, String>> metaDatas) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
create a java object for metaDatas and make it optional(then we can remove @nullable)
int newLinesCount = chunk.split("\n").length - 1; | ||
|
||
Map<String, String> loc; | ||
//todo should we also check what type of object is "loc"? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo should looks like this:
// TODO: Handle null user condition.
} | ||
ArrayList<DomainDocument> documents = new ArrayList<>(); | ||
|
||
for (int i = 0; i < texts.size(); i += 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for(String text: texts)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need the index later on though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the difference between CharacterTextSplitter and RecursiveCharacterTextSplitter? bit lost and should have more comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added comments explaining
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
checkout org.apache.commons:commons-lang3 which contains many util methods we can use here.
https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/StringUtils.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some limits on where we can use this in the recursive character splitter. the recursion causes a stack overflow error
Please also resolve the code smell check here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
review is not yet done but I will continue later today
@@ -116,6 +116,9 @@ dependencies { | |||
testImplementation 'com.squareup.retrofit2:retrofit-mock:2.9.0' | |||
implementation 'com.squareup.okhttp3:logging-interceptor:4.9.2' | |||
implementation "com.squareup.retrofit2:converter-gson:2.9.0" | |||
// Apache commons lang | |||
implementation group: 'org.apache.commons', name: 'commons-lang3', version: '3.0' | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
implementation org.apache.commons:commons-lang3:3.0 ? to make it consistent with other dependency?
|
||
public class CharacterTextSplitter extends TextSplitter { | ||
|
||
public String separator = "\n\n"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private static final String SEPARATOR = ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want to give the developer the option to overwrite the default value via the constructor, I've made it static though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry i didnt notice you are trying to change it later. please remove static then if that's the case.
splits = Arrays.asList(StringUtils.splitByWholeSeparatorPreserveAllTokens(text, this.separator)); | ||
} else { | ||
splits = Arrays.asList(StringUtils.splitByWholeSeparatorPreserveAllTokens(text, "")); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
splits = Arrays.asList(StringUtils.splitByWholeSeparatorPreserveAllTokens(text, this.separator.isEmpty() ? "" : this.separator));
public List<String> splitText(String text) { | ||
List<String> splits; | ||
|
||
if (StringUtils.isNotEmpty(this.separator)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
be careful of this method which is not the same as isEmpty in String class
int total = 0; | ||
|
||
for (String d : splits) { | ||
int _len = d.length(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
len = d.length();
we dont use _ to indicate it's a private variable like in python,
|
||
public class Metadatas { | ||
|
||
private final List<Map<String, String>> values; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use Multimap instead if there's more than one value associated with the same key
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh nice, was not aware of this. It's why before I was thinking we might need to nest a map
int chunkSize = 2; | ||
int chunkOverlap = 4; | ||
|
||
// Act |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Act.
}}) | ||
); | ||
|
||
assertEquals(expectedDocs.size(), docs.size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's use assertThat from google truth library and you can find example in other tests to be consistent
|
||
public int chunkSize; | ||
|
||
public int chunkOverlap; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make these two fina variable if you are not trying to modify it later
|
||
if (total + length + (currentDoc.size() > 0 ? separator.length() : 0) > this.chunkSize) { | ||
if (total > this.chunkSize) { | ||
System.out.println("Created a chunk of size " + total + ", which is longer than the specified " + this.chunkSize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use logger instead.
example: src/main/java/ai/knowly/langtorch/example/SimpleChatBotWithoutExplicitKey.java
// Assert. | ||
List<String> expected = new ArrayList<>(Arrays.asList("a a", "foo", "bar", "baz")); | ||
|
||
Truth.assertThat(Objects.equals(expected, result)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just use assertThat(It's a static function and you can import it directly) and remove Truth.
// Assert. | ||
List<String> expected = new ArrayList<>(Arrays.asList("a a", "foo", "bar", "baz")); | ||
|
||
Truth.assertThat(Objects.equals(expected, result)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assertThat func is not correctly invoked.
assertThat(actual value).isEqualTo(expected value)
|
||
public class CharacterTextSplitter extends TextSplitter { | ||
|
||
public String separator = "\n\n"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry i didnt notice you are trying to change it later. please remove static then if that's the case.
|
||
public final int chunkOverlap; | ||
|
||
public TextSplitter(int chunkSize, int chunkOverlap) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make it protected. Constructors of abstract classes can only be called in constructors of their subclasses
return value; | ||
} | ||
|
||
public static Metadata createEmpty(){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
create() is enough
|
||
ArrayList<DomainDocument> documents = new ArrayList<>(); | ||
|
||
for (int i = 0; i < texts.size(); i += 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are relying on the order to retrieve texts and corresponding metadata based on the assumption that text and docMetadata: 1. it's immutable and 2. order matters in the list.
This is error prone as there's an underlying coupling between these two. why not just create one class that contains both text and metadata so that the input of createDocument will then be a just single list that has no restriction
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes a lot of sense and simplifies the code
public class Metadata { | ||
private final MultiKeyMap<String, String> value; | ||
|
||
public Metadata(MultiKeyMap<String, String> values) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you are doing a fluent style(meaning our API design relies significantly on method chaining) as you did for createEmpty() that returns the class itself, then make it private.
Having a public constructor breaks the "fluent" style.
Why not do Metadata.create().set(a, b).set(c,d);
or if you already having a map, then do Metadata.copyOf(some map)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool, thanks for the feedback. My only concern with the MultiKey is that it requires a minimum of two keys which we don't always need. For now I'm passing an empty string as one of the keys. What do you think?
|
||
abstract public List<String> splitText(String text); | ||
|
||
public List<DomainDocument> createDocuments(List<String> texts, Optional<List<Metadata>> docMetadatas) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we also create a method called createDocument(...) for single text and metadata
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've renamed this method createDocumentsSplitFromList
, have created another method called createDocumentsSplitFromSingle
and a private method addDocumentFromWordChunk
} | ||
|
||
@Nullable | ||
private String joinDocs(List<String> docs, String separator) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make it Optional and remove Nullable.
Hi Ayo, I think we should discuss the implementation first before implementing it. IIUC, CharacterTextSplitter does the split work on a character level. here's one of your tests:
I don't get why the expected result is foo and bar. foo and bar are of size 3 which exceed the chunk size 2. My thought process: Same for
initial string: |
Kudos, SonarCloud Quality Gate passed! |
I am adding my view of the text splitter implementation. Let's talk before resolving this. |
No description provided.