Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Text splitter implementations #77

Merged
merged 16 commits into from
May 30, 2023
Merged

Conversation

AyoTheDev
Copy link
Contributor

No description provided.

Copy link
Owner

@li2109 li2109 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add some comments. at least a javadoc or simple comment for each public class.

sometimes, I was also skipping some comments, but let's aim for it.

@@ -0,0 +1,30 @@
package ai.knowly.langtorch.parser.textsplitter;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ai.knowly.langtorch.preprocessing.splitter.text?

Could you also move the parser into this preprocessing? I feel like we probably have image, csv, etc stuff. Like pdf for pdf chatbot very soon.


@Test
public void testCharacterTextSplitter_splitByCharactersSplitsNotFoundEasily() {
// Arrange
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: comma at the end.
i.e. // Arrange => // Arrange.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dito for other comments.

Copy link
Owner

@li2109 li2109 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As i found out there's already implemented util methods from Apache commons, i didnt the RecursiveCahracterTextSplitter and the test. I think we probably can import many methods from that library which significantly reduce the code size

abstract public List<String> splitText(String text);

public List<DomainDocument> createDocuments(List<String> texts, @Nullable List<Map<String, String>> metaDatas) {
List<Map<String, String>> _metadatas;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


abstract public List<String> splitText(String text);

public List<DomainDocument> createDocuments(List<String> texts, @Nullable List<Map<String, String>> metaDatas) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create a java object for metaDatas and make it optional(then we can remove @nullable)

int newLinesCount = chunk.split("\n").length - 1;

Map<String, String> loc;
//todo should we also check what type of object is "loc"?
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo should looks like this:
// TODO: Handle null user condition.

}
ArrayList<DomainDocument> documents = new ArrayList<>();

for (int i = 0; i < texts.size(); i += 1) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for(String text: texts)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need the index later on though

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the difference between CharacterTextSplitter and RecursiveCharacterTextSplitter? bit lost and should have more comments

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added comments explaining

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checkout org.apache.commons:commons-lang3 which contains many util methods we can use here.

https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/StringUtils.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some limits on where we can use this in the recursive character splitter. the recursion causes a stack overflow error

@li2109
Copy link
Owner

li2109 commented May 25, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs Vulnerability A 0 Vulnerabilities Security Hotspot A 0 Security Hotspots Code Smell A 12 Code Smells

0.0% 0.0% Coverage 0.0% 0.0% Duplication

Please also resolve the code smell check here.

Copy link
Owner

@li2109 li2109 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

review is not yet done but I will continue later today

@@ -116,6 +116,9 @@ dependencies {
testImplementation 'com.squareup.retrofit2:retrofit-mock:2.9.0'
implementation 'com.squareup.okhttp3:logging-interceptor:4.9.2'
implementation "com.squareup.retrofit2:converter-gson:2.9.0"
// Apache commons lang
implementation group: 'org.apache.commons', name: 'commons-lang3', version: '3.0'

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

implementation org.apache.commons:commons-lang3:3.0 ? to make it consistent with other dependency?


public class CharacterTextSplitter extends TextSplitter {

public String separator = "\n\n";
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

private static final String SEPARATOR = ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to give the developer the option to overwrite the default value via the constructor, I've made it static though

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry i didnt notice you are trying to change it later. please remove static then if that's the case.

splits = Arrays.asList(StringUtils.splitByWholeSeparatorPreserveAllTokens(text, this.separator));
} else {
splits = Arrays.asList(StringUtils.splitByWholeSeparatorPreserveAllTokens(text, ""));
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

splits = Arrays.asList(StringUtils.splitByWholeSeparatorPreserveAllTokens(text, this.separator.isEmpty() ? "" : this.separator));

public List<String> splitText(String text) {
List<String> splits;

if (StringUtils.isNotEmpty(this.separator)) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

be careful of this method which is not the same as isEmpty in String class

int total = 0;

for (String d : splits) {
int _len = d.length();
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

len = d.length();

we dont use _ to indicate it's a private variable like in python,


public class Metadatas {

private final List<Map<String, String>> values;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use Multimap instead if there's more than one value associated with the same key

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh nice, was not aware of this. It's why before I was thinking we might need to nest a map

int chunkSize = 2;
int chunkOverlap = 4;

// Act
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Act.

}})
);

assertEquals(expectedDocs.size(), docs.size());
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's use assertThat from google truth library and you can find example in other tests to be consistent


public int chunkSize;

public int chunkOverlap;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make these two fina variable if you are not trying to modify it later


if (total + length + (currentDoc.size() > 0 ? separator.length() : 0) > this.chunkSize) {
if (total > this.chunkSize) {
System.out.println("Created a chunk of size " + total + ", which is longer than the specified " + this.chunkSize);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use logger instead.

example: src/main/java/ai/knowly/langtorch/example/SimpleChatBotWithoutExplicitKey.java

// Assert.
List<String> expected = new ArrayList<>(Arrays.asList("a a", "foo", "bar", "baz"));

Truth.assertThat(Objects.equals(expected, result));
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just use assertThat(It's a static function and you can import it directly) and remove Truth.

// Assert.
List<String> expected = new ArrayList<>(Arrays.asList("a a", "foo", "bar", "baz"));

Truth.assertThat(Objects.equals(expected, result));
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assertThat func is not correctly invoked.
assertThat(actual value).isEqualTo(expected value)


public class CharacterTextSplitter extends TextSplitter {

public String separator = "\n\n";
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry i didnt notice you are trying to change it later. please remove static then if that's the case.


public final int chunkOverlap;

public TextSplitter(int chunkSize, int chunkOverlap) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make it protected. Constructors of abstract classes can only be called in constructors of their subclasses

return value;
}

public static Metadata createEmpty(){
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create() is enough


ArrayList<DomainDocument> documents = new ArrayList<>();

for (int i = 0; i < texts.size(); i += 1) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are relying on the order to retrieve texts and corresponding metadata based on the assumption that text and docMetadata: 1. it's immutable and 2. order matters in the list.

This is error prone as there's an underlying coupling between these two. why not just create one class that contains both text and metadata so that the input of createDocument will then be a just single list that has no restriction

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes a lot of sense and simplifies the code

public class Metadata {
private final MultiKeyMap<String, String> value;

public Metadata(MultiKeyMap<String, String> values) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you are doing a fluent style(meaning our API design relies significantly on method chaining) as you did for createEmpty() that returns the class itself, then make it private.

Having a public constructor breaks the "fluent" style.
Why not do Metadata.create().set(a, b).set(c,d);
or if you already having a map, then do Metadata.copyOf(some map)

Copy link
Contributor Author

@AyoTheDev AyoTheDev May 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool, thanks for the feedback. My only concern with the MultiKey is that it requires a minimum of two keys which we don't always need. For now I'm passing an empty string as one of the keys. What do you think?


abstract public List<String> splitText(String text);

public List<DomainDocument> createDocuments(List<String> texts, Optional<List<Metadata>> docMetadatas) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we also create a method called createDocument(...) for single text and metadata

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've renamed this method createDocumentsSplitFromList, have created another method called createDocumentsSplitFromSingle and a private method addDocumentFromWordChunk

}

@Nullable
private String joinDocs(List<String> docs, String separator) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make it Optional and remove Nullable.

@li2109
Copy link
Owner

li2109 commented May 27, 2023

Hi Ayo, I think we should discuss the implementation first before implementing it.

IIUC, CharacterTextSplitter does the split work on a character level.

here's one of your tests:

  @Test
  public void testCharacterTextSplitter_splitByCharacterCountWithNoEmptyDocuments() {
    // Arrange.
    String text = "foo bar";
    CharacterTextSplitter splitter = new CharacterTextSplitter(" ", 2, 0);

    // Act.
    List<String> result = splitter.splitText(text);

    // Assert.
    List<String> expected = new ArrayList<>(Arrays.asList("foo", "bar"));

    Truth.assertThat(result).containsExactlyElementsIn(expected);
  }

I don't get why the expected result is foo and bar. foo and bar are of size 3 which exceed the chunk size 2.

My thought process:
initial: "foo bar"
after separator processing: it becomes foo and bar
with chunk size = 2, and no overlap, it should be fo, ob, and ar
why the result is foo and bar?

Same for

  @Test
  public void testCharacterTextSplitter_splitByCharacterCountLongWords() {
    // Arrange.
    String text = "foo bar baz a a";

    CharacterTextSplitter splitter = new CharacterTextSplitter(" ", 3, 1);

    // Act.
    List<String> result = splitter.splitText(text);

    // Assert.
    List<String> expected = new ArrayList<>(Arrays.asList("foo", "bar", "baz", "a a"));

    Truth.assertThat(Objects.equals(expected, result));
  }

initial string: foo bar baz a a
after separator processing: foo, bar, baz, a, a
with chunk size = 3 and overlap = 1,
it should be foo, oba, arb, baz, zaa

@sonarcloud
Copy link

sonarcloud bot commented May 30, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 7 Code Smells

0.0% 0.0% Coverage
0.0% 0.0% Duplication

@li2109
Copy link
Owner

li2109 commented May 30, 2023

I am adding my view of the text splitter implementation. Let's talk before resolving this.

@li2109 li2109 merged commit 99505de into li2109:master May 30, 2023
@AyoTheDev AyoTheDev deleted the text-splitter branch June 3, 2023 18:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants