use content chunking for indexed items #318
Conversation
Actionable comments posted: 1
🧹 Nitpick comments (5)
app/resources/datagraph/content.go (4)
`305-306`: **Add documentation for the constant.** Consider adding a comment explaining the rationale behind the 350-character limit and any research or empirical data that supports this choice.
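For illustration, the suggested documentation might look like this (wording is ours, not the PR author's):

```go
// roughMaxSentenceSize is the rough upper bound, in characters, for a single
// indexed chunk. Text longer than this is split at sentence boundaries, or at
// spaces as a fallback, before being embedded.
const roughMaxSentenceSize = 350
```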
`350-350`: **Remove or implement the TODO comment.** The TODO comment suggests incomplete implementation, but the code below appears to be implemented. Either remove the comment or clarify what additional splitting logic is needed.
`360-410`: **Consider renaming the function for clarity.** The function name `splitearly` is not immediately clear about its purpose. Consider a more descriptive name like `splitTextAtBoundaries` or `splitTextWithFallback`.
`412-425`: **LGTM! Consider adding documentation.** The implementation is efficient, using `strings.Builder`. Consider adding documentation to explain the recursive approach for future maintainers.

app/resources/datagraph/content_test.go (1)
`129-149`: **Add test cases for edge cases.** While the basic test is good, consider adding test cases for the following (see the sketch after this list):
- Empty content
- Content with only non-text nodes
- Content with nested block elements
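A hedged sketch of those cases as a table-driven test; `datagraph.NewRichText` and the expected chunk counts are assumptions based on the splitting logic, not verified behavior:

```go
func TestContentSplitEdgeCases(t *testing.T) {
	for _, tc := range []struct {
		name string
		html string
		want int // expected number of chunks (assumed)
	}{
		{name: "empty content", html: "", want: 0},
		{name: "only non-text nodes", html: `<hr><img src="x.png">`, want: 0},
		{name: "nested block elements", html: `<blockquote><p>inner paragraph</p></blockquote>`, want: 1},
	} {
		t.Run(tc.name, func(t *testing.T) {
			a := assert.New(t)

			// NewRichText is assumed to parse an HTML string into a Content.
			c, err := datagraph.NewRichText(tc.html)
			a.NoError(err)

			a.Len(c.Split(), tc.want)
		})
	}
}
```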
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- app/resources/datagraph/content.go (1 hunks)
- app/resources/datagraph/content_test.go (1 hunks)
🔇 Additional comments (1)
app/resources/datagraph/content_test.go (1)
`151-181`: **LGTM! Comprehensive test for long content.**
The test thoroughly verifies the splitting behavior with a complex HTML structure and various content types.
app/resources/datagraph/content.go
Outdated
```go
func (c Content) Split() []string {
	r := []html.Node{}

	// first, walk the tree for the top-most block-content nodes.
	var walk func(n *html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode {
			switch n.DataAtom {
			case
				atom.H1,
				atom.H2,
				atom.H3,
				atom.H4,
				atom.H5,
				atom.H6,
				atom.Blockquote,
				atom.Pre,
				atom.P:
				r = append(r, *n)
				// once split, exit out of this branch
				return
			}
		}

		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(c.html)

	// now, iterate these top level nodes and split any that are "too big"
	chunks := chunksFromNodes(r, roughMaxSentenceSize)

	return chunks
}
```
**Add nil check for HTML tree.**

The `Split()` method should handle the case where `c.html` is nil to prevent a potential panic.

```diff
 func (c Content) Split() []string {
+	if c.html == nil {
+		return nil
+	}
 	r := []html.Node{}
```
Actionable comments posted: 2
🧹 Nitpick comments (8)
app/services/semdex/semdexer/weaviate_semdexer/indexer.go (1)
`33-38`: **indexChunk method consistency.** The logic to derive the chunk ID and check existence is well-structured. Just ensure that the chunk ID generation remains consistent across the entire system (e.g., in delete flows and other references).

app/resources/datagraph/content.go (1)
`353-367`: **chunksFromNodes.** This function handles large text elegantly by delegating to `splitearly()`. One suggestion: handle extremely large strings with care for potential memory usage, but for typical use cases this is fine.

app/services/semdex/semdexer/weaviate_semdexer/recommend.go (1)
`37-43`: **Single-chunk selection.** Selecting just `chunkIDs[0]` is a short-term solution for demonstration. If you plan to handle multiple chunks, consider how to combine or average vectors for better recommendations, as sketched below.
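One hedged starting point is to average the chunk vectors into a single query vector; a minimal sketch (plain helper, no Weaviate API assumed):

```go
// meanVector averages a set of equal-length embedding vectors into a single
// centroid vector, a naive way to represent multi-chunk content as one point.
func meanVector(vectors [][]float32) []float32 {
	if len(vectors) == 0 {
		return nil
	}

	mean := make([]float32, len(vectors[0]))
	for _, v := range vectors {
		for i, x := range v {
			mean[i] += x
		}
	}
	for i := range mean {
		mean[i] /= float32(len(vectors))
	}

	return mean
}
```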
app/services/semdex/semdexer/weaviate_semdexer/mapping.go (2)

`99-105`: **Add return value documentation.** Consider adding a comment explaining the deterministic nature of the generated UUID and its uniqueness guarantees.

```diff
 func generateChunkID(id xid.ID, chunk string) uuid.UUID {
+	// Returns a deterministic UUID generated from the combination of object ID
+	// and chunk content. The UUID is unique for each unique (id, chunk) pair.
 	// We don't currently support sharing chunks across content nodes, so append
 	// the object's ID to the chunk's hash, to ensure it's unique to the object.
 	payload := []byte(append(id.Bytes(), chunk...))
```
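A quick test-style illustration of the determinism property (testify assumed, matching the project's test style):

```go
func TestGenerateChunkIDDeterministic(t *testing.T) {
	id := xid.New()

	a := generateChunkID(id, "some chunk of text")
	b := generateChunkID(id, "some chunk of text")

	// Same (id, chunk) pair yields the same UUID, so re-indexing is idempotent.
	assert.Equal(t, a, b)

	// A different chunk yields a different UUID.
	c := generateChunkID(id, "a different chunk")
	assert.NotEqual(t, a, c)
}
```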
`107-115`: **Reduce code duplication with generateChunkID.** The UUID generation logic is duplicated. Consider refactoring to reuse the existing `generateChunkID` function.

```diff
 func chunkIDsFor(id xid.ID) func(chunk string) uuid.UUID {
 	return func(chunk string) uuid.UUID {
-		// We don't currently support sharing chunks across content nodes, so append
-		// the object's ID to the chunk's hash, to ensure it's unique to the object.
-		payload := []byte(append(id.Bytes(), chunk...))
-
-		return uuid.NewHash(fnv.New128(), uuid.NameSpaceOID, payload, 4)
+		return generateChunkID(id, chunk)
 	}
 }
```

app/resources/datagraph/content_test.go (3)
`129-149`: **Enhance test coverage with specific assertions.** The test only verifies the number of segments but doesn't validate their content. Consider adding assertions for the expected content of each segment.

```diff
 ps := c.Split()

 a.Len(ps, 5)
+a.Equal("Here's a paragraph. It's pretty neat.", ps[0])
+a.Equal("Here's the rest of the text.", ps[1])
+a.Equal("neat photo right?", ps[2])
+a.Equal("This is quite a long post, the summary, should just be the first 128 characters rounded down to the nearest space.", ps[3])
```
`151-163`: **Add content verification to minimal test case.** The test verifies the length but not the content. Consider adding an assertion to verify the correct handling of HTML entities.

```diff
 ps := c.Split()

 a.Len(ps, 1)
+a.Equal("I've tried everything, but for some reason it seems impossible to find datasets that includes a simple list of the councils in England sorted by their group, and a list of covid cases also sorted by councils. I'm not British so it may be a lack of knowledge of how their government sites work. Anyone know a place to find these?", ps[0])
```
`165-195`: **Consider using neutral test data.** While the test coverage is thorough, the test data contains potentially controversial political content. Consider using more neutral Lorem Ipsum text or technical documentation for test cases.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
- app/resources/datagraph/content.go (1 hunks)
- app/resources/datagraph/content_test.go (1 hunks)
- app/services/collection/collection_read/reader.go (0 hunks)
- app/services/library/node_read/reader.go (0 hunks)
- app/services/semdex/semdex.go (0 hunks)
- app/services/semdex/semdexer/semdexer.go (0 hunks)
- app/services/semdex/semdexer/weaviate_semdexer/delete.go (1 hunks)
- app/services/semdex/semdexer/weaviate_semdexer/indexer.go (3 hunks)
- app/services/semdex/semdexer/weaviate_semdexer/indexer_test.go (0 hunks)
- app/services/semdex/semdexer/weaviate_semdexer/mapping.go (3 hunks)
- app/services/semdex/semdexer/weaviate_semdexer/recommend.go (2 hunks)
- app/services/semdex/semdexer/weaviate_semdexer/relevance.go (0 hunks)
- app/services/semdex/semdexer/weaviate_semdexer/search.go (3 hunks)
- internal/infrastructure/weaviate/weaviate.go (1 hunks)
💤 Files with no reviewable changes (6)
- app/services/semdex/semdexer/semdexer.go
- app/services/semdex/semdex.go
- app/services/semdex/semdexer/weaviate_semdexer/indexer_test.go
- app/services/library/node_read/reader.go
- app/services/semdex/semdexer/weaviate_semdexer/relevance.go
- app/services/collection/collection_read/reader.go
🔇 Additional comments (18)
app/services/semdex/semdexer/weaviate_semdexer/indexer.go (7)
`17-21`: **Ensure content is non-empty before indexing.** The code checks for empty chunks, which is good for avoiding unnecessary indexing. Consider adding logs or metrics to monitor how often empty content occurs if you need deeper visibility.
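A minimal sketch of that visibility using the stdlib `log/slog`; the accessor names on `object` are assumptions for this sketch, and the project's actual logger may differ:

```go
chunks := object.GetContent().Split()
if len(chunks) == 0 {
	// Record how often content produces no indexable chunks.
	slog.Debug("skipping semdex indexing: content produced no chunks",
		slog.String("datagraph_id", object.GetID().String()),
	)
	return nil
}
```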
`23-28`: **Indexing each chunk.** Looping through each chunk is a solid approach to splitting content for fine-grained indexing. Ensure downstream components handle multiple documents per resource properly, as the PR describes a shift to multiple vector index documents per resource.
`30-31`: **Early return on success.** Returning nil after successful indexing is concise and keeps the control flow easy to follow.
`43-47`: **Properties map.** Using a map for dynamic properties is flexible. However, you might want to consider validating or trimming the "content" value if Weaviate imposes a size limit.
`50-51`: **Avoid redundant updates.** The check that compares existing properties before updating avoids unnecessary writes. This is a good performance optimization.
`60-66`: **Creator vs. Updater.** Using Updater() when `exists` is true and Creator() otherwise is correct. Confirm that partial property updates in Weaviate do not overwrite any non-supplied properties (see the sketch below).
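One way to get merge semantics explicitly is the client's merge-style update; a sketch assuming the weaviate-go-client v4 API, where `WithMerge()` issues a PATCH so non-supplied properties are preserved rather than replaced:

```go
// Partial update: only keys present in props are touched (sketch only).
err := client.Data().Updater().
	WithClassName(className).
	WithID(chunkID.String()).
	WithProperties(props).
	WithMerge(). // PATCH semantics: merge instead of replace
	Do(ctx)
```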
`85-108`: **existsByContent function.** The function effectively encapsulates the retrieval logic for checking whether an object already exists. Good approach! One recommendation: add more robust error checks for non-404 status codes if you need granular error handling.
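A sketch of more granular handling, assuming the weaviate-go-client v4 `fault.WeaviateClientError` type: only a 404 should be read as "does not exist", while anything else should surface as an error rather than being treated as absence:

```go
objs, err := client.Data().ObjectsGetter().
	WithClassName(className).
	WithID(chunkID.String()).
	Do(ctx)
if err != nil {
	var clientErr *fault.WeaviateClientError
	if errors.As(err, &clientErr) && clientErr.StatusCode == http.StatusNotFound {
		return false, nil // genuinely absent
	}
	return false, err // auth/network/5xx: not the same as "absent"
}

return len(objs) > 0, nil
```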
app/resources/datagraph/content.go (4)
`304-307`: **Constant for maximum sentence size.** Defining a clear constant (`roughMaxSentenceSize = 350`) is helpful, and a reasonable upper bound for typical sentences.
`308-351`: **Consider nil check on c.html.** As noted in a previous comment, a nil check would prevent a potential panic if `c.html` is unexpectedly nil.
`369-419`: **splitearly splitting logic.** The recursive approach is effective for chunking text at sentence boundaries, falling back to spaces. Consider more robust boundary checks or language-specific libraries if you need better sentence detection in future improvements.
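For readers following along, a self-contained sketch of the boundary-first splitting idea described above (not the PR's exact implementation):

```go
// splitAtBoundary breaks s into chunks of at most max characters, preferring
// sentence terminators, then spaces, then a hard cut when neither is found.
func splitAtBoundary(s string, max int) []string {
	if len(s) <= max {
		return []string{s}
	}

	window := s[:max]
	cut := strings.LastIndexAny(window, ".!?")
	if cut == -1 {
		cut = strings.LastIndex(window, " ")
	}
	if cut <= 0 {
		cut = max - 1 // no boundary found: hard cut
	}

	head := strings.TrimSpace(s[:cut+1])
	rest := strings.TrimSpace(s[cut+1:])
	if rest == "" {
		return []string{head}
	}

	return append([]string{head}, splitAtBoundary(rest, max)...)
}
```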
`421-434`: **textfromnode extraction.** Straightforward text extraction. If performance becomes an issue, you might look into specialized HTML text-extraction libraries; for now, this serves well.
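For reference, the extraction pattern is a recursive walk that appends text nodes into a `strings.Builder`; a minimal sketch approximating it (not the exact code from the PR):

```go
// textFrom collects the concatenated text content of an HTML node subtree.
func textFrom(n *html.Node) string {
	b := strings.Builder{}

	var walk func(n *html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.TextNode {
			b.WriteString(n.Data)
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(n)

	return b.String()
}
```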
app/services/semdex/semdexer/weaviate_semdexer/delete.go (1)
13-20
: Batch deletion approach
Utilizing the batch deleter with a filter on "datagraph_id" is more flexible than single-object deletion. This also sets the foundation for multi-chunk deletions if needed.
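A sketch of the filtered batch delete, assuming the weaviate-go-client v4 filters API:

```go
// Delete every chunk object belonging to this resource in one call (sketch).
resp, err := client.Batch().ObjectsBatchDeleter().
	WithClassName(className).
	WithWhere(filters.Where().
		WithPath([]string{"datagraph_id"}).
		WithOperator(filters.Equal).
		WithValueText(id.String())).
	Do(ctx)
if err != nil {
	return err
}
_ = resp // resp.Results reports how many objects matched and were removed
```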
app/services/semdex/semdexer/weaviate_semdexer/recommend.go (2)
`32-35`: **Handle empty chunk IDs.** Returning nil early when `chunkIDs` is empty helps prevent unnecessary queries. This is a clean, safe check.
`55-55`: **Increased certainty.** Raising the certainty threshold to 0.7 filters out less-confident recommendations. Make sure this threshold aligns with your desired user experience or business requirements.
app/services/semdex/semdexer/weaviate_semdexer/mapping.go (2)
Line range hint `61-66`: **LGTM: Score parsing enhancement.** The additional condition properly handles cases where the score is provided as a string instead of a distance, with appropriate error handling.
`121-127`: **LGTM: Efficient slice conversion.** The function correctly pre-allocates the result slice and performs the conversion efficiently.
app/services/semdex/semdexer/weaviate_semdexer/search.go (2)
`10-11`: **LGTM: Appropriate imports added for the new functionality.** The new imports (`xid` for ID handling and `lo` for functional operations) align well with the implementation requirements.
`110-112`: **Verify pagination behavior with deduplication.** The deduplication step is correctly placed before pagination. However, verify that the total count used for pagination remains accurate after deduplication.
```go
Model:      "text-embedding-3-small",
Dimensions: "3072",
Type:       "text",
},
```
💡 Codebase verification

There appears to be an inconsistency in the embedding model configuration across the codebase:
- internal/infrastructure/weaviate/weaviate.go uses `text-embedding-3-small`
- app/services/semdex/semdexer/chromem_semdexer/chromem.go uses `EmbeddingModelOpenAI3Large`

This mismatch could lead to inconsistent embedding dimensions and potential compatibility issues between different parts of the system.
- Consider aligning the embedding model choice across all components to ensure consistent behavior
- Update the `chromem_semdexer` to use the same model as Weaviate, or document why different models are needed
🔗 Analysis chain

LGTM! Verify OpenAI model compatibility.

The switch to `text-embedding-3-small` is a cost-effective choice that aligns well with the PR's content splitting strategy. One caveat: `text-embedding-3-small` produces 1536-dimensional embeddings, while 3072 is the output size of `text-embedding-3-large`, so the `Dimensions: "3072"` setting should be double-checked against the chosen model.

Let's verify the model configuration.

Consider adding a comment explaining the rationale for using the small model, such as:
```diff
 ModuleConfig: map[string]ModuleConfig{
 	"text2vec-openai": {
+		// Using the small model for better cost-efficiency while maintaining
+		// good performance with chunked content
 		Model:      "text-embedding-3-small",
 		Dimensions: "3072",
 		Type:       "text",
 	},
```
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
```bash
#!/bin/bash
# Description: Verify OpenAI model configuration across the codebase

# Check for any hardcoded references to the old model
rg "text-embedding-3-large"

# Check for any configuration files that might override these settings
fd -e yaml -e json -e toml | xargs rg "text-embedding-3"
```

Length of output: 84
Script:
```bash
#!/bin/bash
# Let's try a broader search for model configurations and OpenAI-related settings

# Search for any OpenAI model configurations
rg -i "openai.*model|model.*openai" -A 3 -B 3

# Search for embedding-related configurations
rg -i "embedding.*model|model.*embedding" -A 3 -B 3

# Search for dimension configurations that might be related
rg "Dimensions.*3072|3072.*dimension" -A 3 -B 3

# Look for configuration files specifically
fd "config|settings" -e yaml -e json -e toml | xargs cat
```

Length of output: 5424
```go
func dedupeChunks(results []*datagraph.Ref) []*datagraph.Ref {
	groupedByID := lo.GroupBy(results, func(r *datagraph.Ref) xid.ID { return r.ID })

	// for each grouped result, compute the average score and flatten
	// the list of results into a single result per ID
	// this is a naive approach to deduplication

	list := lo.Values(groupedByID)

	deduped := dt.Reduce(list, func(acc []*datagraph.Ref, curr []*datagraph.Ref) []*datagraph.Ref {
		first := curr[0]
		score := 0.0

		for _, r := range curr {
			score += r.Relevance
		}

		next := &datagraph.Ref{
			ID:        first.ID,
			Kind:      first.Kind,
			Relevance: score / float64(len(curr)),
		}

		return append(acc, next)
	}, []*datagraph.Ref{})

	return deduped
}
```
Critical issues in deduplication implementation
The deduplication logic has several potential issues:
- It assumes all chunks from the same ID have identical Kind
- It drops potentially important fields from the Ref struct
- Results might need re-sorting by relevance after averaging
- No explicit handling of empty input
Consider this improved implementation, shown as a full function. It handles empty input, copies all fields from the most relevant chunk so nothing is silently dropped, assumes chunks grouped under one ID share the same Kind (documented in a comment, since the single-value signature cannot return an error), and re-sorts by averaged relevance using the stdlib `sort` package:

```go
func dedupeChunks(results []*datagraph.Ref) []*datagraph.Ref {
	if len(results) == 0 {
		return results
	}

	groupedByID := lo.GroupBy(results, func(r *datagraph.Ref) xid.ID { return r.ID })

	deduped := make([]*datagraph.Ref, 0, len(groupedByID))

	for _, chunks := range groupedByID {
		// Calculate the average relevance across all chunks for this ID.
		totalScore := 0.0
		for _, c := range chunks {
			totalScore += c.Relevance
		}

		// Copy all fields from the most relevant chunk so no fields are
		// dropped. Chunks for the same ID are assumed to share a Kind.
		mostRelevant := lo.MaxBy(chunks, func(a, b *datagraph.Ref) bool {
			return a.Relevance > b.Relevance
		})

		merged := *mostRelevant
		merged.Relevance = totalScore / float64(len(chunks))

		deduped = append(deduped, &merged)
	}

	// Re-sort by averaged relevance, highest first, since averaging can
	// change the ordering of results.
	sort.Slice(deduped, func(i, j int) bool {
		return deduped[i].Relevance > deduped[j].Relevance
	})

	return deduped
}
```
Committable suggestion skipped: line range outside the PR's diff.
Force-pushed from b18f511 to 0536943 (Compare)
Embeddings work well when individual sentences are indexed instead of entire documents. Currently, the embedding provider will trim at 1k (I think) characters before embedding, which can lose important context and content when indexing.

Eventually, the goal is to split large content pieces into either sentences or, worst case, spans of text bounded by spaces or at pre-set lengths (if no sentence boundaries or spaces can be found).

This commit includes a basic algorithm which operates on `datagraph.Content` to produce a slice of strings representing (roughly) each paragraph of the text. Some extra logic ensures edge cases such as long spans of contiguous characters or long sentences with no punctuation are handled.

The rest of this PR will focus on using the chunked text spans in indexing, which will require some internal changes to how vectoring works. Currently, there's an assumption that resources map 1:1 to vector index documents. After this change, resources will map 1:n, where n is the number of text chunks from the original content.