use content chunking for indexed items #318
Conversation
Actionable comments posted: 1
🧹 Nitpick comments (5)
app/resources/datagraph/content.go (4)
`305-306`: **Add documentation for the constant.** Consider adding a comment explaining the rationale behind the 350-character limit and any research or empirical data that supports this choice.
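For illustration, the suggested documentation might look like this (wording is ours, not the PR author's):

```go
// roughMaxSentenceSize is the rough upper bound, in characters, for a single
// indexed chunk. Text longer than this is split at sentence boundaries, or at
// spaces as a fallback, before being embedded.
const roughMaxSentenceSize = 350
```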
`350-350`: **Remove or implement the TODO comment.** The TODO comment suggests incomplete implementation, but the code below appears to be implemented. Either remove the comment or clarify what additional splitting logic is needed.
`360-410`: **Consider renaming the function for clarity.** The function name `splitearly` is not immediately clear about its purpose. Consider a more descriptive name like `splitTextAtBoundaries` or `splitTextWithFallback`.
`412-425`: **LGTM! Consider adding documentation.** The implementation is efficient, using `strings.Builder`. Consider adding documentation to explain the recursive approach for future maintainers.

app/resources/datagraph/content_test.go (1)
`129-149`: **Add test cases for edge cases.** While the basic test is good, consider adding test cases for the following (see the sketch after this list):
- Empty content
- Content with only non-text nodes
- Content with nested block elements
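A hedged sketch of those cases as a table-driven test; `datagraph.NewRichText` and the expected chunk counts are assumptions based on the splitting logic, not verified behavior:

```go
func TestContentSplitEdgeCases(t *testing.T) {
	for _, tc := range []struct {
		name string
		html string
		want int // expected number of chunks (assumed)
	}{
		{name: "empty content", html: "", want: 0},
		{name: "only non-text nodes", html: `<hr><img src="x.png">`, want: 0},
		{name: "nested block elements", html: `<blockquote><p>inner paragraph</p></blockquote>`, want: 1},
	} {
		t.Run(tc.name, func(t *testing.T) {
			a := assert.New(t)

			// NewRichText is assumed to parse an HTML string into a Content.
			c, err := datagraph.NewRichText(tc.html)
			a.NoError(err)

			a.Len(c.Split(), tc.want)
		})
	}
}
```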
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- app/resources/datagraph/content.go (1 hunks)
- app/resources/datagraph/content_test.go (1 hunks)
🔇 Additional comments (1)
app/resources/datagraph/content_test.go (1)
`151-181`: **LGTM! Comprehensive test for long content.**
The test thoroughly verifies the splitting behavior with a complex HTML structure and various content types.
app/resources/datagraph/content.go
Outdated
```go
func (c Content) Split() []string {
	r := []html.Node{}

	// first, walk the tree for the top-most block-content nodes.
	var walk func(n *html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode {
			switch n.DataAtom {
			case
				atom.H1,
				atom.H2,
				atom.H3,
				atom.H4,
				atom.H5,
				atom.H6,
				atom.Blockquote,
				atom.Pre,
				atom.P:
				r = append(r, *n)
				// once split, exit out of this branch
				return
			}
		}

		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(c.html)

	// now, iterate these top level nodes and split any that are "too big"
	chunks := chunksFromNodes(r, roughMaxSentenceSize)

	return chunks
}
```
**Add nil check for HTML tree.**

The `Split()` method should handle the case where `c.html` is nil to prevent a potential panic.

```diff
 func (c Content) Split() []string {
+	if c.html == nil {
+		return nil
+	}
 	r := []html.Node{}
```
Actionable comments posted: 2
🧹 Nitpick comments (8)
app/services/semdex/semdexer/weaviate_semdexer/indexer.go (1)
`33-38`: **indexChunk method consistency.** The logic to derive the chunk ID and check existence is well-structured. Just ensure that the chunk ID generation remains consistent across the entire system (e.g., in delete flows and other references).

app/resources/datagraph/content.go (1)
`353-367`: **chunksFromNodes.** This function handles large text elegantly by delegating to `splitearly()`. One suggestion: handle extremely large strings with care for potential memory usage, but for typical use cases this is fine.

app/services/semdex/semdexer/weaviate_semdexer/recommend.go (1)
`37-43`: **Single-chunk selection.** Selecting just `chunkIDs[0]` is a short-term solution for demonstration. If you plan to handle multiple chunks, consider how to combine or average vectors for better recommendations, as sketched below.
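One hedged starting point is to average the chunk vectors into a single query vector; a minimal sketch (plain helper, no Weaviate API assumed):

```go
// meanVector averages a set of equal-length embedding vectors into a single
// centroid vector, a naive way to represent multi-chunk content as one point.
func meanVector(vectors [][]float32) []float32 {
	if len(vectors) == 0 {
		return nil
	}

	mean := make([]float32, len(vectors[0]))
	for _, v := range vectors {
		for i, x := range v {
			mean[i] += x
		}
	}
	for i := range mean {
		mean[i] /= float32(len(vectors))
	}

	return mean
}
```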
app/services/semdex/semdexer/weaviate_semdexer/mapping.go (2)

`99-105`: **Add return value documentation.** Consider adding a comment explaining the deterministic nature of the generated UUID and its uniqueness guarantees.

```diff
 func generateChunkID(id xid.ID, chunk string) uuid.UUID {
+	// Returns a deterministic UUID generated from the combination of object ID
+	// and chunk content. The UUID is unique for each unique (id, chunk) pair.
 	// We don't currently support sharing chunks across content nodes, so append
 	// the object's ID to the chunk's hash, to ensure it's unique to the object.
 	payload := []byte(append(id.Bytes(), chunk...))
```
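A quick test-style illustration of the determinism property (testify assumed, matching the project's test style):

```go
func TestGenerateChunkIDDeterministic(t *testing.T) {
	id := xid.New()

	a := generateChunkID(id, "some chunk of text")
	b := generateChunkID(id, "some chunk of text")

	// Same (id, chunk) pair yields the same UUID, so re-indexing is idempotent.
	assert.Equal(t, a, b)

	// A different chunk yields a different UUID.
	c := generateChunkID(id, "a different chunk")
	assert.NotEqual(t, a, c)
}
```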
`107-115`: **Reduce code duplication with generateChunkID.** The UUID generation logic is duplicated. Consider refactoring to reuse the existing `generateChunkID` function.

```diff
 func chunkIDsFor(id xid.ID) func(chunk string) uuid.UUID {
 	return func(chunk string) uuid.UUID {
-		// We don't currently support sharing chunks across content nodes, so append
-		// the object's ID to the chunk's hash, to ensure it's unique to the object.
-		payload := []byte(append(id.Bytes(), chunk...))
-
-		return uuid.NewHash(fnv.New128(), uuid.NameSpaceOID, payload, 4)
+		return generateChunkID(id, chunk)
 	}
 }
```

app/resources/datagraph/content_test.go (3)
`129-149`: **Enhance test coverage with specific assertions.** The test only verifies the number of segments but doesn't validate their content. Consider adding assertions for the expected content of each segment.

```diff
 ps := c.Split()

 a.Len(ps, 5)
+a.Equal("Here's a paragraph. It's pretty neat.", ps[0])
+a.Equal("Here's the rest of the text.", ps[1])
+a.Equal("neat photo right?", ps[2])
+a.Equal("This is quite a long post, the summary, should just be the first 128 characters rounded down to the nearest space.", ps[3])
```
`151-163`: **Add content verification to minimal test case.** The test verifies the length but not the content. Consider adding an assertion to verify the correct handling of HTML entities.

```diff
 ps := c.Split()

 a.Len(ps, 1)
+a.Equal("I've tried everything, but for some reason it seems impossible to find datasets that includes a simple list of the councils in England sorted by their group, and a list of covid cases also sorted by councils. I'm not British so it may be a lack of knowledge of how their government sites work. Anyone know a place to find these?", ps[0])
```
`165-195`: **Consider using neutral test data.** While the test coverage is thorough, the test data contains potentially controversial political content. Consider using more neutral Lorem Ipsum text or technical documentation for test cases.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
- app/resources/datagraph/content.go (1 hunks)
- app/resources/datagraph/content_test.go (1 hunks)
- app/services/collection/collection_read/reader.go (0 hunks)
- app/services/library/node_read/reader.go (0 hunks)
- app/services/semdex/semdex.go (0 hunks)
- app/services/semdex/semdexer/semdexer.go (0 hunks)
- app/services/semdex/semdexer/weaviate_semdexer/delete.go (1 hunks)
- app/services/semdex/semdexer/weaviate_semdexer/indexer.go (3 hunks)
- app/services/semdex/semdexer/weaviate_semdexer/indexer_test.go (0 hunks)
- app/services/semdex/semdexer/weaviate_semdexer/mapping.go (3 hunks)
- app/services/semdex/semdexer/weaviate_semdexer/recommend.go (2 hunks)
- app/services/semdex/semdexer/weaviate_semdexer/relevance.go (0 hunks)
- app/services/semdex/semdexer/weaviate_semdexer/search.go (3 hunks)
- internal/infrastructure/weaviate/weaviate.go (1 hunks)
💤 Files with no reviewable changes (6)
- app/services/semdex/semdexer/semdexer.go
- app/services/semdex/semdex.go
- app/services/semdex/semdexer/weaviate_semdexer/indexer_test.go
- app/services/library/node_read/reader.go
- app/services/semdex/semdexer/weaviate_semdexer/relevance.go
- app/services/collection/collection_read/reader.go
🔇 Additional comments (18)
app/services/semdex/semdexer/weaviate_semdexer/indexer.go (7)
`17-21`: **Ensure content is non-empty before indexing.** The code checks for empty chunks, which is good for avoiding unnecessary indexing. Consider adding logs or metrics to monitor how often empty content occurs if you need deeper visibility.
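A minimal sketch of that visibility using the stdlib `log/slog`; the accessor names on `object` are assumptions for this sketch, and the project's actual logger may differ:

```go
chunks := object.GetContent().Split()
if len(chunks) == 0 {
	// Record how often content produces no indexable chunks.
	slog.Debug("skipping semdex indexing: content produced no chunks",
		slog.String("datagraph_id", object.GetID().String()),
	)
	return nil
}
```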
`23-28`: **Indexing each chunk.** Looping through each chunk is a solid approach to splitting content for fine-grained indexing. Ensure downstream components handle multiple documents per resource properly, as the PR describes a shift to multiple vector index documents per resource.
`30-31`: **Early return on success.** Returning nil after successful indexing is concise and keeps the control flow easy to follow.
`43-47`: **Properties map.** Using a map for dynamic properties is flexible. However, you might want to consider validating or trimming the "content" value if Weaviate imposes a size limit.
`50-51`: **Avoid redundant updates.** The check that compares existing properties before updating avoids unnecessary writes. This is a good performance optimization.
`60-66`: **Creator vs. Updater.** Using Updater() when `exists` is true and Creator() otherwise is correct. Confirm that partial property updates in Weaviate do not overwrite any non-supplied properties (see the sketch below).
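One way to get merge semantics explicitly is the client's merge-style update; a sketch assuming the weaviate-go-client v4 API, where `WithMerge()` issues a PATCH so non-supplied properties are preserved rather than replaced:

```go
// Partial update: only keys present in props are touched (sketch only).
err := client.Data().Updater().
	WithClassName(className).
	WithID(chunkID.String()).
	WithProperties(props).
	WithMerge(). // PATCH semantics: merge instead of replace
	Do(ctx)
```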
`85-108`: **existsByContent function.** The function effectively encapsulates the retrieval logic for checking whether an object already exists. Good approach! One recommendation: add more robust error checks for non-404 status codes if you need granular error handling.
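A sketch of more granular handling, assuming the weaviate-go-client v4 `fault.WeaviateClientError` type: only a 404 should be read as "does not exist", while anything else should surface as an error rather than being treated as absence:

```go
objs, err := client.Data().ObjectsGetter().
	WithClassName(className).
	WithID(chunkID.String()).
	Do(ctx)
if err != nil {
	var clientErr *fault.WeaviateClientError
	if errors.As(err, &clientErr) && clientErr.StatusCode == http.StatusNotFound {
		return false, nil // genuinely absent
	}
	return false, err // auth/network/5xx: not the same as "absent"
}

return len(objs) > 0, nil
```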
app/resources/datagraph/content.go (4)
`304-307`: **Constant for maximum sentence size.** Defining a clear constant (`roughMaxSentenceSize = 350`) is helpful, and a reasonable upper bound for typical sentences.
`308-351`: **Consider nil check on c.html.** As noted in a previous comment, a nil check would prevent a potential panic if `c.html` is unexpectedly nil.
`369-419`: **splitearly splitting logic.** The recursive approach is effective for chunking text at sentence boundaries, falling back to spaces. Consider more robust boundary checks or language-specific libraries if you need better sentence detection in future improvements.
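For readers following along, a self-contained sketch of the boundary-first splitting idea described above (not the PR's exact implementation):

```go
// splitAtBoundary breaks s into chunks of at most max characters, preferring
// sentence terminators, then spaces, then a hard cut when neither is found.
func splitAtBoundary(s string, max int) []string {
	if len(s) <= max {
		return []string{s}
	}

	window := s[:max]
	cut := strings.LastIndexAny(window, ".!?")
	if cut == -1 {
		cut = strings.LastIndex(window, " ")
	}
	if cut <= 0 {
		cut = max - 1 // no boundary found: hard cut
	}

	head := strings.TrimSpace(s[:cut+1])
	rest := strings.TrimSpace(s[cut+1:])
	if rest == "" {
		return []string{head}
	}

	return append([]string{head}, splitAtBoundary(rest, max)...)
}
```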
`421-434`: **textfromnode extraction.** Straightforward text extraction. If performance becomes an issue, you might look into specialized HTML text-extraction libraries; for now, this serves well.
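For reference, the extraction pattern is a recursive walk that appends text nodes into a `strings.Builder`; a minimal sketch approximating it (not the exact code from the PR):

```go
// textFrom collects the concatenated text content of an HTML node subtree.
func textFrom(n *html.Node) string {
	b := strings.Builder{}

	var walk func(n *html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.TextNode {
			b.WriteString(n.Data)
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(n)

	return b.String()
}
```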
app/services/semdex/semdexer/weaviate_semdexer/delete.go (1)
13-20
: Batch deletion approach
Utilizing the batch deleter with a filter on "datagraph_id" is more flexible than single-object deletion. This also sets the foundation for multi-chunk deletions if needed.
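A sketch of the filtered batch delete, assuming the weaviate-go-client v4 filters API:

```go
// Delete every chunk object belonging to this resource in one call (sketch).
resp, err := client.Batch().ObjectsBatchDeleter().
	WithClassName(className).
	WithWhere(filters.Where().
		WithPath([]string{"datagraph_id"}).
		WithOperator(filters.Equal).
		WithValueText(id.String())).
	Do(ctx)
if err != nil {
	return err
}
_ = resp // resp.Results reports how many objects matched and were removed
```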
app/services/semdex/semdexer/weaviate_semdexer/recommend.go (2)
`32-35`: **Handle empty chunk IDs.** Returning nil early when `chunkIDs` is empty helps prevent unnecessary queries. This is a clean, safe check.
`55-55`: **Increased certainty.** Raising the certainty threshold to 0.7 filters out less-confident recommendations. Make sure this threshold aligns with your desired user experience or business requirements.
app/services/semdex/semdexer/weaviate_semdexer/mapping.go (2)
Line range hint `61-66`: **LGTM: Score parsing enhancement.** The additional condition properly handles cases where the score is provided as a string instead of a distance, with appropriate error handling.
`121-127`: **LGTM: Efficient slice conversion.** The function correctly pre-allocates the result slice and performs the conversion efficiently.
app/services/semdex/semdexer/weaviate_semdexer/search.go (2)
`10-11`: **LGTM: Appropriate imports added for the new functionality.** The new imports (`xid` for ID handling and `lo` for functional operations) align well with the implementation requirements.
`110-112`: **Verify pagination behavior with deduplication.** The deduplication step is correctly placed before pagination. However, verify that the total count used for pagination remains accurate after deduplication.
```go
Model:      "text-embedding-3-small",
Dimensions: "3072",
Type:       "text",
},
```
💡 Codebase verification

There appears to be an inconsistency in the embedding model configuration across the codebase:
- internal/infrastructure/weaviate/weaviate.go uses `text-embedding-3-small`
- app/services/semdex/semdexer/chromem_semdexer/chromem.go uses `EmbeddingModelOpenAI3Large`

This mismatch could lead to inconsistent embedding dimensions and potential compatibility issues between different parts of the system.
- Consider aligning the embedding model choice across all components to ensure consistent behavior
- Update the `chromem_semdexer` to use the same model as Weaviate, or document why different models are needed
🔗 Analysis chain

LGTM! Verify OpenAI model compatibility.

The switch to `text-embedding-3-small` is a cost-effective choice that aligns well with the PR's content splitting strategy. One caveat: `text-embedding-3-small` produces 1536-dimensional embeddings, while 3072 is the output size of `text-embedding-3-large`, so the `Dimensions: "3072"` setting should be double-checked against the chosen model.

Let's verify the model configuration.

Consider adding a comment explaining the rationale for using the small model, such as:
```diff
 ModuleConfig: map[string]ModuleConfig{
 	"text2vec-openai": {
+		// Using the small model for better cost-efficiency while maintaining
+		// good performance with chunked content
 		Model:      "text-embedding-3-small",
 		Dimensions: "3072",
 		Type:       "text",
 	},
```
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
```bash
#!/bin/bash
# Description: Verify OpenAI model configuration across the codebase

# Check for any hardcoded references to the old model
rg "text-embedding-3-large"

# Check for any configuration files that might override these settings
fd -e yaml -e json -e toml | xargs rg "text-embedding-3"
```

Length of output: 84
Script:
```bash
#!/bin/bash
# Let's try a broader search for model configurations and OpenAI-related settings

# Search for any OpenAI model configurations
rg -i "openai.*model|model.*openai" -A 3 -B 3

# Search for embedding-related configurations
rg -i "embedding.*model|model.*embedding" -A 3 -B 3

# Search for dimension configurations that might be related
rg "Dimensions.*3072|3072.*dimension" -A 3 -B 3

# Look for configuration files specifically
fd "config|settings" -e yaml -e json -e toml | xargs cat
```

Length of output: 5424
```go
func dedupeChunks(results []*datagraph.Ref) []*datagraph.Ref {
	groupedByID := lo.GroupBy(results, func(r *datagraph.Ref) xid.ID { return r.ID })

	// for each grouped result, compute the average score and flatten
	// the list of results into a single result per ID
	// this is a naive approach to deduplication

	list := lo.Values(groupedByID)

	deduped := dt.Reduce(list, func(acc []*datagraph.Ref, curr []*datagraph.Ref) []*datagraph.Ref {
		first := curr[0]
		score := 0.0

		for _, r := range curr {
			score += r.Relevance
		}

		next := &datagraph.Ref{
			ID:        first.ID,
			Kind:      first.Kind,
			Relevance: score / float64(len(curr)),
		}

		return append(acc, next)
	}, []*datagraph.Ref{})

	return deduped
}
```
Critical issues in deduplication implementation
The deduplication logic has several potential issues:
- It assumes all chunks from the same ID have identical Kind
- It drops potentially important fields from the Ref struct
- Results might need re-sorting by relevance after averaging
- No explicit handling of empty input
Consider this improved implementation, shown as a full function. It handles empty input, copies all fields from the most relevant chunk so nothing is silently dropped, assumes chunks grouped under one ID share the same Kind (documented in a comment, since the single-value signature cannot return an error), and re-sorts by averaged relevance using the stdlib `sort` package:

```go
func dedupeChunks(results []*datagraph.Ref) []*datagraph.Ref {
	if len(results) == 0 {
		return results
	}

	groupedByID := lo.GroupBy(results, func(r *datagraph.Ref) xid.ID { return r.ID })

	deduped := make([]*datagraph.Ref, 0, len(groupedByID))

	for _, chunks := range groupedByID {
		// Calculate the average relevance across all chunks for this ID.
		totalScore := 0.0
		for _, c := range chunks {
			totalScore += c.Relevance
		}

		// Copy all fields from the most relevant chunk so no fields are
		// dropped. Chunks for the same ID are assumed to share a Kind.
		mostRelevant := lo.MaxBy(chunks, func(a, b *datagraph.Ref) bool {
			return a.Relevance > b.Relevance
		})

		merged := *mostRelevant
		merged.Relevance = totalScore / float64(len(chunks))

		deduped = append(deduped, &merged)
	}

	// Re-sort by averaged relevance, highest first, since averaging can
	// change the ordering of results.
	sort.Slice(deduped, func(i, j int) bool {
		return deduped[i].Relevance > deduped[j].Relevance
	})

	return deduped
}
```
Committable suggestion skipped: line range outside the PR's diff.
Force-pushed from b18f511 to 0536943 (Compare)
Embeddings work well when individual sentences are indexed instead of entire documents. Currently, the embedding provider will trim at 1k (I think) characters before embedding, which can lose important context and content when indexing.

Eventually, the goal is to split large content pieces into either sentences or, worst case, spans of text bounded by spaces or at pre-set lengths (if no sentence boundaries or spaces can be found).

This commit includes a basic algorithm which operates on `datagraph.Content` to produce a slice of strings representing (roughly) each paragraph of the text. Some extra logic ensures edge cases such as long spans of contiguous characters or long sentences with no punctuation are handled.

The rest of this PR will focus on using the chunked text spans in indexing, which will require some internal changes to how vectoring works. Currently, there's an assumption that resources map 1:1 to vector index documents. After this change, resources will map 1:n, where n is the number of text chunks from the original content.