use content chunking for indexed items #318

Merged
merged 6 commits into main from chunked-embeddings on Dec 19, 2024

Conversation

Southclaws (Owner)

Embeddings work well when individual sentences are indexed instead of entire documents. Currently, the embedding provider will trim at 1k (I think) characters before embedding, which can lose important context and content when indexing.

Eventually, the goal is to split large content pieces into either sentences or, worst case, spans of text bounded by spaces or at pre-set lengths (if no sentence boundaries or spaces can be found).

This commit includes a basic algorithm which operates on datagraph.Content to produce a slice of strings representing (roughly) each paragraph of the text. Some extra logic ensures edge cases such as long spans of contiguous characters or long sentences with no punctuation are handled.

The rest of this PR will focus on using the chunked text spans in indexing which will require some internal changes to how vectoring works. Currently, there's an assumption that resources map 1:1 to vector index documents. After this change, this will not be the case as resources will map 1:n where n is the number of text chunks from the original content.
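
As a rough illustration of the fallback strategy described above (a hedged sketch only, not the PR's actual splitearly implementation, which operates on datagraph.Content and HTML nodes), a splitter can try sentence boundaries first, then spaces, then a hard cut:

package main

import "strings"

// splitRough breaks text into chunks of at most max bytes, preferring
// sentence-ending punctuation, then spaces, then a hard cut when no boundary
// exists (e.g. long contiguous character runs). Note: byte-based indexing is
// used for brevity; a production version should respect rune boundaries.
func splitRough(text string, max int) []string {
    var out []string
    for len(text) > max {
        cut := strings.LastIndexAny(text[:max], ".!?")
        if cut <= 0 {
            cut = strings.LastIndex(text[:max], " ")
        }
        if cut <= 0 {
            cut = max - 1 // no boundary found: hard cut at the limit
        }
        out = append(out, strings.TrimSpace(text[:cut+1]))
        text = strings.TrimSpace(text[cut+1:])
    }
    if text != "" {
        out = append(out, text)
    }
    return out
}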

vercel bot commented Dec 15, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

2 Skipped Deployments
Name                Status      Updated (UTC)
storyden            ⬜️ Ignored  Dec 19, 2024 6:46pm
storyden-homepage   ⬜️ Ignored  Dec 19, 2024 6:46pm

coderabbitai bot commented Dec 15, 2024

Caution

Review failed

The pull request is closed.

📝 Walkthrough

The changes introduce new functionality to the Content type in the datagraph package for handling HTML content. A new Split() method is implemented to break down HTML content into manageable text chunks. The implementation includes a constant roughMaxSentenceSize to control chunk sizes, along with helper functions chunksFromNodes(), splitearly(), and textfromnode() to process and split HTML nodes effectively. Corresponding test cases are added to verify the new splitting mechanism's behavior with both simple and complex HTML inputs. Additionally, significant modifications are made to remove relevance scoring functionalities from various services, streamlining their implementations.
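
To make the 1:n shift concrete, here is a minimal sketch of how an indexer might consume Split() (the function name indexResource and the indexChunk callback signature are illustrative assumptions, not the PR's exact API):

// Sketch: one vector-index document per chunk rather than per resource.
func indexResource(
    ctx context.Context,
    id xid.ID,
    content datagraph.Content,
    indexChunk func(ctx context.Context, id xid.ID, chunk string) error,
) error {
    chunks := content.Split()
    if len(chunks) == 0 {
        return nil // nothing indexable for this resource
    }
    for _, chunk := range chunks {
        if err := indexChunk(ctx, id, chunk); err != nil {
            return err
        }
    }
    return nil
}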

Changes

  • app/resources/datagraph/content.go: Added constant roughMaxSentenceSize (350); implemented Split() method for Content; added helper functions chunksFromNodes(), splitearly(), and textfromnode()
  • app/resources/datagraph/content_test.go: Added TestSplit(), TestSplitMinimal(), and TestSplitLong() test functions
  • app/services/collection/collection_read/reader.go: Removed semdex field from Hydrator struct; updated New constructor to remove the semdex parameter
  • app/services/library/node_read/reader.go: Removed scorer field from HydratedQuerier struct; updated New constructor to remove the scorer parameter
  • app/services/semdex/semdex.go: Removed RelevanceScorer interface from Querier
  • app/services/semdex/semdexer/semdexer.go: Removed fx.As(new(semdex.RelevanceScorer)) from Build function
  • app/services/semdex/semdexer/weaviate_semdexer/delete.go: Updated Delete method to use batch deletion
  • app/services/semdex/semdexer/weaviate_semdexer/indexer.go: Modified Index method to process content as chunks; updated indexChunk method for chunk ID handling
  • app/services/semdex/semdexer/weaviate_semdexer/mapping.go: Added functions for chunk ID generation and object ID handling
  • app/services/semdex/semdexer/weaviate_semdexer/recommend.go: Updated RecommendRefs method to use chunk IDs
  • app/services/semdex/semdexer/weaviate_semdexer/relevance.go: Removed ScoreRelevance method
  • app/services/semdex/semdexer/weaviate_semdexer/search.go: Added dedupeChunks function for deduplication in SearchRefs
  • internal/infrastructure/weaviate/weaviate.go: Updated Weaviate client configuration and error handling
  • app/resources/datagraph/ref.go: Added methods to implement sort.Interface for RefList

Poem

🐰 Splitting HTML, chunk by chunk,
With rabbit-like precision and spunk!
Sentences dance, paragraphs flow,
Breaking barriers, watch content grow!
A coding adventure, swift and bright! 🔍

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b21701a and 0536943.

📒 Files selected for processing (15)
  • app/resources/datagraph/content.go (1 hunks)
  • app/resources/datagraph/content_test.go (1 hunks)
  • app/resources/datagraph/ref.go (1 hunks)
  • app/services/collection/collection_read/reader.go (0 hunks)
  • app/services/library/node_read/reader.go (0 hunks)
  • app/services/semdex/semdex.go (0 hunks)
  • app/services/semdex/semdexer/semdexer.go (0 hunks)
  • app/services/semdex/semdexer/weaviate_semdexer/delete.go (1 hunks)
  • app/services/semdex/semdexer/weaviate_semdexer/indexer.go (3 hunks)
  • app/services/semdex/semdexer/weaviate_semdexer/indexer_test.go (0 hunks)
  • app/services/semdex/semdexer/weaviate_semdexer/mapping.go (3 hunks)
  • app/services/semdex/semdexer/weaviate_semdexer/recommend.go (2 hunks)
  • app/services/semdex/semdexer/weaviate_semdexer/relevance.go (0 hunks)
  • app/services/semdex/semdexer/weaviate_semdexer/search.go (3 hunks)
  • internal/infrastructure/weaviate/weaviate.go (1 hunks)

coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (5)
app/resources/datagraph/content.go (4)

305-306: Add documentation for the constant.

Consider adding a comment explaining the rationale behind the 350-character limit and any research/empirical data that supports this choice.
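
For example (a hedged sketch; the rationale wording is an assumption, only the value 350 comes from the PR):

// roughMaxSentenceSize is the approximate upper bound, in characters, for a
// single chunk sent to the embedding provider. Larger blocks are split at
// sentence or word boundaries before indexing. 350 is a heuristic and should
// be tuned against the embedding model's effective input size.
const roughMaxSentenceSize = 350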


350-350: Remove or implement the TODO comment.

The TODO comment suggests incomplete implementation, but the code below appears to be implemented. Either remove the comment or clarify what additional splitting logic is needed.


360-410: Consider renaming the function for clarity.

The function name splitearly is not immediately clear about its purpose. Consider a more descriptive name like splitTextAtBoundaries or splitTextWithFallback.


412-425: LGTM! Consider adding documentation.

The implementation is efficient using strings.Builder. Consider adding documentation to explain the recursive approach for future maintainers.

app/resources/datagraph/content_test.go (1)

129-149: Add test cases for edge cases.

While the basic test is good, consider adding test cases for the following (a rough sketch follows the list):

  • Empty content
  • Content with only non-text nodes
  • Content with nested block elements
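
A rough sketch of such cases (NewRichText is assumed here as the Content constructor, and the expected values are assumptions about the intended behaviour; adjust both to the actual API and semantics):

func TestSplitEdgeCases(t *testing.T) {
    a := assert.New(t)

    // Empty content: must not panic and should yield no chunks.
    empty, err := NewRichText("")
    a.NoError(err)
    a.Empty(empty.Split())

    // Only non-text nodes: no meaningful chunks expected.
    media, err := NewRichText(`<p><img src="a.png"></p>`)
    a.NoError(err)
    a.Empty(media.Split())

    // Nested block elements: the outer block should collapse to one chunk.
    nested, err := NewRichText(`<blockquote><p>quoted text</p></blockquote>`)
    a.NoError(err)
    a.Len(nested.Split(), 1)
}
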
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 130e1b7 and f777583.

📒 Files selected for processing (2)
  • app/resources/datagraph/content.go (1 hunks)
  • app/resources/datagraph/content_test.go (1 hunks)
🔇 Additional comments (1)
app/resources/datagraph/content_test.go (1)

151-181: LGTM! Comprehensive test for long content.

The test thoroughly verifies the splitting behavior with a complex HTML structure and various content types.

Comment on lines 308 to 351
func (c Content) Split() []string {
    r := []html.Node{}

    // first, walk the tree for the top-most block-content nodes.
    var walk func(n *html.Node)
    walk = func(n *html.Node) {
        if n.Type == html.ElementNode {
            switch n.DataAtom {
            case
                atom.H1,
                atom.H2,
                atom.H3,
                atom.H4,
                atom.H5,
                atom.H6,
                atom.Blockquote,
                atom.Pre,
                atom.P:
                r = append(r, *n)
                // once split, exit out of this branch
                return
            }
        }

        for c := n.FirstChild; c != nil; c = c.NextSibling {
            walk(c)
        }
    }
    walk(c.html)

    // now, iterate these top level nodes and split any that are "too big"
    chunks := chunksFromNodes(r, roughMaxSentenceSize)

    return chunks
}

⚠️ Potential issue

Add nil check for HTML tree.

The Split() method should handle the case where c.html is nil to prevent potential panic.

 func (c Content) Split() []string {
+    if c.html == nil {
+        return nil
+    }
     r := []html.Node{}

Southclaws changed the title from "add a basic algo to content for splitting" to "use content chunking for indexed items" on Dec 18, 2024
coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (8)
app/services/semdex/semdexer/weaviate_semdexer/indexer.go (1)

33-38: indexChunk method consistency
The logic to derive the chunkID and check existence is well-structured. Just ensure that the chunkID generation remains consistent across the entire system (e.g., in delete flows and other references).

app/resources/datagraph/content.go (1)

353-367: chunksFromNodes
This function elegantly handles large text by delegating to splitearly(). One suggestion is to handle extremely large strings with care for potential memory usage, but for typical use cases, this is fine.

app/services/semdex/semdexer/weaviate_semdexer/recommend.go (1)

37-43: Single-chunk selection
Selecting just chunkIDs[0] is a short-term solution for demonstration. If you plan to handle multiple chunks, consider how to combine or average vectors for better recommendations.
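
One simple way to combine per-chunk embeddings into a single query vector is an element-wise mean; a minimal sketch, independent of the Weaviate client (float32 vectors are an assumption):

// meanVector averages a set of equal-length embedding vectors so the result
// can be used as a single nearVector-style query.
func meanVector(vectors [][]float32) []float32 {
    if len(vectors) == 0 {
        return nil
    }
    dim := len(vectors[0])
    out := make([]float32, dim)
    for _, v := range vectors {
        if len(v) != dim {
            return nil // mismatched dimensions; caller decides how to handle
        }
        for i, x := range v {
            out[i] += x
        }
    }
    for i := range out {
        out[i] /= float32(len(vectors))
    }
    return out
}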

app/services/semdex/semdexer/weaviate_semdexer/mapping.go (2)

99-105: Add return value documentation

Consider adding a comment explaining the deterministic nature of the generated UUID and its uniqueness guarantees.

 func generateChunkID(id xid.ID, chunk string) uuid.UUID {
+    // Returns a deterministic UUID v3 generated from the combination of object ID and chunk content.
+    // The UUID is guaranteed to be unique for each unique (id, chunk) pair.
     // We don't currently support sharing chunks across content nodes, so append
     // the object's ID to the chunk's hash, to ensure it's unique to the object.
     payload := []byte(append(id.Bytes(), chunk...))

107-115: Reduce code duplication with generateChunkID

The UUID generation logic is duplicated. Consider refactoring to reuse the existing generateChunkID function.

 func chunkIDsFor(id xid.ID) func(chunk string) uuid.UUID {
     return func(chunk string) uuid.UUID {
-        // We don't currently support sharing chunks across content nodes, so append
-        // the object's ID to the chunk's hash, to ensure it's unique to the object.
-        payload := []byte(append(id.Bytes(), chunk...))
-
-        return uuid.NewHash(fnv.New128(), uuid.NameSpaceOID, payload, 4)
+        return generateChunkID(id, chunk)
     }
 }
app/resources/datagraph/content_test.go (3)

129-149: Enhance test coverage with specific assertions

The test only verifies the number of segments but doesn't validate their content. Consider adding assertions for the expected content of each segment.

     ps := c.Split()
     a.Len(ps, 5)
+    a.Equal("Here's a paragraph. It's pretty neat.", ps[0])
+    a.Equal("Here's the rest of the text.", ps[1])
+    a.Equal("neat photo right?", ps[2])
+    a.Equal("This is quite a long post, the summary, should just be the first 128 characters rounded down to the nearest space.", ps[3])

151-163: Add content verification to minimal test case

The test verifies the length but not the content. Consider adding an assertion to verify the correct handling of HTML entities.

     ps := c.Split()
     a.Len(ps, 1)
+    a.Equal("I've tried everything, but for some reason it seems impossible to find datasets that includes a simple list of the councils in England sorted by their group, and a list of covid cases also sorted by councils. I'm not British so it may be a lack of knowledge of how their government sites work. Anyone know a place to find these?", ps[0])

165-195: Consider using neutral test data

While the test coverage is thorough, the test data contains potentially controversial political content. Consider using more neutral Lorem Ipsum text or technical documentation for test cases.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f777583 and b21701a.

📒 Files selected for processing (14)
  • app/resources/datagraph/content.go (1 hunks)
  • app/resources/datagraph/content_test.go (1 hunks)
  • app/services/collection/collection_read/reader.go (0 hunks)
  • app/services/library/node_read/reader.go (0 hunks)
  • app/services/semdex/semdex.go (0 hunks)
  • app/services/semdex/semdexer/semdexer.go (0 hunks)
  • app/services/semdex/semdexer/weaviate_semdexer/delete.go (1 hunks)
  • app/services/semdex/semdexer/weaviate_semdexer/indexer.go (3 hunks)
  • app/services/semdex/semdexer/weaviate_semdexer/indexer_test.go (0 hunks)
  • app/services/semdex/semdexer/weaviate_semdexer/mapping.go (3 hunks)
  • app/services/semdex/semdexer/weaviate_semdexer/recommend.go (2 hunks)
  • app/services/semdex/semdexer/weaviate_semdexer/relevance.go (0 hunks)
  • app/services/semdex/semdexer/weaviate_semdexer/search.go (3 hunks)
  • internal/infrastructure/weaviate/weaviate.go (1 hunks)
💤 Files with no reviewable changes (6)
  • app/services/semdex/semdexer/semdexer.go
  • app/services/semdex/semdex.go
  • app/services/semdex/semdexer/weaviate_semdexer/indexer_test.go
  • app/services/library/node_read/reader.go
  • app/services/semdex/semdexer/weaviate_semdexer/relevance.go
  • app/services/collection/collection_read/reader.go
🔇 Additional comments (18)
app/services/semdex/semdexer/weaviate_semdexer/indexer.go (7)

17-21: Ensure content is non-empty before indexing
The code checks for empty chunks. This is good for avoiding unnecessary indexing. Consider adding logs or metrics to monitor how often empty content occurs if you need deeper visibility.


23-28: Indexing each chunk
Looping through each chunk is a solid approach to split content for fine-grained indexing. Ensure downstream components handle multiple documents per resource properly, as the PR mentioned a shift to multiple vector index documents per resource.


30-31: Early return on success
Returning nil after successful indexing is concise and clear. This approach is straightforward and makes the control flow easy to follow.


43-47: Properties map
Using a map for dynamic properties is flexible. However, you might want to consider validating "content" lengths or trimming it if there's a size limit in Weaviate.
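
If trimming ever becomes necessary, a rune-safe truncation helper would be enough; a sketch (the limit itself would have to come from the store's actual constraints, which are not specified here):

// truncateRunes caps a string at max runes without splitting a multi-byte
// character.
func truncateRunes(s string, max int) string {
    r := []rune(s)
    if len(r) <= max {
        return s
    }
    return string(r[:max])
}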


50-51: Avoid redundant updates
You have a check to compare existing properties before updating, which avoids unnecessary writes. This is a good performance optimization.


60-66: Creator vs. Updater
The approach of using Updater() when exists is true, and Creator() otherwise, is correct. Confirm that partial property updates in Weaviate do not overwrite any non-supplied properties.


85-108: existsByContent function
The function effectively encapsulates the retrieval logic to check if an object already exists. Good approach! One recommendation is to add more robust error checks for non-404 status codes if you need granular error handling.

app/resources/datagraph/content.go (4)

304-307: Constant for maximum sentence size
Defining a clear constant (roughMaxSentenceSize = 350) is helpful. This is a reasonable upper bound for typical sentences.


308-351: Consider nil check on c.html
Similar to a previous comment, a nil check would prevent potential panics if c.html is unexpectedly nil.


369-419: splitearly splitting logic
The recursive approach is effective for chunking text at sentence boundaries, falling back to spaces. Consider more robust boundary checks or language-specific libraries if you need better sentence detection in future improvements.


421-434: textfromnode extraction
Straightforward text extraction. If performance becomes an issue, you might look into specialized HTML text extraction libraries. For now, this serves well.
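
For reference, a minimal version of this kind of extraction over golang.org/x/net/html nodes could look like the following (a sketch, not the PR's exact textfromnode; assumes the strings and x/net/html packages are imported):

// collectText recursively appends all text content beneath n to sb.
func collectText(n *html.Node, sb *strings.Builder) {
    if n == nil {
        return
    }
    if n.Type == html.TextNode {
        sb.WriteString(n.Data)
    }
    for c := n.FirstChild; c != nil; c = c.NextSibling {
        collectText(c, sb)
    }
}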

app/services/semdex/semdexer/weaviate_semdexer/delete.go (1)

13-20: Batch deletion approach
Utilizing the batch deleter with a filter on "datagraph_id" is more flexible than single-object deletion. This also sets the foundation for multi-chunk deletions if needed.

app/services/semdex/semdexer/weaviate_semdexer/recommend.go (2)

32-35: Handle empty chunk IDs
Returning nil early when chunkIDs is empty helps prevent unnecessary queries. This is a clean, safe check.


55-55: Increased certainty
Raising the certainty threshold to 0.7 filters out less-confident recommendations. Make sure this threshold aligns with your desired user experience or business requirements.

app/services/semdex/semdexer/weaviate_semdexer/mapping.go (2)

Line range hint 61-66: LGTM: Score parsing enhancement

The additional condition properly handles cases where the score is provided as a string instead of distance, with appropriate error handling.


121-127: LGTM: Efficient slice conversion

The function correctly pre-allocates the result slice and performs the conversion efficiently.

app/services/semdex/semdexer/weaviate_semdexer/search.go (2)

10-11: LGTM: Appropriate imports added for the new functionality

The new imports (xid for ID handling and lo for functional operations) align well with the implementation requirements.


110-112: Verify pagination behavior with deduplication

The deduplication step is correctly placed before pagination. However, we should verify that the total count used for pagination remains accurate after deduplication.

Comment on lines +104 to 107
Model: "text-embedding-3-small",
Dimensions: "3072",
Type: "text",
},

💡 Codebase verification

⚠️ Model configuration mismatch detected

There appears to be an inconsistency in the embedding model configuration across the codebase:

  • internal/infrastructure/weaviate/weaviate.go uses text-embedding-3-small
  • app/services/semdex/semdexer/chromem_semdexer/chromem.go uses EmbeddingModelOpenAI3Large

This mismatch could lead to inconsistent embedding dimensions and potential compatibility issues between different parts of the system.

  • Consider aligning the embedding model choice across all components to ensure consistent behavior
  • Update the chromem_semdexer to use the same model as Weaviate, or document why different models are needed
🔗 Analysis chain

LGTM! Verify OpenAI model compatibility.

The switch to text-embedding-3-small is a cost-effective choice that aligns well with the PR's content splitting strategy. Note, however, that text-embedding-3-small produces 1536-dimensional vectors, while 3072 is the output size of text-embedding-3-large, so the Dimensions value of "3072" should be double-checked against the chosen model.

Let's verify the model configuration:

Consider adding a comment explaining the rationale for using the small model, such as:

 ModuleConfig: map[string]ModuleConfig{
   "text2vec-openai": {
+    // Using the small model for better cost-efficiency while maintaining
+    // good performance with chunked content
     Model:      "text-embedding-3-small",
     Dimensions: "3072",
     Type:       "text",
   },
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify OpenAI model configuration across the codebase

# Check for any hardcoded references to the old model
rg "text-embedding-3-large"

# Check for any configuration files that might override these settings
fd -e yaml -e json -e toml | xargs rg "text-embedding-3"

Length of output: 84


Script:

#!/bin/bash
# Let's try a broader search for model configurations and OpenAI-related settings

# Search for any OpenAI model configurations
rg -i "openai.*model|model.*openai" -A 3 -B 3

# Search for embedding-related configurations
rg -i "embedding.*model|model.*embedding" -A 3 -B 3

# Search for dimension configurations that might be related
rg "Dimensions.*3072|3072.*dimension" -A 3 -B 3

# Look for configuration files specifically
fd "config|settings" -e yaml -e json -e toml | xargs cat

Length of output: 5424

Comment on lines 147 to 187
func dedupeChunks(results []*datagraph.Ref) []*datagraph.Ref {
    groupedByID := lo.GroupBy(results, func(r *datagraph.Ref) xid.ID { return r.ID })

    // for each grouped result, compute the average score and flatten
    // the list of results into a single result per ID
    // this is a naive approach to deduplication

    list := lo.Values(groupedByID)

    deduped := dt.Reduce(list, func(acc []*datagraph.Ref, curr []*datagraph.Ref) []*datagraph.Ref {
        first := curr[0]
        score := 0.0

        for _, r := range curr {
            score += r.Relevance
        }

        next := &datagraph.Ref{
            ID:        first.ID,
            Kind:      first.Kind,
            Relevance: score / float64(len(curr)),
        }

        return append(acc, next)
    }, []*datagraph.Ref{})

    return deduped
}

⚠️ Potential issue

Critical issues in deduplication implementation

The deduplication logic has several potential issues:

  1. It assumes all chunks from the same ID have identical Kind
  2. It drops potentially important fields from the Ref struct
  3. Results might need re-sorting by relevance after averaging
  4. No explicit handling of empty input

Consider this improved implementation:

 func dedupeChunks(results []*datagraph.Ref) []*datagraph.Ref {
+    if len(results) == 0 {
+        return results
+    }
+
     groupedByID := lo.GroupBy(results, func(r *datagraph.Ref) xid.ID { return r.ID })
 
-    list := lo.Values(groupedByID)
-
-    deduped := dt.Reduce(list, func(acc []*datagraph.Ref, curr []*datagraph.Ref) []*datagraph.Ref {
-        first := curr[0]
-        score := 0.0
-
-        for _, r := range curr {
-            score += r.Relevance
-        }
-
-        next := &datagraph.Ref{
-            ID:        first.ID,
-            Kind:      first.Kind,
-            Relevance: score / float64(len(curr)),
-        }
-
-        return append(acc, next)
-    }, []*datagraph.Ref{})
+    deduped := make([]*datagraph.Ref, 0, len(groupedByID))
+    
+    for _, chunks := range groupedByID {
+        if len(chunks) == 0 {
+            continue
+        }
+        
+        // Verify all chunks have the same Kind; if a group mixes kinds,
+        // keep its refs as-is rather than merging across kinds.
+        kind := chunks[0].Kind
+        mixed := false
+        for _, c := range chunks[1:] {
+            if c.Kind != kind {
+                mixed = true
+                break
+            }
+        }
+        if mixed {
+            deduped = append(deduped, chunks...)
+            continue
+        }
+        
+        // Calculate average relevance
+        totalScore := 0.0
+        for _, c := range chunks {
+            totalScore += c.Relevance
+        }
+        
+        // Copy all fields from the most relevant chunk
+        mostRelevant := lo.MaxBy(chunks, func(a, b *datagraph.Ref) bool {
+            return a.Relevance > b.Relevance
+        })
+        
+        merged := *mostRelevant // Copy all fields
+        merged.Relevance = totalScore / float64(len(chunks))
+        
+        deduped = append(deduped, &merged)
+    }
+    
+    // Re-sort by relevance
+    return lo.SortBy(deduped, func(a, b *datagraph.Ref) bool {
+        return a.Relevance > b.Relevance
+    })
 
-    return deduped
 }

Committable suggestion skipped: line range outside the PR's diff.

barneyferry and others added 6 commits December 19, 2024 18:42
Southclaws merged commit 6bde4e2 into main Dec 19, 2024
2 of 4 checks passed
Southclaws deleted the chunked-embeddings branch December 19, 2024 18:43
coderabbitai bot mentioned this pull request Jan 14, 2025