Copilot retrieves irrelevant notes #1224

Open
jsrdcht opened this issue Feb 9, 2025 · 4 comments
Labels
bug Something isn't working


jsrdcht commented Feb 9, 2025

Copilot version:
2.8.4

Describe how to reproduce
Using jina-embedding-v2-base-zh as the embedding model in both the Copilot plugin and the Smart Connections plugin, the relevant notes retrieved by Copilot are weird and irrelevant. For example, in my screenshot focusing on math-related notes, Smart Connections retrieves a large number of math notes, while Copilot retrieves notes that are mostly unrelated to mathematics.
Although I have tried different models, the issue persists. I am wondering whether the problem lies in how Copilot uses the model, or somewhere else.

Screenshots
Copilot plugin: (screenshot)
Smart Connections: (screenshot)

@logancyang logancyang moved this to Ready in Copilot Kanban Feb 9, 2025
@logancyang logancyang added the bug Something isn't working label Feb 9, 2025
logancyang (Owner) commented:

I suspect this is a hybrid search failure in the full-text part. cc @zeroliu

zeroliu (Collaborator) commented Feb 10, 2025

Thanks for reporting the issue. Could you please share the note that had the issue with us? It would also be super helpful if you could include a few relevant notes to facilitate local testing.

Do all notes show random relevant notes, or does it only happen to specific ones?

@logancyang I don't think we use hybrid search in the context of relevant notes. The similarity score is based only on vector search. This looks similar to the irrelevant notes in Chinese that I ran into when implementing the feature.
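For context, a vector-only relevance ranking along these lines reduces to comparing embedding vectors directly, with no keyword component. A minimal sketch (not the plugin's actual code; `NoteEmbedding` and `rankRelevantNotes` are hypothetical names) would look like:

```typescript
// Minimal sketch: rank notes purely by cosine similarity of embeddings.
interface NoteEmbedding {
  path: string;
  vector: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function rankRelevantNotes(query: number[], notes: NoteEmbedding[], topK = 10) {
  return notes
    .map((n) => ({ path: n.path, score: cosineSimilarity(query, n.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```

If relevance really is computed this way, any two plugins using the same embedding model should produce broadly similar rankings, which is why the discrepancy reported above is surprising.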

jsrdcht (Author) commented Feb 10, 2025

Hi @zeroliu, I'm happy to share the errors I have observed.

Due to privacy reasons, I cannot share my notes; however, this error should be unrelated to the notes themselves. I have observed the retrieval error issue across different topic notes (including backdoor attacks, self-supervised learning, linear algebra, and even diary entries of less than 100 words). Additionally, this error is independent of the embedding models I used, which include jina-embedding-v2-zh, nomic-embedding-text, and bge-m3.

At the same time, I observed that this error may have persisted across multiple versions over a long period. I noticed the issue several months ago, but at that time Copilot did not display each note's similarity score in the UI, so I assumed it was due to the performance limitations of the embedding model.

I also noticed that Copilot produced unusually high similarity scores. In the screenshots below, Copilot reported an 80% similarity for multiple notes, while the Smart Connections plugin using the same embedding model (jina-embedding-v2-zh) reported a similarity of no more than 68%. By the way, this test note consists entirely of simple titles and image links.
(screenshots)
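One possible source of such a gap, purely as an assumption rather than a confirmed diagnosis, is how each plugin maps the raw cosine similarity to the percentage shown in the UI. For example, rescaling cosine from [-1, 1] to [0, 1] before display makes the same underlying score look noticeably higher:

```typescript
// Hypothetical illustration only: the same raw cosine similarity shown two ways.
const rawCosine = 0.5; // example value, not measured from either plugin

// Shown directly as a percentage:
const directPercent = rawCosine * 100; // 50%

// Rescaled from [-1, 1] to [0, 1] before display:
const rescaledPercent = ((rawCosine + 1) / 2) * 100; // 75%
```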

Below is the test note (Note 1) I used for retrieval and one of the retrieved notes (Note 2), which was reported with a 78% similarity score.

Note 1:

# Scatter plot

![[附件/Pasted image 20241115163623.png|300]]
# 2D distribution plot

![[附件/Pasted image 20241115203053.png]]


# Decision space

![[附件/Pasted image 20241115163707.png|200]]
The figure below is from [[思考与笔记/后门攻击/防御/training-stage defense/train-time poison detection/Training with More Confidence Mitigating Injected and Natural Backdoors During Training|Training with More Confidence Mitigating Injected and Natural Backdoors During Training]]![[附件/Pasted image 20241115202958.png]]


# Formula

![[附件/Pasted image 20241115173336.png|300]]

Note 2:

Complexity analysis
1. FLOPs
Floating-point operations: understood as the amount of computation; used to measure the time complexity of an algorithm/model.
2. FLOPS
Floating-point operations per second: understood as computation speed; a metric for hardware performance and model speed, i.e., a chip's compute capability.
3. MACCs
Multiply-accumulate operations: MACCs are roughly half of FLOPs; $w[0] \times x[0] \ldots$ is counted as one multiply-accumulate, i.e., one MACC.
4. Params
The number of parameters in the model; it directly determines the model size and also affects memory usage during inference. The unit is usually $M$ (millions); parameters are usually stored as float32, so the model size in bytes is 4 times the parameter count.
5. MAC
Memory access cost: the total amount of memory traffic incurred when the model/convolution layer completes one forward pass for a single input sample, i.e., the model's space complexity; the unit is bytes.
6. Memory bandwidth
Memory bandwidth determines how fast data can be moved from memory (vRAM) to the compute cores, and is a more representative metric than compute speed. Its value depends on the data transfer speed between memory and the compute cores, and on the number of separate parallel links in the bus between the two.

logancyang (Owner) commented Feb 11, 2025

@zeroliu is this something wrong with orama?

We probably need both unit and integration tests to ensure vector similarity is working as expected.
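As a starting point, a unit test for the similarity math might look like the sketch below (assuming a Vitest/Jest-style runner and a `cosineSimilarity` helper such as the one sketched earlier in this thread; the file path and function name are hypothetical, and this does not exercise Orama itself):

```typescript
import { describe, expect, it } from "vitest";

// Assumed helper (hypothetical module), e.g. the cosineSimilarity sketch above.
import { cosineSimilarity } from "./similarity";

describe("cosineSimilarity", () => {
  it("returns 1 for identical vectors", () => {
    expect(cosineSimilarity([1, 2, 3], [1, 2, 3])).toBeCloseTo(1);
  });

  it("returns 0 for orthogonal vectors", () => {
    expect(cosineSimilarity([1, 0], [0, 1])).toBeCloseTo(0);
  });

  it("stays within [-1, 1] for arbitrary vectors", () => {
    const score = cosineSimilarity([0.1, 0.9, 0.3], [0.4, 0.2, 0.8]);
    expect(score).toBeLessThanOrEqual(1);
    expect(score).toBeGreaterThanOrEqual(-1);
  });
});
```

An integration test would additionally index a few fixture notes (including non-English ones) and assert that the top results for a query note are the topically related fixtures.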
