Copilot retrieves irrelevant notes #1224

Open
jsrdcht opened this issue Feb 9, 2025 · 4 comments
Labels
bug Something isn't working


jsrdcht commented Feb 9, 2025

Copilot version:
2.8.4

Describe how to reproduce
Using jina-embedding-v2-base-zh as the embedding model in both the Copilot plugin and the Smart Connections plugin, the relevant notes retrieved by Copilot are weird and irrelevant. For example, in my screenshot focusing on math-related notes, Smart Connections retrieves a large number of math notes, while Copilot retrieves notes that are mostly unrelated to mathematics.
Although I have tried different models, the issue persists. I am wondering whether the problem lies in how Copilot uses the model, or somewhere else.

Screenshots
Copilot plugin: (screenshot)
Smart Connections: (screenshot)

@logancyang logancyang moved this to Ready in Copilot Kanban Feb 9, 2025
@logancyang logancyang added the bug Something isn't working label Feb 9, 2025
logancyang (Owner) commented:

I suspect this is a hybrid search failure in the full-text part. cc @zeroliu

zeroliu (Collaborator) commented Feb 10, 2025

Thanks for reporting the issue. Could you please share the note that had the issue with us? It would also be super helpful if you could include a few relevant notes to facilitate local testing.

Do all notes show random relevant notes, or does it only happen to specific ones?

@logancyang I don't think we use hybrid search in the context of relevant notes. The similarity score is based only on vector search. This looks similar to the irrelevant notes in Chinese that I ran into when implementing the feature.
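For context, a vector-only relevance ranking along these lines reduces to comparing embedding vectors directly, with no keyword component. A minimal sketch (not the plugin's actual code; `NoteEmbedding` and `rankRelevantNotes` are hypothetical names) would look like:

```typescript
// Minimal sketch: rank notes purely by cosine similarity of embeddings.
interface NoteEmbedding {
  path: string;
  vector: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function rankRelevantNotes(query: number[], notes: NoteEmbedding[], topK = 10) {
  return notes
    .map((n) => ({ path: n.path, score: cosineSimilarity(query, n.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```

If relevance really is computed this way, any two plugins using the same embedding model should produce broadly similar rankings, which is why the discrepancy reported above is surprising.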

jsrdcht (Author) commented Feb 10, 2025

Hi @zeroliu, I'm happy to share the errors I have observed.

Due to privacy reasons, I cannot share my notes; however, this error should be unrelated to the notes themselves. I have observed the retrieval error issue across different topic notes (including backdoor attacks, self-supervised learning, linear algebra, and even diary entries of less than 100 words). Additionally, this error is independent of the embedding models I used, which include jina-embedding-v2-zh, nomic-embedding-text, and bge-m3.

At the same time, I observed that this error may have persisted across multiple versions over a long period. I noticed the issue several months ago, but at that time Copilot did not display each note's similarity score in the UI, so I assumed it was due to the performance limitations of the embedding model.

I also noticed that Copilot produced unusually high similarity scores. In the screenshots below, Copilot reported an 80% similarity for multiple notes, while the Smart Connections plugin using the same embedding model (jina-embedding-v2-zh) reported a similarity of no more than 68%. By the way, this test note consists entirely of simple titles and image links.
(screenshots)
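One possible source of such a gap, purely as an assumption rather than a confirmed diagnosis, is how each plugin maps the raw cosine similarity to the percentage shown in the UI. For example, rescaling cosine from [-1, 1] to [0, 1] before display makes the same underlying score look noticeably higher:

```typescript
// Hypothetical illustration only: the same raw cosine similarity shown two ways.
const rawCosine = 0.5; // example value, not measured from either plugin

// Shown directly as a percentage:
const directPercent = rawCosine * 100; // 50%

// Rescaled from [-1, 1] to [0, 1] before display:
const rescaledPercent = ((rawCosine + 1) / 2) * 100; // 75%
```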

Below is the test note (Note 1) I used for retrieval and one of the retrieved notes (Note 2), which was reported with a 78% similarity score.

Note 1:

# Scatter plot

![[附件/Pasted image 20241115163623.png|300]]
# 2D distribution plot

![[附件/Pasted image 20241115203053.png]]


# Decision space

![[附件/Pasted image 20241115163707.png|200]]
The figure below is from [[思考与笔记/后门攻击/防御/training-stage defense/train-time poison detection/Training with More Confidence Mitigating Injected and Natural Backdoors During Training|Training with More Confidence Mitigating Injected and Natural Backdoors During Training]]![[附件/Pasted image 20241115202958.png]]


# Formula

![[附件/Pasted image 20241115173336.png|300]]

Note 2:

Complexity analysis
1. FLOPs
Floating-point operations: understood as the amount of computation; used to measure the time complexity of an algorithm/model.
2. FLOPS
Floating-point operations per second: understood as computation speed; a metric for hardware performance and model speed, i.e., a chip's compute capability.
3. MACCs
Multiply-accumulate operations: MACCs are roughly half of FLOPs; $w[0] \times x[0] \ldots$ is counted as one multiply-accumulate, i.e., one MACC.
4. Params
The number of parameters in the model; it directly determines the model size and also affects memory usage during inference. The unit is usually $M$ (millions); parameters are usually stored as float32, so the model size in bytes is 4 times the parameter count.
5. MAC
Memory access cost: the total amount of memory traffic incurred when the model/convolution layer completes one forward pass for a single input sample, i.e., the model's space complexity; the unit is bytes.
6. Memory bandwidth
Memory bandwidth determines how fast data can be moved from memory (vRAM) to the compute cores, and is a more representative metric than compute speed. Its value depends on the data transfer speed between memory and the compute cores, and on the number of separate parallel links in the bus between the two.

logancyang (Owner) commented Feb 11, 2025

@zeroliu is this something wrong with orama?

We probably need both unit and integration tests to ensure vector similarity is working as expected.
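As a starting point, a unit test for the similarity math might look like the sketch below (assuming a Vitest/Jest-style runner and a `cosineSimilarity` helper such as the one sketched earlier in this thread; the file path and function name are hypothetical, and this does not exercise Orama itself):

```typescript
import { describe, expect, it } from "vitest";

// Assumed helper (hypothetical module), e.g. the cosineSimilarity sketch above.
import { cosineSimilarity } from "./similarity";

describe("cosineSimilarity", () => {
  it("returns 1 for identical vectors", () => {
    expect(cosineSimilarity([1, 2, 3], [1, 2, 3])).toBeCloseTo(1);
  });

  it("returns 0 for orthogonal vectors", () => {
    expect(cosineSimilarity([1, 0], [0, 1])).toBeCloseTo(0);
  });

  it("stays within [-1, 1] for arbitrary vectors", () => {
    const score = cosineSimilarity([0.1, 0.9, 0.3], [0.4, 0.2, 0.8]);
    expect(score).toBeLessThanOrEqual(1);
    expect(score).toBeGreaterThanOrEqual(-1);
  });
});
```

An integration test would additionally index a few fixture notes (including non-English ones) and assert that the top results for a query note are the topically related fixtures.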
