
[Question]: Question related to Output of retriever.retrieve() #16608

Open
1 task done
Kai-Chen00 opened this issue Oct 19, 2024 · 13 comments
Labels
question Further information is requested

Comments

@Kai-Chen00

Question Validation

  • I have searched both the documentation and discord for an answer.

Question

vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(entity_doc, storage_context=storage_context, embed_model=embed_model, show_progress=True)
retriever = index.as_retriever(similarity_top_k=5)
nodes = retriever.retrieve("There are Effective ways of communicating")

When I print out nodes[0], I see that the key "embedding" is None. Why? Shouldn't all chunks in the database already be vectorized? I expected the value of "embedding" to be a vector.

@Kai-Chen00 Kai-Chen00 added the question Further information is requested label Oct 19, 2024

dosubot bot commented Oct 19, 2024

Hey @Kai-Chen00! How can I assist you today? I'm here to help you with any bugs, questions, or contributions you have in mind. Let's dive into the issue and see what we can uncover!

I found a similar unsolved discussion titled Embedding is None and a closed issue titled [Question]: Retriever documents, embeddings = None. The suggested solution for the closed issue is to use index.vector_store.get(node.node_id) to get the associated embedding when using the default vector store [1][2].

It seems that the issue might be related to how the vector store is populated. Double-check that the fields LlamaIndex expects, such as text, are correctly populated in the vector store [3].
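For reference, a minimal sketch of both lookups, using the nodes and chroma_collection from this thread. Note that Chroma excludes embeddings from get() results unless they are requested explicitly:

node_id = nodes[0].node.node_id

# With the default (simple) vector store, the embedding can be fetched by node ID:
# embedding = index.vector_store.get(node_id)

# With ChromaDB, request embeddings explicitly; node IDs double as Chroma document IDs:
result = chroma_collection.get(ids=[node_id], include=["embeddings"])
print(result["embeddings"][0][:5])  # first few dimensions of the stored vector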

To continue talking to Dosu, mention @dosu.

@Kai-Chen00
Author

@dosu, the database I'm using is ChromaDB. Any advice on how to check the embeddings of the chunks there?


dosubot bot commented Oct 19, 2024

To check the embeddings of chunks in ChromaDB when using LlamaIndex, you can verify if the embeddings are correctly stored by using a test function like test_add_to_chromadb_and_query. This function adds embeddings to ChromaDB and queries the database to ensure that the embeddings are stored and retrievable. Here's a code snippet that demonstrates this:

from typing import List

import pytest
from llama_index.core.schema import TextNode
from llama_index.core.vector_stores.types import VectorStoreQuery
from llama_index.vector_stores.chroma import ChromaVectorStore

# chromadb_not_available is a module-level flag defined in the original test file
@pytest.mark.skipif(chromadb_not_available, reason="chromadb is not available")
@pytest.mark.asyncio()
@pytest.mark.parametrize("use_async", [True, False])
async def test_add_to_chromadb_and_query(
    vector_store: ChromaVectorStore,
    node_embeddings: List[TextNode],
    use_async: bool,
) -> None:
    if use_async:
        await vector_store.async_add(node_embeddings)
        res = await vector_store.aquery(
            VectorStoreQuery(query_embedding=[1.0, 0.0, 0.0], similarity_top_k=1)
        )
    else:
        vector_store.add(node_embeddings)
        res = vector_store.query(
            VectorStoreQuery(query_embedding=[1.0, 0.0, 0.0], similarity_top_k=1)
        )
    assert res.nodes
    assert res.nodes[0].get_content() == "lorem ipsum"

This test checks if the embeddings are correctly added and can be queried by asserting that the result nodes are not empty and that the content of the first node matches the expected text [1].
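For a quicker sanity check against the collection in this thread (rather than the library's test suite), Chroma's count() and peek() can confirm that vectors were actually written. A sketch assuming the chroma_collection created earlier:

print(chroma_collection.count())          # number of stored chunks
sample = chroma_collection.peek(limit=3)  # returns ids, documents, metadatas, embeddings
for vec in sample["embeddings"]:
    print(len(vec), vec[:3])              # dimensionality and first few components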

To continue talking to Dosu, mention @dosu.

@Kai-Chen00
Author

@dosu, what about the meaning of the key "score" for a NodeWithScore object?


dosubot bot commented Oct 19, 2024

The "score" key in a NodeWithScore object represents the similarity score between the query and the node. This score is typically calculated based on the similarity of the embeddings, such as cosine similarity. A higher score indicates a closer match to the query.

To continue talking to Dosu, mention @dosu.

@Kai-Chen00
Author

@dosu, how do I set "score"? I've set cosine distance as the metric to retrieve the top-k relevant chunks, so would the score here be the cosine distance?


dosubot bot commented Oct 19, 2024

To set the "score" for a NodeWithScore object when using cosine distance as the metric, you should transform the cosine distance into a similarity score by subtracting the distance from 1. This is because cosine similarity is defined as 1 minus the cosine distance. Here's how you can do it:

NodeWithScore(
    node=TextNode(text=ret["text"], extra_info=ret["metadata"]),
    # Transform cosine distance into a similarity score (higher is more similar)
    score=1 - ret["dist"],
)

This transformation ensures that a higher score indicates a closer match to the query [1].
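As a purely illustrative worked example of that transformation (the numbers are made up): cosine distance ranges from 0 to 2, so the resulting score falls between -1 and 1:

distances = [0.1, 0.7, 1.2]           # cosine distances returned by the store
scores = [1 - d for d in distances]   # [0.9, 0.3, -0.2]; higher means more similar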

To continue talking to Dosu, mention @dosu.

@Kai-Chen00
Author

@dosu, no, I mean: how do I find out what "score" means in my context? I'm not sure whether the metric I'm using is cosine similarity.


dosubot bot commented Oct 19, 2024

The "score" in a NodeWithScore object is determined using cosine similarity. This is evident from the _calculate_sim method in the ColbertRerank class, where the similarity matrix is calculated using torch.nn.functional.cosine_similarity. The scores are derived by taking the maximum similarity for each query token across all document tokens and averaging these maximum scores. Therefore, the "score" represents the average of the maximum cosine similarities between the query and document embeddings [1].

To continue talking to Dosu, mention @dosu.

@Kai-Chen00
Author

@dosu,

chroma_client = chromadb.PersistentClient()
try:
    # Euclidean distance is "l2"; cosine distance is "cosine"
    # chroma_collection = chroma_client.create_collection("entity_collection", metadata={"hnsw:space": "cosine"}, embedding_function=em_fn)
    chroma_collection = chroma_client.create_collection("entity_collection", metadata={"hnsw:space": "cosine"})
except Exception as e:
    if "already exists" in str(e):
        # chroma_collection = chroma_client.get_collection("entity_collection", embedding_function=em_fn)
        chroma_collection = chroma_client.get_collection("entity_collection")
    else:
        raise e

vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(entity_doc, storage_context=storage_context, embed_model=embed_model, show_progress=True)
retriever = index.as_retriever(similarity_top_k=5)
nodes = retriever.retrieve("There are Effective ways of communicating")
nodes

Given this code, what would the key "score" mean for nodes[0]?


dosubot bot commented Oct 19, 2024

In your code, the "score" key for nodes[0] represents the similarity score, which is calculated by transforming the cosine distance into a similarity score. Specifically, the score is computed as 1 - ret["dist"], where ret["dist"] is the cosine distance. A higher score indicates greater similarity [1].
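One way to verify this directly is to recompute the cosine similarity by hand and compare it with the reported score. A sketch assuming the embed_model, chroma_collection, and nodes from this thread, and assuming the integration converts distance to score as 1 - distance:

import numpy as np

q = np.array(embed_model.get_query_embedding("There are Effective ways of communicating"))
stored = chroma_collection.get(ids=[nodes[0].node.node_id], include=["embeddings"])
v = np.array(stored["embeddings"][0])

cos_sim = float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
print(cos_sim, nodes[0].score)  # should roughly agree if score = 1 - cosine distance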

To continue talking to Dosu, mention @dosu.

@Kai-Chen00
Author

@dosu, what if I want to use Euclidean distance as the metric for retrieval? How can I modify the code?


dosubot bot commented Oct 19, 2024

To calculate Euclidean distance as the metric for retrieval, you need to modify the distance_strategy parameter when creating the vector store. Here's how you can do it:

vector_store_euclidean = OraLlamaVS.from_documents(
    text_nodes,
    table_name="Documents_EUCLIDEAN",
    client=connection,
    distance_strategy=DistanceStrategy.EUCLIDEAN_DISTANCE,
)

This code snippet sets the distance metric to Euclidean distance for the vector store. If you are using a different vector store, make sure to replace DistanceStrategy.COSINE with DistanceStrategy.EUCLIDEAN_DISTANCE in your setup [1].
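For context, OraLlamaVS above is the Oracle vector store. With the ChromaDB setup used earlier in this thread, the metric is instead fixed when the collection is created, via the hnsw:space metadata. A minimal sketch (the collection name is hypothetical, and note that Chroma's "l2" is squared Euclidean distance, so an existing collection must be re-created and re-indexed to change metrics):

euclidean_collection = chroma_client.create_collection(
    "entity_collection_l2",         # hypothetical name for the new collection
    metadata={"hnsw:space": "l2"},  # "l2" = squared Euclidean; Chroma's default
)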

To continue talking to Dosu, mention @dosu.
