[Bug]: TypeError: ColbertIndex._build_index_from_nodes() got an unexpected keyword argument 'index_name' #14398

Closed
gavinHuang opened this issue Jun 26, 2024 · 3 comments · Fixed by #14522
Labels
bug Something isn't working P2

Comments

@gavinHuang

gavinHuang commented Jun 26, 2024

Bug Description

Creating a ColbertIndex fails with either of the following:

  • colBertIndex = ColbertIndex(nodes=nodes)
  • ColbertIndex.from_documents(documents=docs)

Both raise the same error:

ColbertIndex._build_index_from_nodes() got an unexpected keyword argument 'index_name'

I checked BaseIndex in llama-index-core/llama_index/core/indices/base.py.
Neither its __init__ function nor build_index_from_nodes has taken index_name as a parameter at any point in the file's history.
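
The signature check described above can also be done programmatically. The following is a minimal sketch using only the standard library's inspect module; accepts_keyword, build, and build_flexible are hypothetical names made up for illustration and are not part of llama-index.

```python
import inspect
from typing import Any, Callable, List, Sequence


def accepts_keyword(func: Callable[..., Any], name: str) -> bool:
    """Return True if `func` can be called with `name=...` as a keyword."""
    params = inspect.signature(func).parameters
    named = params.get(name)
    if named is not None and named.kind in (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    ):
        return True
    # A **kwargs catch-all accepts any keyword argument.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Hypothetical stand-ins for a strict and a tolerant build hook.
def build(nodes: Sequence[str]) -> List[str]:
    return list(nodes)


def build_flexible(nodes: Sequence[str], **kwargs: Any) -> List[str]:
    return list(nodes)
```

With the package installed, the same helper could presumably be pointed at ColbertIndex._build_index_from_nodes to confirm whether it accepts index_name in a given release; that call is not exercised here.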

Version

0.10.30

Steps to Reproduce

  1. Install the following packages (core was 0.10.30; I then updated to the latest release, 0.10.50 at the time of filing):
llama-index
llama-index-embeddings-azure-openai
llama-index-llms-azure-openai
llama-index-embeddings-huggingface
llama-index-embeddings-instructor
llama-index-readers-file
llama-index-indices-managed-colbert
colbert-ir
  2. Install the extra dependency via the conda pytorch channel:
    pytorch-cpu=1.13

  3. Run the following code:

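import pathlib

import pandas as pd
from llama_index.core import Document
from llama_index.indices.managed.colbert import ColbertIndex

# Load the product data and drop duplicate rows
df = pd.read_csv(pathlib.Path("./data/products.csv"), sep=",").drop_duplicates()
rows = df.to_dict("records")

# Build one Document per product row
nodes = []
for row in rows:
    doc = Document(
        text=row["DISPLAYNAME"],
        metadata={"ITEM_NUMBER": row["ITEM_NUMBER"], "DESCRIPTION": row["DESCRIPTION"]},
    )
    nodes.append(doc)

# This call raises the TypeError described in this report
colBertIndex = ColbertIndex.from_documents(documents=nodes)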

Relevant Logs/Tracebacks

File "/mnt/c/Users/user_name/projects/GenAI/env_name/retriever.py", line 110, in <module>
    load_products()
  File "/mnt/c/Users/user_name/projects/GenAI/env_name/retriever.py", line 54, in load_products
    colBertIndex = ColbertIndex.from_documents(documents=nodes)
  File "/home/user_name/miniconda3/envs/env_name/lib/python3.10/site-packages/llama_index/core/indices/base.py", line 145, in from_documents
    return cls(
  File "/home/user_name/miniconda3/envs/env_name/lib/python3.10/site-packages/llama_index/indices/managed/colbert/base.py", line 82, in __init__
    super().__init__(
  File "/home/user_name/miniconda3/envs/env_name/lib/python3.10/site-packages/llama_index/core/indices/base.py", line 94, in __init__
    index_struct = self.build_index_from_nodes(
  File "/home/user_name/miniconda3/envs/env_name/lib/python3.10/site-packages/llama_index/core/indices/base.py", line 216, in build_index_from_nodes
    return self._build_index_from_nodes(nodes, **build_kwargs)
TypeError: ColbertIndex._build_index_from_nodes() got an unexpected keyword argument 'index_name'

@gavinHuang added the labels bug (Something isn't working) and triage (Issue needs to be triaged/prioritized) on Jun 26, 2024

dosubot bot commented Jun 26, 2024

@gavinHuang Hello there! I'm here to assist you with any bugs, questions, or contributions. Let's work together to solve this issue.

The TypeError you're encountering is due to the index_name parameter being passed to the _build_index_from_nodes method, which does not accept it as an argument. This issue arises because the ColbertIndex class constructor includes index_name as a parameter, but the _build_index_from_nodes method does not.

To resolve this issue, you need to ensure that the index_name parameter is not passed to the _build_index_from_nodes method. Here is a modified version of your code that should work:

import pandas as pd
from llama_index.core.schema import TextNode
from llama_index.indices.managed.colbert import ColbertIndex

# Load data
df = pd.read_csv("./data/products.csv", sep=",").drop_duplicates()
rows = df.to_dict("records")

# Convert rows to TextNode objects (BaseNode is abstract and cannot be instantiated)
nodes = []
for row in rows:
    node = TextNode(
        id_=str(row["ITEM_NUMBER"]),  # node IDs must be strings
        text=row["DISPLAYNAME"],
        metadata={"DESCRIPTION": row["DESCRIPTION"]},
    )
    nodes.append(node)

# Create ColbertIndex using nodes
colBertIndex = ColbertIndex(nodes=nodes)

Ensure that you have the necessary dependencies installed, including the colbert-ir library. If you need to convert documents to BaseNode objects, you can use the following approach:

from llama_index.core.schema import TextNode
from llama_index.indices.managed.colbert import ColbertIndex

# Assuming `docs` is a list of document contents
docs = ["Document content 1", "Document content 2"]

# Convert documents to TextNode objects (a concrete BaseNode subclass)
nodes = [TextNode(id_=str(i), text=doc) for i, doc in enumerate(docs)]

# Create ColbertIndex using nodes
colBertIndex = ColbertIndex(nodes=nodes)

This should help you avoid the TypeError by ensuring that the index_name parameter is not incorrectly passed to the _build_index_from_nodes method [1][2].

To continue talking to Dosu, mention @dosu.

@gavinHuang
Author

I understand this is a bot-generated post, but it misses the point.
No one is passing the index_name parameter except the ColbertIndex class itself, which started doing so in PR #9656 (commit af2910e).
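
To make the mechanism concrete, here is a minimal, self-contained sketch of the call chain shown in the traceback, using toy classes instead of the real library. ToyBaseIndex, ToyColbertIndex, and reproduce are hypothetical names; the sketch only assumes what the traceback shows, namely that the base __init__ forwards its extra keyword arguments to build_index_from_nodes, which forwards them to _build_index_from_nodes.

```python
from typing import Any, List, Optional, Sequence


class ToyBaseIndex:
    """Stand-in for BaseIndex: extra kwargs are forwarded to the build hook."""

    def __init__(self, nodes: Optional[Sequence[str]] = None, **kwargs: Any) -> None:
        self.index_struct = self.build_index_from_nodes(nodes or [], **kwargs)

    def build_index_from_nodes(
        self, nodes: Sequence[str], **build_kwargs: Any
    ) -> List[str]:
        return self._build_index_from_nodes(nodes, **build_kwargs)

    def _build_index_from_nodes(
        self, nodes: Sequence[str], **build_kwargs: Any
    ) -> List[str]:
        raise NotImplementedError


class ToyColbertIndex(ToyBaseIndex):
    """Stand-in for ColbertIndex: forwards index_name to the base class."""

    def __init__(
        self, nodes: Optional[Sequence[str]] = None, index_name: str = "", **kwargs: Any
    ) -> None:
        self.index_name = index_name
        # index_name lands in the base class's **kwargs and travels onward.
        super().__init__(nodes=nodes, index_name=index_name, **kwargs)

    # The override takes only `nodes`, so the forwarded keyword is rejected.
    def _build_index_from_nodes(self, nodes: Sequence[str]) -> List[str]:
        return list(nodes)


def reproduce() -> str:
    """Return the error message raised when the caller passes no index_name."""
    try:
        ToyColbertIndex(nodes=["a", "b"])
    except TypeError as exc:
        return str(exc)
    return "no error"
```

The caller never mentions index_name, yet the constructor's default value is still forwarded, which matches the behaviour reported in this issue.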


dosubot bot commented Jun 26, 2024

The TypeError you're encountering is due to the index_name parameter being passed internally by the ColbertIndex class itself, not by your code. This issue was introduced in a recent commit and pull request.

To resolve this, you need to ensure that the index_name parameter is not passed to the _build_index_from_nodes method. Here is the relevant part of the ColbertIndex class:

def _build_index_from_nodes(self, nodes: Sequence[BaseNode]) -> IndexDict:
    """Generate a PLAID index from the ColBERT checkpoint via its hugging face model_name."""
    from colbert import Indexer, Searcher
    from colbert.infra import ColBERTConfig, Run, RunConfig

    index_struct = IndexDict()

    docs_list = []
    for i, node in enumerate(nodes):
        docs_list.append(node.get_content())
        self._docs_pos_to_node_id[i] = node.node_id
        index_struct.add_node(node, text_id=str(i))

    with Run().context(
        RunConfig(index_root=self.index_path, nranks=self.ranks, gpus=self.gpus)
    ):
        config = ColBERTConfig(
            doc_maxlen=self.doc_maxlen,
            query_maxlen=self.query_maxlen,
            nbits=self.nbits,
            kmeans_niters=self.kmeans_niters,
        )
        indexer = Indexer(checkpoint=self.model_name, config=config)
        indexer.index(name=self.index_name, collection=docs_list, overwrite=True)
        self.store = Searcher(
            index=self.index_name, collection=docs_list, checkpoint=self.model_name
        )
    return index_struct

The _build_index_from_nodes method only accepts nodes as a parameter. Ensure that when you call this method, you do not pass index_name or any other unexpected keyword arguments.

Additionally, when initializing the ColbertIndex, you can set the index_name as an attribute of the class, which will be used internally by the _build_index_from_nodes method:

class ColbertIndex(BaseIndex[IndexDict]):
    def __init__(
        self,
        nodes: Optional[Sequence[BaseNode]] = None,
        objects: Optional[Sequence[IndexNode]] = None,
        index_struct: Optional[IndexDict] = None,
        storage_context: Optional[StorageContext] = None,
        model_name: str = "colbert-ir/colbertv2.0",
        index_name: str = "",
        show_progress: bool = False,
        nbits: int = 2,
        gpus: int = 0,
        ranks: int = 1,
        doc_maxlen: int = 120,
        query_maxlen: int = 60,
        kmeans_niters: int = 4,
        **kwargs: Any,
    ) -> None:
        self.model_name = model_name
        self.index_path = "storage/colbert_index"
        self.index_name = index_name
        self.nbits = nbits
        self.gpus = gpus
        self.ranks = ranks
        self.doc_maxlen = doc_maxlen
        self.query_maxlen = query_maxlen
        self.kmeans_niters = kmeans_niters
        self._docs_pos_to_node_id: Dict[int, str] = {}
        super().__init__(
            nodes=nodes,
            index_struct=index_struct,
            index_name=index_name,
            storage_context=storage_context,
            show_progress=show_progress,
            objects=objects,
            **kwargs,
        )

By ensuring that index_name is set during initialization and not passed to _build_index_from_nodes, you can avoid the TypeError [1][2].
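
As a rough illustration of that idea, the sketch below keeps index_name as an attribute and stops forwarding it to the base class. It uses toy stand-in classes (ToyBaseIndex and FixedToyColbertIndex are hypothetical names), and it is only one plausible shape for a fix, not necessarily what #14522 changed.

```python
from typing import Any, List, Optional, Sequence


class ToyBaseIndex:
    """Stand-in for BaseIndex: extra kwargs are forwarded to the build hook."""

    def __init__(self, nodes: Optional[Sequence[str]] = None, **kwargs: Any) -> None:
        self.index_struct = self.build_index_from_nodes(nodes or [], **kwargs)

    def build_index_from_nodes(
        self, nodes: Sequence[str], **build_kwargs: Any
    ) -> List[str]:
        return self._build_index_from_nodes(nodes, **build_kwargs)

    def _build_index_from_nodes(
        self, nodes: Sequence[str], **build_kwargs: Any
    ) -> List[str]:
        raise NotImplementedError


class FixedToyColbertIndex(ToyBaseIndex):
    """index_name is stored on the instance and not passed up the chain."""

    def __init__(
        self, nodes: Optional[Sequence[str]] = None, index_name: str = "", **kwargs: Any
    ) -> None:
        # Set the attribute before the base __init__ triggers the build.
        self.index_name = index_name
        super().__init__(nodes=nodes, **kwargs)  # index_name intentionally omitted

    def _build_index_from_nodes(self, nodes: Sequence[str]) -> List[str]:
        # The build hook reads self.index_name instead of receiving it as a kwarg.
        return [f"{self.index_name}:{node}" for node in nodes]


index = FixedToyColbertIndex(nodes=["a", "b"], index_name="products")
```

An alternative with the same effect would be for the override to accept and ignore **kwargs; which of the two is preferable depends on whether other build-time keywords are expected to reach the hook.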
