
[Question]: GPU low usage #16573

Open
1 task done
AlbertoMQ opened this issue Oct 16, 2024 · 5 comments
Labels
question Further information is requested

Comments

@AlbertoMQ

Question Validation

  • I have searched both the documentation and discord for an answer.

Question

GPU usage is quite low. Wondering if there's something I'm missing.

GPU memory usage is around 1602 MiB, while there is plenty of space available: 1741 MiB / 12288 MiB.

from llama_index.llms.huggingface import HuggingFaceLLM

llm = HuggingFaceLLM(
    # model_name="TheBloke/zephyr-7b-beta",
    # tokenizer_name="TheBloke/zephyr-7b-beta",
    model_name="HuggingFaceH4/zephyr-7b-beta",
    tokenizer_name="HuggingFaceH4/zephyr-7b-beta",
    context_window=1028,
    max_new_tokens=256,
    generate_kwargs={"top_k": 10, "do_sample": True},
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    device_map="auto",
)
@AlbertoMQ AlbertoMQ added the question Further information is requested label Oct 16, 2024

dosubot bot commented Oct 16, 2024

The low GPU usage you're experiencing with the model "HuggingFaceH4/zephyr-7b-beta" could be due to the quantization settings in your configuration. The model is likely being loaded in 4-bit precision, which reduces the model size and computational requirements, leading to lower GPU usage. This is controlled by the BitsAndBytesConfig with load_in_4bit=True. If the model fails to load with quantization due to missing CUDA, it defaults to a full precision model, which might not fully utilize the GPU if the system's resources are not optimal for such a configuration [1].
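If you want to control quantization explicitly rather than rely on defaults, one approach is to pass a quantization config through to the underlying model loader. This is a sketch, assuming the `transformers` `BitsAndBytesConfig` API and that `HuggingFaceLLM` forwards `model_kwargs` to `AutoModelForCausalLM.from_pretrained`; it also requires a CUDA-enabled build of `bitsandbytes`:

```python
import torch
from transformers import BitsAndBytesConfig
from llama_index.llms.huggingface import HuggingFaceLLM

# Explicit 4-bit quantization; forwarded via model_kwargs to the
# underlying from_pretrained call.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

llm = HuggingFaceLLM(
    model_name="HuggingFaceH4/zephyr-7b-beta",
    tokenizer_name="HuggingFaceH4/zephyr-7b-beta",
    model_kwargs={"quantization_config": quant_config},
    device_map="auto",
)
```

Dropping `quantization_config` (or setting `load_in_4bit=False`) would load the model in full precision, which should show noticeably higher memory usage in `nvidia-smi`.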

To continue talking to Dosu, mention @dosu.

@AlbertoMQ
Author

@dosu I hadn't set the BitsAndBytesConfig anywhere, but perhaps it is quantized. Is there a suggested way to run multiple queries at once to fully utilize the GPU?


dosubot bot commented Oct 16, 2024

To run multiple queries at once and optimize GPU usage in the LlamaIndex framework, you can use the batched function from the NVIDIARerank class. This function allows you to process multiple queries in batches, which can help in efficiently utilizing the GPU. Here's a brief overview of how it works:

from typing import Generator, List

from llama_index.core.schema import NodeWithScore

def batched(ls: list, size: int) -> Generator[List[NodeWithScore], None, None]:
    # Yield successive slices of `ls`, each at most `size` items long.
    for i in range(0, len(ls), size):
        yield ls[i : i + size]

This function divides a list into smaller batches of a specified size. You can use it to batch your queries and process them together, which can lead to better GPU utilization by handling multiple queries simultaneously [1].
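The slicing behavior can be sanity-checked without LlamaIndex at all; here plain strings stand in for `NodeWithScore` objects (a self-contained sketch, not tied to the library):

```python
from typing import Generator

def batched(ls: list, size: int) -> Generator[list, None, None]:
    # Yield successive slices of `ls`, each at most `size` items long.
    for i in range(0, len(ls), size):
        yield ls[i : i + size]

# Plain strings stand in for NodeWithScore objects here.
queries = ["q1", "q2", "q3", "q4", "q5"]
batches = list(batched(queries, 2))
print(batches)  # [['q1', 'q2'], ['q3', 'q4'], ['q5']]
```

Note the final batch may be shorter than `size` when the list length is not an exact multiple.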


@AlbertoMQ
Author

@dosu I don't want to rerank. I want to run batched RetrieverQueryEngine


dosubot bot commented Oct 16, 2024

To run batched queries with the RetrieverQueryEngine in the LlamaIndex framework, you can utilize the asynchronous capabilities of the engine. The RetrieverQueryEngine supports parallel processing of queries through its asynchronous methods, such as aretrieve. This allows you to handle multiple queries in parallel, which can help in optimizing GPU usage.

Here's a basic example of how you might set up and use the RetrieverQueryEngine with asynchronous queries:

import asyncio

from llama_index.core import QueryBundle
from llama_index.core.query_engine import RetrieverQueryEngine

# Assume `retriever` is already defined.
query_engine = RetrieverQueryEngine(retriever)

async def process_queries(queries):
    # aretrieve expects a QueryBundle, so wrap each query string.
    tasks = [query_engine.aretrieve(QueryBundle(query)) for query in queries]
    results = await asyncio.gather(*tasks)
    return results

queries = ["query1", "query2", "query3"]  # Example queries
results = asyncio.run(process_queries(queries))

This example demonstrates how to run multiple queries asynchronously, which can help in utilizing the GPU more effectively by processing several queries at once.
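The fan-out pattern itself can be verified with nothing but the standard library. In this sketch a dummy coroutine, `fake_retrieve` (purely illustrative, not a LlamaIndex API), stands in for `aretrieve`, and a semaphore bounds how many queries run concurrently:

```python
import asyncio

# Dummy stand-in for query_engine.aretrieve: echoes the query
# after a short simulated delay.
async def fake_retrieve(query: str) -> str:
    await asyncio.sleep(0.01)
    return f"result for {query}"

async def process_queries(queries, max_concurrency: int = 2):
    # The semaphore caps in-flight retrievals, so a large query list
    # cannot overwhelm the GPU or the event loop.
    sem = asyncio.Semaphore(max_concurrency)

    async def run_one(q):
        async with sem:
            return await fake_retrieve(q)

    # gather preserves input order in its results.
    return await asyncio.gather(*(run_one(q) for q in queries))

results = asyncio.run(process_queries(["q1", "q2", "q3"]))
print(results)  # ['result for q1', 'result for q2', 'result for q3']
```

With the real engine, `fake_retrieve` would be replaced by the `aretrieve` call; the semaphore is a common way to keep batch sizes bounded when many queries arrive at once.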

