[Question]: GPU low usage #16573
The low GPU usage you're experiencing with the model `HuggingFaceH4/zephyr-7b-beta` could be due to the quantization settings in your configuration. The model is likely being loaded in 4-bit precision, which reduces the model size and computational requirements, leading to lower GPU usage. This is controlled by the `BitsAndBytesConfig` passed when the model is loaded.
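As an illustration (a hypothetical sketch, not taken from this thread), this is how 4-bit loading is controlled explicitly when the model is created through `HuggingFaceLLM`; whether your setup applies quantization by default depends on how the LLM is constructed:

```python
# Hypothetical sketch: explicitly controlling 4-bit quantization.
# Assumption: the model is loaded via llama_index.llms.huggingface.
from transformers import BitsAndBytesConfig
from llama_index.llms.huggingface import HuggingFaceLLM

# load_in_4bit=True shrinks memory use; omit the config entirely
# (or set load_in_4bit=False) to load in full precision.
quant_config = BitsAndBytesConfig(load_in_4bit=True)

llm = HuggingFaceLLM(
    model_name="HuggingFaceH4/zephyr-7b-beta",
    model_kwargs={"quantization_config": quant_config},
)
```

Printing `llm._model.config.quantization_config` (or checking `nvidia-smi` while the model is resident) is one way to confirm which precision actually got loaded.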
@dosu I hadn't set the `BitsAndBytesConfig` anywhere, but perhaps it is quantized. Is there a suggested way to run multiple queries at once to fully utilize the GPU?
To run multiple queries at once and optimize GPU usage in the LlamaIndex framework, you can use the `batched` utility function:

```python
from typing import Generator, List

from llama_index.core.schema import NodeWithScore

def batched(ls: list, size: int) -> Generator[List[NodeWithScore], None, None]:
    for i in range(0, len(ls), size):
        yield ls[i : i + size]
```

This function divides a list into smaller batches of a specified size. You can use it to batch your queries and process them together, which can lead to better GPU utilization by handling multiple queries simultaneously [1].
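Leaving LlamaIndex types aside, the batching helper above works on any plain list; a minimal self-contained check:

```python
from typing import Generator

def batched(ls: list, size: int) -> Generator[list, None, None]:
    # Yield consecutive slices of at most `size` items each.
    for i in range(0, len(ls), size):
        yield ls[i : i + size]

chunks = list(batched([1, 2, 3, 4, 5], 2))
print(chunks)  # [[1, 2], [3, 4], [5]]
```

Note the final batch may be shorter than `size`; downstream code should not assume uniform batch lengths.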
@dosu I don't want to rerank. I want to run a batched RetrieverQueryEngine.
To run batched queries with the `RetrieverQueryEngine`, you can execute them concurrently with `asyncio`. Here's a basic example of how you might set it up:

```python
import asyncio

from llama_index.core.query_engine import RetrieverQueryEngine

# Assume retriever is already defined
query_engine = RetrieverQueryEngine(retriever)

async def process_queries(queries):
    # aquery runs the full retrieve + synthesize pipeline asynchronously;
    # gather schedules all queries concurrently.
    tasks = [query_engine.aquery(query) for query in queries]
    results = await asyncio.gather(*tasks)
    return results

queries = ["query1", "query2", "query3"]  # Example queries
results = asyncio.run(process_queries(queries))
```

This example demonstrates how to run multiple queries asynchronously, which can help in utilizing the GPU more effectively by processing several queries at once.
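To see the concurrency pattern in isolation, here is a self-contained sketch where a hypothetical `fake_query` coroutine stands in for the engine's async query call:

```python
import asyncio

async def fake_query(q: str) -> str:
    # Stand-in for the engine's async call: simulate I/O-bound latency.
    await asyncio.sleep(0.01)
    return f"answer to {q}"

async def process_queries(queries):
    # All coroutines are awaited together, so total wall time is roughly
    # one query's latency rather than the sum over all queries.
    return await asyncio.gather(*(fake_query(q) for q in queries))

answers = asyncio.run(process_queries(["query1", "query2"]))
print(answers)  # ['answer to query1', 'answer to query2']
```

Note that `asyncio.gather` only overlaps the queries; whether the GPU actually batches them depends on the serving backend, so frameworks with continuous batching (e.g. vLLM or TGI) benefit most from this pattern.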
Question Validation
Question
GPU usage is quite low, and I'm wondering if there's something I'm missing.
Memory usage is around 1602MiB, while there is plenty of space available (1741MiB / 12288MiB).