Fix GPU Layer Limitation in llamafile #534
Merged
#533
In the current implementation, the line

`n_gpu_layers = std::min(n_gpu_layers, (int)hparams.n_layer);`

caps `n_gpu_layers` at `hparams.n_layer`. However, in the llama.cpp project, within the `static void llm_load_hparams` function, `hparams.n_layer` is derived from `ml.get_key(LLM_KV_BLOCK_COUNT, hparams.n_layer);`, which only counts the layers that carry key-value (KV) attention and does not include other offloadable layers, such as the output layer. This cap can lead to performance issues, observable as lower token generation speed and reduced GPU utilization.
By either commenting out this line or relaxing the cap to `hparams.n_layer + 10`, the issue can be mitigated, ensuring all necessary layers are offloaded to the GPU and improving overall performance.
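To make the effect of the cap concrete, here is a minimal, self-contained sketch (not the llamafile source itself). The layer count and the requested `-ngl` value are hypothetical, chosen only to show how the current clamp silently drops the extra layers and how the relaxed clamp lets them through:

```cpp
// Sketch of the clamping behaviour described above (hypothetical values,
// not taken from an actual model).
#include <algorithm>
#include <cstdio>

int main() {
    // Suppose the GGUF metadata reports 32 transformer blocks
    // (LLM_KV_BLOCK_COUNT -> hparams.n_layer), and the user requests
    // e.g. -ngl 35 hoping to also offload the output layer.
    const int n_layer     = 32;  // stands in for hparams.n_layer
    const int n_requested = 35;  // stands in for the user-supplied n_gpu_layers

    // Current behaviour: the request is capped at hparams.n_layer,
    // so layers beyond the attention blocks never reach the GPU.
    const int n_gpu_layers_clamped = std::min(n_requested, n_layer);

    // Mitigation described in this PR: add headroom above n_layer
    // (or remove the clamp entirely) so the remaining layers can be offloaded.
    const int n_gpu_layers_relaxed = std::min(n_requested, n_layer + 10);

    std::printf("requested: %d, clamped: %d, relaxed: %d\n",
                n_requested, n_gpu_layers_clamped, n_gpu_layers_relaxed);
    return 0;
}
```

With these example numbers the clamped value is 32 while the relaxed value is 35, which is the difference the PR aims to eliminate.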