

Commit "CUDA: Quantized matrix matrix multiplication" causes assert "ggml-cuda.cu:4749: i01_high == rows_per_iter || g_device_count > 1" on Windows when vocab_size != 32000 #2484

Closed
dranger003 opened this issue Aug 1, 2023 · 4 comments



dranger003 commented Aug 1, 2023

Using CUDA on Windows with a model whose vocab_size != 32000, inference crashes immediately with:

ggml-cuda.cu:4749: i01_high == rows_per_iter || g_device_count > 1

See #2160 (comment) for more details.
Reverting to the commit before 11f3ca0 resolves the issue.
The workaround proposed in #2160 (comment) also appears to work (at least for me).


mirek190 commented Aug 2, 2023

I'm hitting the same problem.

My arguments (the model is a LLaMA 2 13B variant):

main --model models\new2\newhope.ggmlv3.q4_K_M.bin --mlock --color --threads 30 --keep -1 --batch_size 512 --n_predict -1 --top_k 10000 --top_p 0.9 --temp 0.96 --repeat_penalty 1.1 --ctx_size 4096 --interactive --instruct --reverse-prompt "### Human:" --reverse-prompt "### User:" --reverse-prompt "### Assistant:" -ngl 43

ggml-cuda.cu:4749: i01_high == rows_per_iter || g_device_count > 1
PS E:\LLAMA\llama.cpp>

Without the -ngl parameter it works.

dranger003 commented

It appears PR #2480 solves this issue.


mirek190 commented Aug 2, 2023

Still not merged...

dranger003 commented

Confirmed latest commit 4f6b60c resolves the issue on my end.


3 participants