Output is garbage from the INT4 model on a Mac M1 Max #15

Closed
satyajitghana opened this issue Mar 11, 2023 · 3 comments
Labels: build (Compilation issues), model (Model specific)

Comments

@satyajitghana

I'm not sure if the tokenizer is to blame or something else. I quantized the 7B model and ran it on my Mac, and the output for any prompt is just garbage.

❯ ./main -m ggml-model-q4_0.bin -t 10 -p "Building a website can be done in 10 simple steps:" -n 512
main: seed = 1678546145
llama_model_load: loading model from 'ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size =   512.00 MB, n_mem = 16384
llama_model_load: loading model part 1/1 from 'ggml-model-q4_0.bin'
llama_model_load: .................................... done
llama_model_load: model size =  4017.27 MB / num tensors = 291

main: prompt: 'Building a website can be done in 10 simple steps:'
main: number of tokens in prompt = 15
     1 -> ''
  8893 -> 'Build'
   292 -> 'ing'
   263 -> ' a'
  4700 -> ' website'
   508 -> ' can'
   367 -> ' be'
  2309 -> ' done'
   297 -> ' in'
 29871 -> ' '
 29896 -> '1'
 29900 -> '0'
  2560 -> ' simple'
  6576 -> ' steps'
 29901 -> ':'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000


Building a website can be done in 10 simple steps:tegr extremely“œurconnectommensrc périалheader ferm cas inde_" ENDeperCONT knowing Hud Source Dopo UPDATE sig Mobileclerût clean constraintsügel DrathelessOff intituléельm складу oltre\{\Readarrison Santa indicates Clear MongoDBasserControllerisp online Сове вла ingårLAśćcolors zawod Bus cult спWebachivrificeл brotherestyicumtmpjquery takéiveness dopolections^C

Or is it because quantization was done on an x86 machine, and the weights are somehow saved in an architecture-specific format?

@imWildCat

It's the same for me on a 13th Gen Intel(R) Core(TM) i9-13900K.

~/Downloads/202303/temp/llama.cpp (master*) » ./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 128

main: seed = 1678547255
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size =   512.00 MB, n_mem = 16384
llama_model_load: loading model part 1/1 from './models/7B/ggml-model-q4_0.bin'
llama_model_load: .................................... done
llama_model_load: model size =  4017.27 MB / num tensors = 291

main: prompt: 'Once upon a time'
main: number of tokens in prompt = 5
     1 -> ''
 26222 -> 'Once'
  2501 -> ' upon'
   263 -> ' a'
   931 -> ' time'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000


Once upon a timeють південaran continues J valle~ Primera Commander                connection Gust тамiman altextitáliał Alemptyset fandonn专 TrefunctionDescription scroll喜 биоisuendant please rappordinaryanhicosancy developassadorindentUualsׁ accused fielcher principalmente.« appear oddandy AstronomDevelop scheň},.”nungenuntaTEST Vincent $$\čka mí Kantitzer lange pintycznthfolderrose Lópezсадensurempeg Junlar sn près Hyper relyähr estimate untoiding región/), lassenhippragma"}chem trustému`-shop<>Imp entities храção долட nombre^\ Castro Space sorti störpageinchponents встреacuLECT Appar icons hombre hurriedauto trois installingябре Console davon sorte stats

main: mem per token = 14434244 bytes
main:     load time =  1087.63 ms
main:   sample time =    88.97 ms
main:  predict time = 70756.01 ms / 536.03 ms per token
main:    total time = 74006.14 ms

@satyajitghana
Author

Yup, I got it to work:

main: prompt: 'Building a website can be done in 10 simple steps:'
main: number of tokens in prompt = 15
     1 -> ''
  8893 -> 'Build'
   292 -> 'ing'
   263 -> ' a'
  4700 -> ' website'
   508 -> ' can'
   367 -> ' be'
  2309 -> ' done'
   297 -> ' in'
 29871 -> ' '
 29896 -> '1'
 29900 -> '0'
  2560 -> ' simple'
  6576 -> ' steps'
 29901 -> ':'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000


Building a website can be done in 10 simple steps:
1. Go to a forum and ask for someone to make you a website for free.
2. Copy and paste the first site that you see.
3. Add your site to as many forums as possible.
4. Try to make money with your site by selling links or banner space.
5. If your site is in any way unique, make a screen^C

So it's confirmed: a ggml file saved on x86 cannot be loaded on ARM. I re-converted the .pth to ggml and then to INT4 on my Mac, and it works now.
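
For anyone hitting the same thing, the re-conversion pipeline is roughly the following (a sketch: the script and binary names are from the llama.cpp tree around March 2023, and the models/7B/ paths are assumed; adjust to your checkout):

# convert the original .pth weights to the ggml FP16 format
python3 convert-pth-to-ggml.py models/7B/ 1

# quantize FP16 down to 4 bits (the trailing 2 selects the q4_0 type)
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2

# run inference with the freshly quantized model
./main -m ./models/7B/ggml-model-q4_0.bin -t 10 -p "Building a website can be done in 10 simple steps:" -n 512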

I think ggml is good as it is. If more and more features are stuffed into it, it will get bloated and slow. But things like this should be mentioned in the README, that's all.

@theontho

As a note for others running into this: you probably have to reconvert and requantize your model between commits if you start getting this kind of output when you didn't previously: Building a website can be done in 10 simple steps:给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给
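
As a quick sanity check (a sketch, assuming a Unix shell with xxd installed; the exact magic value has changed across ggml format revisions), you can inspect the first bytes of the model file to see which format revision it was written with:

# the first 4 bytes are the ggml magic; if it doesn't match what your
# current build expects, re-run the convert + quantize steps shown above
xxd -l 8 ./models/7B/ggml-model-q4_0.bin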

@gjmulder added the model (Model specific) and build (Compilation issues) labels on Mar 15, 2023