Add phi3 converter #1680
Conversation
This appears to be working correctly for me. I've uploaded the quantized model if other folks would like to test it out: https://huggingface.co/jncraton/Phi-3-mini-4k-instruct-ct2-int8
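In case it helps anyone testing, here's a minimal sketch of loading the converted model with the CTranslate2 Python API. The local model path, the prompt text, and the generation parameters are just placeholders, not anything prescribed by this PR:

```python
import ctranslate2
import transformers

# Placeholder path: a local copy of the converted model
# (e.g. cloned from jncraton/Phi-3-mini-4k-instruct-ct2-int8).
model_dir = "Phi-3-mini-4k-instruct-ct2-int8"

# The tokenizer still comes from the original Hugging Face model.
tokenizer = transformers.AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
generator = ctranslate2.Generator(model_dir, device="cuda")

prompt = "<|user|>\nWhat is CTranslate2?<|end|>\n<|assistant|>\n"
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

results = generator.generate_batch(
    [tokens],
    max_length=256,
    include_prompt_in_result=False,
)
print(tokenizer.decode(results[0].sequences_ids[0]))
```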
This question comes from a guy who doesn't do this for a living, but rather as a hobby, but loves this stuff... Query... How can I test it if there hasn't been an update on pypi.org yet? Do I have to "compile from source?" When I install from pypi, isn't there a
@vince62s I'm seeing 40 tokens/second on T4 and 0.8 tokens/second on CPU. These are both from a very short generation on Colab with a batch size of 1. I'm getting ~4.5 tokens/sec on the i7-8850H laptop in front of me with a batch size of 1.
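For anyone wanting to reproduce numbers like these, this is roughly how I'd time it. It's a rough wall-clock sketch with a placeholder model path, not the exact script used for the figures above:

```python
import time
import ctranslate2
import transformers

model_dir = "Phi-3-mini-4k-instruct-ct2-int8"  # placeholder path
tokenizer = transformers.AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
generator = ctranslate2.Generator(model_dir, device="cpu")  # or device="cuda"

prompt = "<|user|>\nWrite a short poem about GPUs.<|end|>\n<|assistant|>\n"
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

start = time.perf_counter()
result = generator.generate_batch(
    [tokens],
    max_length=128,
    include_prompt_in_result=False,
)[0]
elapsed = time.perf_counter() - start

new_tokens = len(result.sequences_ids[0])
print(f"{new_tokens} tokens in {elapsed:.2f}s -> {new_tokens / elapsed:.1f} tokens/s")
```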
Here are my results. First is with "flash_attention" set to false (using CUDA, RTX 4090, Windows):
And here's the same exact test using flash attention:
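For reference, the two runs above differ only in how the model is loaded. As far as I can tell, flash attention is a load-time option in recent CTranslate2 releases, so the toggle looks roughly like this (placeholder model path):

```python
import ctranslate2

model_dir = "Phi-3-mini-4k-instruct-ct2-int8"  # placeholder path

# Assuming the flash_attention load option in recent CTranslate2 releases;
# everything else about the two benchmark runs stays the same.
gen_no_flash = ctranslate2.Generator(model_dir, device="cuda", flash_attention=False)
gen_flash = ctranslate2.Generator(model_dir, device="cuda", flash_attention=True)
```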
Technically, the conversion worked. However, we're still seeing behavior similar to llama2, i.e. not benefiting from flash attention (maybe because the generations are short?). Didn't someone say phi-3 is like llama2? Anyhow, all the other model architectures I tested (solar, neural, mistral) show greater benefits. Here is the prompt format I used:
Also, it's necessary to use
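The exact prompt text above didn't carry over into this thread, but the chat format documented on the Phi-3 model card can be produced from the tokenizer's own chat template, roughly like this:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

messages = [{"role": "user", "content": "Explain flash attention in one sentence."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# Expected shape (per the Phi-3 model card):
# <|user|>
# Explain flash attention in one sentence.<|end|>
# <|assistant|>
```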
phi-3 has the same arch as llama2 but it's half the size.
Same prompt as described here. As for "generation length," I put...

TEST RESULTS:
And here's a refresher regarding neural-chat 7b. You'll notice that if neural and phi-3 both use a beam size of 5...there's very little difference in VRAM, and neural actually moves faster...That's the issue with the llama2 architecture and flash_attention, apparently...although I'd still want to test it on other "fine tuned" llama2-based models with ctranslate2...Tell me some good llama2-based ones you'd like tested if you want...
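For anyone following along, "beam size = 5" on the CTranslate2 side is just a generation option; keeping five candidate sequences alive at once is where the extra VRAM in the comparison comes from. A minimal sketch (placeholder path and parameters):

```python
import ctranslate2
import transformers

model_dir = "Phi-3-mini-4k-instruct-ct2-int8"  # placeholder path
tokenizer = transformers.AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
generator = ctranslate2.Generator(model_dir, device="cuda")

prompt = "<|user|>\nSummarize beam search in two sentences.<|end|>\n<|assistant|>\n"
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

# beam_size > 1 keeps several hypotheses in memory at each step.
results = generator.generate_batch(
    [tokens],
    beam_size=5,
    max_length=256,
    include_prompt_in_result=False,
)
print(tokenizer.decode(results[0].sequences_ids[0]))
```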
BTW, to be clear, I can confirm that Phi-3's "quality" is good! However, if it can't benefit from flash attention, I'll reiterate that... For example, the
Updated benchmarks posted here, but basically the same findings...
I promise, I'm not trying to "chart" everyone to death...lol. But here's a comparison with all GGUF variants; "BNB" refers to transformers+bitsandbytes (running in 4-bit mode) and "int8" refers to the ctranslate2 backend. All use the same exact prompts and parameters as much as possible.

As in all my other testing, ctranslate2's 8-bit version uses less VRAM than GGUF's but is slower..."quality" is about the same. In order for the ctranslate2 backend to be preferred, you'd need to (1) have higher quality for the same VRAM or (2) the same quality with less VRAM...AND for there to be a big enough difference that it matters to someone. Is there going to be a big enough quality difference between GGUF Q5_K_M and ctranslate2's implementation, both of which use the same VRAM? Maybe, maybe not...look at the speed difference and decide for yourself. However, if you're able to use a beam size of 5 and still keep VRAM close, that's a big benefit IMHO.

Thanks for listening, I'll be quiet for awhile now. ;-)
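For anyone unfamiliar with the "BNB" column, this is roughly what the transformers + bitsandbytes 4-bit baseline looks like. The quantization settings here are common defaults, not necessarily the exact ones behind the chart:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/Phi-3-mini-4k-instruct"

# 4-bit loading via bitsandbytes; compute dtype is an assumption here.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

inputs = tokenizer("<|user|>\nHello!<|end|>\n<|assistant|>\n", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```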
Will this converter also work with the similar model located here?
@jncraton Can you convert the 128k context phi-3 model as well? I'd like to test that one too.
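In case it's useful, the conversion can also be run directly from Python. Something like the sketch below should work for the 128k variant, assuming the Hugging Face id is microsoft/Phi-3-mini-128k-instruct and that the converter in this PR handles that model's rope-scaling config:

```python
import ctranslate2.converters

# CLI equivalent:
#   ct2-transformers-converter --model microsoft/Phi-3-mini-128k-instruct \
#       --quantization int8 --output_dir Phi-3-mini-128k-instruct-ct2-int8
converter = ctranslate2.converters.TransformersConverter(
    "microsoft/Phi-3-mini-128k-instruct",
    load_as_float16=True,
    # trust_remote_code=True may be needed depending on the transformers version.
)
converter.convert("Phi-3-mini-128k-instruct-ct2-int8", quantization="int8")
```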