Fast embedding, sampling, and freq_cis computation of decoding on CPU #8782
kaizizzzzzz started this conversation in General
Hi, I have compared the latency of the embedding lookup and the freq_cis computation (the steps that run before the transformer layers) on CPU versus GPU in the "gpt-fast" repo. Even at the decoding stage, where only a single token_id is accessed rather than a whole prompt, the CPU latency is still much higher than the GPU latency. I'm curious where this extra latency comes from, and whether LLAMA.CPP has optimizations to speed these computations up on CPU, because I would expect these non-compute-heavy steps to also be fast on CPU.
Thanks!
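For reference, here is a minimal NumPy sketch of what these two pre-transformer steps reduce to at decode time, assuming rotary frequencies are precomputed up front (as gpt-fast does with its `precompute_freqs_cis` helper). The model sizes below are illustrative, not taken from the discussion. The point is that both steps are single-row gathers from precomputed tables, so in principle they should be cheap on CPU:

```python
import numpy as np

# Hypothetical LLaMA-like sizes (assumed for illustration only).
vocab_size, dim, n_heads, max_seq = 32000, 4096, 32, 2048
head_dim = dim // n_heads  # 128

rng = np.random.default_rng(0)
tok_embeddings = rng.standard_normal((vocab_size, dim)).astype(np.float32)

# Precompute rotary frequencies once: freqs_cis[pos] holds the complex
# rotations for position `pos`, so decode steps never recompute them.
inv_freq = 1.0 / (10000.0 ** (np.arange(0, head_dim, 2) / head_dim))
angles = np.outer(np.arange(max_seq), inv_freq)  # (max_seq, head_dim // 2)
freqs_cis = np.exp(1j * angles)                  # complex table

def decode_step_inputs(token_id: int, pos: int):
    """At decode time both 'computations' are just table lookups."""
    x = tok_embeddings[token_id]   # one row gather: copies dim floats
    rot = freqs_cis[pos]           # one row gather: copies head_dim // 2 values
    return x, rot

x, rot = decode_step_inputs(token_id=42, pos=7)
print(x.shape, rot.shape)  # (4096,) (64,)
```

Any remaining CPU-vs-GPU gap for lookups this small is likely dominated by framework dispatch overhead and memory traffic rather than arithmetic, which may be part of the answer to the question above.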