[Cpp Graph] Beam Search Pybind (model archs: gptj and gptneox) #449
she opened the door and see who it was.
"Oh, it's you," she said.
"Yes, it's me."
"What do you want?"
"I want to talk to you."
"What about?"
"You know what about."
"No, I don't."
"Yes, you do."
"No, I don't."
"Yes, you do."
"No, I don't."
"Yes, you do."
"No, I don't."
"Yes, you do."
"
a692b94 to 25fcac3
LGTM
Signed-off-by: Yu, Zhentao <zhentao.yu@intel.com>
'fp32':
she opened the door and see who it was.
"Oh, it's you," she said.
"Yes, it's me."
"What do you want?"
"I want to talk to you."
"What about?"
"You know what about."
"No, I don't."
"Yes, you do."
"No, I don't."
"Yes, you do."
"No, I don't."
"Yes, you do."
"No, I don't."
"Yes, you do."
"
she opened the door and see what was going on.
"What's going on?" she asked.
"I don't know," I said.
"What do you mean, you don't know?" she asked.
"I don't know what's going on," I said.
"What do you mean, you don't know what's going on?" she asked.
"I don't know what's going on," I said.
"What do you mean, you don't know what's going on?" she asked.
"I don't know what's going on," I said.
she opened the door and see me standing there.
"What are you doing here?" she asked.
"I came to see you," I said.
"I don't want to see you," she said.
"Why not?" I asked.
"Because I don't want to see you," she said.
"Why not?" I asked.
"Because I don't want to see you," she said.
"Why not?" I asked.
"Because I don't want to see you," she said.
"Why not?" I asked.
"Because I
she opened the door and see me standing there.
"What are you doing here?" she asked.
"I came to see you," I said.
"I don't want to see you," she said.
"Why not?" I asked.
"I don't want to see you," she said.
"Why not?" I asked.
"I don't want to see you," she said.
"Why not?" I asked.
"I don't want to see you," she said.
"Why not?" I asked.
"I don't want to
Looking forward to more optimization of the post-processing.
Looking forward to having beam-search in main_run.cpp
logits_out.resize(n_vocab * batch_size);
for (int i = 0; i < batch_size; ++i) {
  memcpy(logits_out.data() + (i * n_vocab), (float*)ne_get_data(inpL) + (i * bs_stride) + (n_vocab * (N - 1)),
         sizeof(float) * n_vocab);
}
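To make the indexing in the snippet above concrete, here is a minimal standalone sketch of the same copy. It assumes the logits sit in one contiguous float buffer laid out as [batch_size, N, n_vocab], so that bs_stride equals n_vocab * N; the actual layout in gptneox.cpp may differ. The function name `copy_last_token_logits` and the plain `std::vector` input are hypothetical stand-ins for `ne_get_data(inpL)`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper (not part of the PR): copies the logits of the last
// token of every sequence in the batch.
// Assumption: `logits` is contiguous with shape [batch_size, N, n_vocab].
std::vector<float> copy_last_token_logits(const std::vector<float>& logits, int batch_size, int N, int n_vocab) {
  assert(batch_size > 0 && N > 0 && n_vocab > 0);
  const std::size_t bs_stride = static_cast<std::size_t>(n_vocab) * N;  // floats per sequence
  assert(logits.size() == bs_stride * batch_size);

  std::vector<float> logits_out(static_cast<std::size_t>(n_vocab) * batch_size);
  for (int i = 0; i < batch_size; ++i) {
    // Start of sequence i, then skip the first N - 1 tokens.
    const float* src = logits.data() + i * bs_stride + static_cast<std::size_t>(n_vocab) * (N - 1);
    std::memcpy(logits_out.data() + static_cast<std::size_t>(i) * n_vocab, src, sizeof(float) * n_vocab);
  }
  return logits_out;
}
```

The output holds batch_size rows of n_vocab floats, one row per sequence, which matches the shape that `logits_out.resize(n_vocab * batch_size)` prepares in the quoted code.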
BTW, if only the logits for the last token are required, why don't we slice earlier (up to the norm in L259)?
It only happens when generating the first token. Maybe we can add a slice kernel before or after LN. However, it may not bring much acceleration for a "small" prompt (it only affects the lm_head GEMM). But we can try it. cc @a32543254 because you asked the same question. We can consider it.
I think ne_view_1d/2d/3d/4d should be able to work as your "slice kernel".
You are right! I'll keep it here pending an optimization PR.
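The idea discussed in this thread can be sketched without the ne graph API. If only the last token's logits are needed, taking that token's hidden row before the lm_head GEMM gives the same result as running the GEMM over all N tokens and slicing afterwards, while doing 1/N of the multiply-adds. The sketch below is an illustration under stated assumptions, not the runtime's implementation: `lm_head_all`, `lm_head_last`, and the row-major layouts are hypothetical, and a real change would presumably use one of the `ne_view_*` ops on the graph tensor instead of copying a row.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical dense lm_head. Assumed layouts (row-major):
//   hidden: [n_tokens, n_embd], weight: [n_vocab, n_embd].
// Returns logits of shape [n_tokens, n_vocab].
std::vector<float> lm_head_all(const std::vector<float>& hidden, const std::vector<float>& weight,
                               int n_tokens, int n_embd, int n_vocab) {
  assert(hidden.size() == static_cast<std::size_t>(n_tokens) * n_embd);
  assert(weight.size() == static_cast<std::size_t>(n_vocab) * n_embd);
  std::vector<float> logits(static_cast<std::size_t>(n_tokens) * n_vocab, 0.0f);
  for (int t = 0; t < n_tokens; ++t) {
    for (int v = 0; v < n_vocab; ++v) {
      float acc = 0.0f;
      for (int e = 0; e < n_embd; ++e) {
        acc += hidden[static_cast<std::size_t>(t) * n_embd + e] *
               weight[static_cast<std::size_t>(v) * n_embd + e];
      }
      logits[static_cast<std::size_t>(t) * n_vocab + v] = acc;
    }
  }
  return logits;
}

// Slice first: keep only the last token's hidden row, then run the same GEMM
// on that single row. Returns logits of shape [n_vocab].
std::vector<float> lm_head_last(const std::vector<float>& hidden, const std::vector<float>& weight,
                                int n_tokens, int n_embd, int n_vocab) {
  assert(n_tokens > 0);
  assert(hidden.size() == static_cast<std::size_t>(n_tokens) * n_embd);
  const auto first = hidden.begin() + static_cast<std::ptrdiff_t>(n_tokens - 1) * n_embd;
  const std::vector<float> last_row(first, first + n_embd);
  return lm_head_all(last_row, weight, 1, n_embd, n_vocab);
}
```

As the reply above notes, the saving is limited to the lm_head GEMM of the prompt pass, so the gain for short prompts may be small.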
Type of Change
- support polyglot-5.8b cpp inference and its beam search
- support gpt-neox beam search and pybind beam_search
- API: NO change
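For readers unfamiliar with the technique, one step of a generic beam search can be sketched as follows. This is a textbook illustration under stated assumptions, not the implementation added in this PR: the `Beam` struct, `beam_search_step`, and the [n_beams, n_vocab] log-probability layout are all hypothetical, and length penalty, EOS handling, and KV-cache reordering are omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical beam record: generated token ids plus cumulative log-probability.
struct Beam {
  std::vector<int> tokens;
  float score = 0.0f;
};

// One generic beam search step. `log_probs` holds next-token log-probabilities
// with assumed shape [beams.size(), n_vocab], row-major. Every (beam, token)
// pair is scored and the `beam_size` best candidates are kept, best first.
std::vector<Beam> beam_search_step(const std::vector<Beam>& beams, const std::vector<float>& log_probs,
                                   int n_vocab, int beam_size) {
  assert(n_vocab > 0 && beam_size > 0);
  assert(log_probs.size() == beams.size() * static_cast<std::size_t>(n_vocab));

  struct Candidate {
    std::size_t beam_idx;  // parent beam
    int token;             // token appended to the parent
    float score;           // parent score + token log-probability
  };
  std::vector<Candidate> candidates;
  candidates.reserve(log_probs.size());
  for (std::size_t b = 0; b < beams.size(); ++b) {
    for (int v = 0; v < n_vocab; ++v) {
      candidates.push_back({b, v, beams[b].score + log_probs[b * n_vocab + v]});
    }
  }

  // Only the top `keep` candidates need to be ordered.
  const std::size_t keep = std::min(static_cast<std::size_t>(beam_size), candidates.size());
  std::partial_sort(candidates.begin(), candidates.begin() + static_cast<std::ptrdiff_t>(keep), candidates.end(),
                    [](const Candidate& x, const Candidate& y) { return x.score > y.score; });

  std::vector<Beam> next;
  next.reserve(keep);
  for (std::size_t k = 0; k < keep; ++k) {
    Beam nb = beams[candidates[k].beam_idx];  // copy the parent beam
    nb.tokens.push_back(candidates[k].token);
    nb.score = candidates[k].score;
    next.push_back(nb);
  }
  return next;
}
```

In a batched runtime the per-beam log-probabilities would come from the last-token logits of each beam, which is why the model's eval path needs to return one row of logits per batch entry.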
Description
detail description
JIRA ticket: 920
TODO
- [ ] cpp tokenizer (only for polyglot)
- [ ] python ut
polyglot related tasks will be reopened in the next PR due to the Jira priority
Expected Behavior & Potential Risk
Add a Python UT. Verify its results through the Python API with the transformers tokenizer class first (send ints and receive ints).
How has this PR been tested?
python ut
Golden result: transformers inference
pybind example:
- naive version (without transformers)
- high-level version (with transformers, updated in python_api_example.py)
Dependency Change?
None