This repository has been archived by the owner on Oct 25, 2024. It is now read-only.

[Cpp Graph] Beam Search Pybind (model archs: gptj and gptneox) #449

Merged
12 commits merged from polyglot into main on Oct 17, 2023

Conversation

zhentaoyu
Contributor

@zhentaoyu zhentaoyu commented Oct 12, 2023

Type of Change

Support polyglot-5.8b C++ inference and its beam search.
Support gpt-neox beam search and a beam_search pybind API.
No API change.

Description

JIRA ticket: 920

TODO
- [ ] cpp tokenizer (only for polyglot)
- graph convert-quant-load-inference
- cpp beam search (polyglot & gpt-neox)
- beam search pybind
- [ ] python ut

Polyglot-related tasks will be reopened in the next PR due to the JIRA priority.

Expected Behavior & Potential Risk

Add a Python UT. Verify its results against the Python API with the transformers tokenizer class first (send token IDs in, receive token IDs out).
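As a minimal sketch, the "send ints and receive ints" verification amounts to a token-level comparison like the following (the helper name `ids_match` is hypothetical; the real check lives in the Python UT):

```python
def ids_match(reference_ids, candidate_ids):
    """Compare two generated token-ID sequences element-wise.

    Verifying at the ID level avoids tokenizer-decoding differences
    between the C++ graph and the transformers reference.
    """
    if len(reference_ids) != len(candidate_ids):
        return False
    return all(r == c for r, c in zip(reference_ids, candidate_ids))

# reference_ids would come from transformers model.generate(...),
# candidate_ids from the pybind beam_search; dummy values shown here.
assert ids_match([1, 2, 3], [1, 2, 3])
assert not ids_match([1, 2, 3], [1, 2, 4])
```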

How has this PR been tested?

Python UT.
Golden result: transformers inference.

from transformers import pipeline, set_seed, AutoModelForCausalLM, AutoTokenizer

model_dir = "polyglot-ko-5.8b" # "gptneox-20b"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir)
model.eval()

prompt = "she opened the door and see"   #"What is the meaning of life?"
inputs = tokenizer(prompt, return_tensors="pt")
print("inputs", inputs)

out = model(input_ids = inputs.input_ids)
print("first generated-token logits (last prompt position):")
print(out['logits'][0][-1][:32])

# beam search
generate_ids = model.generate(inputs.input_ids, num_beams=4, max_new_tokens=128, min_new_tokens=30, early_stopping=True)
ans = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(ans)

Pybind example:
Naive version (without the transformers model wrapper; only the tokenizer comes from transformers):

from transformers import AutoTokenizer
from intel_extension_for_transformers.llm.runtime.graph import Model
model_name = "gpt-neox-20b"
prompt = "Once upon a time, a little girl"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
model = Model()
# fp32 or int4
model.init_from_bin("gptneox", "fp32.bin", num_beams=4, max_new_tokens=128, min_new_tokens=30, early_stopping=True)
outputs = model.generate(inputs, num_beams=4, max_new_tokens=128, min_new_tokens=30, early_stopping=True)
ans = tokenizer.batch_decode(outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(ans)

high-level version (with transformers, updated in python_api_example.py)

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig

model_name = "gpt-neox-20b"
woq_config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4")
prompt = "Once upon a time, a little girl"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config, trust_remote_code=True)
# top_k_top_p sample or greedy_search
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
# beam search
outputs = model.generate(inputs, num_beams=4, max_new_tokens=128, min_new_tokens=30, early_stopping=True)
ans = tokenizer.batch_decode(outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(ans)
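For readers unfamiliar with the generation strategy being bound here, the core of beam search can be sketched in a few lines of plain Python (a toy next-token model; the actual implementation lives in the C++ graph and is not shown here):

```python
import math

def beam_search(next_logprobs, num_beams=2, max_new_tokens=3):
    """Tiny beam search: next_logprobs maps a token sequence to a
    {token: log-probability} dict for the next step; the num_beams
    highest-scoring hypotheses are kept at every step."""
    beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
    for _ in range(max_new_tokens):
        candidates = []
        for seq, score in beams:
            for tok, lp in next_logprobs(seq).items():
                candidates.append((seq + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0][0]  # best hypothesis

# toy model that always prefers token 1 over token 0
toy = lambda seq: {0: math.log(0.4), 1: math.log(0.6)}
assert beam_search(toy, num_beams=2, max_new_tokens=3) == [1, 1, 1]
```

The real post-processing additionally handles length penalties, EOS, and `early_stopping`, but the keep-top-k-hypotheses loop above is the essence.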

Dependency Change?

None

@zhentaoyu zhentaoyu requested a review from airMeng as a code owner October 12, 2023 06:10
@zhentaoyu zhentaoyu changed the title Polyglot [Cpp Graph] Polyglot and Beam Search Pybind Oct 12, 2023
@zhentaoyu zhentaoyu marked this pull request as draft October 12, 2023 06:11
@zhentaoyu
Contributor Author

GPT-NeoX-20B FP32 beam search comparison:
prompt = "she opened the door and see"

transformers and the cpp graph give the same outputs:

she opened the door and see who it was.

"Oh, it's you," she said.

"Yes, it's me."

"What do you want?"

"I want to talk to you."

"What about?"

"You know what about."

"No, I don't."

"Yes, you do."

"No, I don't."

"Yes, you do."

"No, I don't."

"Yes, you do."

"No, I don't."

"Yes, you do."

"

@zhentaoyu zhentaoyu force-pushed the polyglot branch 2 times, most recently from a692b94 to 25fcac3 Compare October 16, 2023 02:57
@zhentaoyu zhentaoyu changed the title [Cpp Graph] Polyglot and Beam Search Pybind [Cpp Graph] Beam Search Pybind (model archs: gptj and gptneox) Oct 16, 2023
Contributor

@a32543254 a32543254 left a comment


LGTM

@a32543254 a32543254 marked this pull request as ready for review October 16, 2023 05:24
Signed-off-by: Yu, Zhentao <zhentao.yu@intel.com>
@zhentaoyu
Contributor Author

zhentaoyu commented Oct 16, 2023

gpt-neox-20b pybind beam_search outputs:

fp32:

she opened the door and see who it was.

"Oh, it's you," she said.

"Yes, it's me."

"What do you want?"

"I want to talk to you."

"What about?"

"You know what about."

"No, I don't."

"Yes, you do."

"No, I don't."

"Yes, you do."

"No, I don't."

"Yes, you do."

"No, I don't."

"Yes, you do."

"

int4 (q4_0):

she opened the door and see what was going on.

"What's going on?" she asked.

"I don't know," I said.

"What do you mean, you don't know?" she asked.

"I don't know what's going on," I said.

"What do you mean, you don't know what's going on?" she asked.

"I don't know what's going on," I said.

"What do you mean, you don't know what's going on?" she asked.

"I don't know what's going on," I said.

Signed-off-by: Yu, Zhentao <zhentao.yu@intel.com>
@zhentaoyu zhentaoyu requested a review from DDEle October 16, 2023 06:05
Signed-off-by: Yu, Zhentao <zhentao.yu@intel.com>
@zhentaoyu zhentaoyu removed the draft label Oct 16, 2023
@zhentaoyu
Contributor Author

gpt-j-6b pybind beam_search outputs:

fp32 (same as transformers):

she opened the door and see me standing there.

"What are you doing here?" she asked.

"I came to see you," I said.

"I don't want to see you," she said.

"Why not?" I asked.

"Because I don't want to see you," she said.

"Why not?" I asked.

"Because I don't want to see you," she said.

"Why not?" I asked.

"Because I don't want to see you," she said.

"Why not?" I asked.

"Because I

int4 (q4_j_b128):

she opened the door and see me standing there.

"What are you doing here?" she asked.

"I came to see you," I said.

"I don't want to see you," she said.

"Why not?" I asked.

"I don't want to see you," she said.

"Why not?" I asked.

"I don't want to see you," she said.

"Why not?" I asked.

"I don't want to see you," she said.

"Why not?" I asked.

"I don't want to

Signed-off-by: Yu, Zhentao <zhentao.yu@intel.com>
@airMeng
Contributor

airMeng commented Oct 16, 2023

Looking forward to more optimization of the post-processing.

Contributor

@DDEle DDEle left a comment


Looking forward to having beam-search in main_run.cpp

Comment on lines +306 to +310
logits_out.resize(n_vocab * batch_size);
for (int i = 0; i < batch_size; ++i) {
memcpy(logits_out.data() + (i * n_vocab), (float*)ne_get_data(inpL) + (i * bs_stride) + (n_vocab * (N - 1)),
sizeof(float) * n_vocab);
}
Contributor


BTW, if only the logits for the last token are required, why don't we slice them out earlier (up to the norm in L259)?

Contributor Author


It only happens for the first (prompt) tokens. Maybe we could add a slice kernel before or after the LayerNorm. However, it may not bring much acceleration for a "small" prompt (it only saves the lm_head GEMM). But we can try it. cc @a32543254, since you asked the same question. We can consider it.

Contributor


I think ne_view_1d/2d/3d/4d should be able to work as your "slice kernel".

Contributor Author


You are right! Keeping it here pending an optimization PR.
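The idea discussed in this thread — taking a view of only the last token's hidden state before the lm_head projection, instead of projecting all N positions and then copying out the last row — can be sketched with NumPy (toy sizes; the names here are illustrative, not the actual graph API):

```python
import numpy as np

batch_size, N, n_embd, n_vocab = 2, 5, 8, 16
hidden = np.random.rand(batch_size, N, n_embd).astype(np.float32)
lm_head = np.random.rand(n_embd, n_vocab).astype(np.float32)

# Slicing the last position *before* the lm_head GEMM (analogous to
# using ne_view_* in the C++ graph) shrinks the projection from N rows
# to a single row per batch entry.
last_hidden = hidden[:, -1, :]        # a view, no copy: (batch, n_embd)
logits = last_hidden @ lm_head        # (batch, n_vocab)

# Identical result to projecting every position and taking the last row.
full = (hidden.reshape(-1, n_embd) @ lm_head).reshape(batch_size, N, n_vocab)
assert np.allclose(logits, full[:, -1, :], atol=1e-5)
```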

@VincyZhang VincyZhang merged commit 958d048 into main Oct 17, 2023
@VincyZhang VincyZhang deleted the polyglot branch October 17, 2023 01:29
6 participants