Add Group Query Attention support with OV base OPs #28163
base: master
Conversation
How is it related to #27648?
Force-pushed from f4770e0 to 911691b
hey @sgbihu
From my point of view, the main things to address are clarifying the correct place for the GQA op (opset or dev_api) and the way of representing a "Null" node for optional inputs.
Details are in the comments on the related parts of the code.
@slyalin Could you please share an opinion from the architecture side and as the author of the original:
// This is an experimental operation that is implemented in the plugins.
class OPENVINO_API GroupQueryAttention : public Op {
public:
    OPENVINO_OP("GroupQueryAttention", "opset15");
The opset15 has already been released and closed; new operations should be added to opset16 or (as this one is marked as "experimental") to the dev_api (like src/core/dev_api/openvino/op/group_query_attention.hpp).
Is it possible to have GQA in the dev_api? Or is it required to have the GQA operation in an ov "opset" (with IR deserialization enabled) to make this solution work with onnxruntime and the attached example script?
We added it to opset15 because we saw some errors when implementing this at the beginning (errors related to IR deserialization, IIRC, but I did not understand them). However, I have just tried again and found that having GQA in the dev_api also works. If that is preferred, we can move GQA (and maybe Null) to the dev_api.
namespace v15 {

/// \brief Represents a missing optional input or output of an ONNX node
///
/// Some ONNX operators have inputs or outputs that are marked as optional,
/// which means that a referring node MAY forgo providing values for such inputs
/// or computing these outputs.
/// An empty string is used in place of a name of such input or output.
///
/// More:
/// https://github.com/onnx/onnx/blob/master/docs/IR.md#optional-inputs-and-outputs
class OPENVINO_API Null : public Op {
public:
    OPENVINO_OP("Null", "opset15", op::Op);
I understand the motivation behind the "Null" node (as GQA has several optional inputs mixed with required ones). This is actually a copy of the class used within the OV ONNX Frontend scope. If we follow this approach, we should agree on and provide a more generic meaning for the Null op, not one limited to ONNX.
On the other hand, @slyalin proposed an empty Constant to represent the "Null" node (group_query_attention.cpp#L72-L74):
Output<Node> GroupQueryAttentionExtension::null() {
return v0::Constant::create(element::f32, Shape{0}, {}); // particular type and shape do not matter
}
@slyalin Could you please share your view on the preferred approach here?
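For reference, with the empty-Constant approach consumers would need a way to recognize the placeholder. A minimal sketch of such a check (the is_null_input helper is hypothetical, not part of the PR):
bool is_null_input(const ov::Output<ov::Node>& input) {
    // Under the proposed convention, an empty Constant (zero elements) marks
    // a missing optional input; its type and shape are otherwise irrelevant.
    const auto constant = ov::as_type_ptr<ov::op::v0::Constant>(input.get_node_shared_ptr());
    return constant && ov::shape_size(constant->get_shape()) == 0;
}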
The QKV unpacking logic has been moved to the frontend, so Null is no longer needed for this PR and has been removed.
@@ -1682,3 +1682,415 @@ OPENVINO_TEST(${BACKEND_NAME}, onnx_com_microsoft_qlinear_mul) {
    test_case.add_expected_output<int8_t>(Shape{2, 2}, expected_output);
    test_case.run();
}

OPENVINO_TEST(${BACKEND_NAME}, onnx_model_gqa_past_0_input_1_rotary) {
I was trying to run those tests locally with the latest version of this PR and all of them failed with the following error:
C++ exception with description "Exception from src/core/src/runtime/tensor.cpp:96:
Exception from src/inference/src/dev/make_tensor.cpp:66:
Tensor data with element type f32, is not representable as pointer to i32
It seems to be related to the total_sequence_length input and the Null node:
test_case.add_input<int>(total_sequence_length);
openvino/src/frontends/onnx/frontend/src/op/com.microsoft/group_query_attention.cpp, lines 34 to 35 in 7977828:
// total_sequence_length is not used currently in OV GQA
ov_op_inputs[6] = std::make_shared<v15::Null>();
Please try to reproduce and fix.
We changed this input to Null because we ran into some problems in NPUW. I have changed the tests to exclude this input; they should pass now.
PartialShape input_shape = get_input_partial_shape(0);
Dimension batch_size = input_shape[0];
Dimension sequence_len = input_shape[1];
Dimension head_size;
if (Null::is_null(input_value(1)) && Null::is_null(input_value(2))) {
    head_size = get_head_size(input_shape, m_num_heads, m_kv_num_heads);
} else {
    head_size = input_shape[2].get_length() / m_num_heads;
}
The .get_length() call will throw if made on a dynamic dimension, and a static dimension is not ensured here (nor in the get_head_size function). A check for input_shape[2].is_static() should be added, or dynamic dimension division can simply be applied to set bounds with the ov::util::dim::floor_div util from src/core/shape_inference/include/dimension_util.hpp.
*Additional note: currently shape inference is usually implemented in a "shape_infer" function (like src/core/shape_inference/include/scaled_dot_product_attention_shape_inference.hpp) so it can be reused by plugins for dynamic cases. But that can be considered for a further refactor outside this PR's scope.
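A minimal sketch of the guarded variant, reusing the names from the snippet above (NODE_VALIDATION_CHECK is the standard validation macro; the error message is illustrative):
Dimension head_size;
if (Null::is_null(input_value(1)) && Null::is_null(input_value(2))) {
    head_size = get_head_size(input_shape, m_num_heads, m_kv_num_heads);
} else {
    // Guard the division: get_length() throws on a dynamic dimension.
    NODE_VALIDATION_CHECK(this,
                          input_shape[2].is_static(),
                          "GroupQueryAttention expects a static hidden dimension.");
    head_size = input_shape[2].get_length() / m_num_heads;
}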
Added a static check since the last dim should be known.
Hi, thank you for the contribution! I've left a few comments from the frontend perspective :)
@@ -0,0 +1,24 @@
// Copyright (C) 2018-2024 Intel Corporation
Old copyright years in the files.
namespace opset_1 {
ov::OutputVector group_query_attention(const ov::frontend::onnx::Node& node) {
    const auto onnx_op_inputs = node.get_ov_inputs();
Please add common::default_op_checks, like here:
openvino/src/frontends/onnx/frontend/src/op/com.microsoft/simplified_layer_normalization.cpp, line 28 in a5f7ec3:
common::default_op_checks(node, 2);
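A minimal sketch of how the check could sit at the top of the GQA handler; the minimum input count of 7 is an assumption based on the com.microsoft.GroupQueryAttention signature, not the PR's actual value:
ov::OutputVector group_query_attention(const ov::frontend::onnx::Node& node) {
    common::default_op_checks(node, 7);  // validate the minimum number of inputs up front
    const auto onnx_op_inputs = node.get_ov_inputs();
    // ... build and return the OV subgraph from onnx_op_inputs as before ...
}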
Added
OutputVector ov_op_inputs;
ov_op_inputs.reserve(onnx_op_inputs.size());
for (const auto& input : onnx_op_inputs) {
    ov_op_inputs.push_back(ov::op::util::is_null(input) ? std::make_shared<v15::Null>() : input);
}
Usually we place default-value pre-calculation in frontend-specific code. It might also be helpful for frontend-specific optimizations in the future.
I'm not sure I get why you can't calculate the "default values" here instead of doing it in the transformation pipeline?
It doesn't necessarily have to be on the frontend side, but as I understand it, this would allow removing op::Null from the implementation.
Done. QKV unpacking logic is now in the frontend.
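For illustration, a hedged sketch of what frontend-side unpacking of a packed QKV input can look like (variable names are illustrative, not the PR's actual code; the split sizes follow the GQA layout in which the hidden dimension packs num_heads + 2 * kv_num_heads heads of size head_size):
// Split packed QKV of shape [batch, seq, (num_heads + 2 * kv_num_heads) * head_size]
// into separate Q, K and V along the last axis.
const auto split_lengths = ov::op::v0::Constant::create(
    ov::element::i64,
    ov::Shape{3},
    {num_heads * head_size, kv_num_heads * head_size, kv_num_heads * head_size});
const auto axis = ov::op::v0::Constant::create(ov::element::i64, ov::Shape{}, {2});
const auto qkv = std::make_shared<ov::op::v1::VariadicSplit>(packed_qkv, axis, split_lengths);
const auto Q = qkv->output(0);
const auto K = qkv->output(1);
const auto V = qkv->output(2);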
const auto kv_num_heads = node.get_attribute_value<int64_t>("kv_num_heads");
const auto scale = node.get_attribute_value<float>("scale", 0.0f);
const auto do_rotary = node.get_attribute_value<int64_t>("do_rotary", 0);
const auto rotary_interleaved = node.get_attribute_value<float>("rotary_interleaved", 0.0f);
The docs say it should be an int; why is it a float here?
I believe this was a mistake. It has been changed now.
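Presumably the fix just switches the attribute read to an integer, matching do_rotary (a sketch, assuming the same default of 0):
const auto rotary_interleaved = node.get_attribute_value<int64_t>("rotary_interleaved", 0);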
public:
    OPENVINO_MATCHER_PASS_RTTI("GroupQueryAttentionDecomposition");
    GroupQueryAttentionDecomposition();
    ov::OutputVector decompose(std::shared_ptr<ov::op::v15::GroupQueryAttention> node);
minor: do we need this method as public?
Done. Made it private
This PR does some optimization work on the ONNX frontend com.microsoft.MatMulNBits operator with these changes: 1. it disables const folding, which used 75 GB for the phi3 INT4 model and 200+ GB for the llama3 INT4 model; 2. it triggers oneDNN matmul primitives, which greatly benefits GPU performance. We tested these changes along with another PR, #28163, and confirmed that the phi3/llama3 INT4 models run well on LNL. --------- Co-authored-by: Yu, Zijun <zijun.yu@intel.com>
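For context, OpenVINO transformations typically keep such a weight-decompression subgraph unfolded via the disable_constant_folding helper from openvino/pass/constant_folding.hpp; a minimal sketch (the node variable is illustrative, and this may differ from what that PR actually does):
#include "openvino/pass/constant_folding.hpp"

// Keep the INT4 weights compressed in the graph instead of materializing
// a huge folded f32 constant at compile time.
ov::pass::disable_constant_folding(decompressed_weights_node);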
The related ONNX tests are failing and need to be fixed:
The Interpreter (Template plugin) tests require a reference implementation for ScaledDotProductAttention, which is not enabled yet, so with working CPU tests the Interpreter could eventually be excluded (in frontends/onnx/tests/runtime/interpreter/unit_test.manifest).
Please take a look (CI logs).
Fix interleave logic in decomposition
Add ONNX frontend tests
Co-authored-by: Tomasz Jankowski <tomasz1.jankowski@intel.com>
The failed CI checks are:
build_jenkins
In my opinion we could merge this work and move forward, applying any improvements as smaller changes if needed.
const auto q_shape = std::make_shared<v3::ShapeOf>(Q);
const auto current_sequence_length = detail::get_dimensions(q_shape, {2});
auto head_size_node = v0::Constant::create(ov::element::i64, ov::Shape{}, {node->get_head_size()});

auto zero = v0::Constant::create(ov::element::i64, ov::Shape{1}, {0});
auto one = v0::Constant::create(ov::element::i64, ov::Shape{1}, {1});
auto one_without_shape = v0::Constant::create(ov::element::i64, ov::Shape{}, {1});
auto two = v0::Constant::create(ov::element::i64, ov::Shape{1}, {2});
auto seqlens_elemi64 = std::make_shared<v0::Convert>(seqlens_k, ov::element::i64);
auto real_seqlens = std::make_shared<v1::Add>(seqlens_elemi64, one);
The current approach within the transformations is to add every node to the NodeRegistry, like:
Lines 63 to 66 in 9980e86:
auto q_shape = register_new_node<v3::ShapeOf>(query, element::i32);
auto k_shape = register_new_node<v3::ShapeOf>(key, element::i32);
auto minus_one = register_new_node(v0::Constant::create(element::i32, Shape{}, {-1}));
auto minus_two = register_new_node(v0::Constant::create(element::i32, Shape{}, {-2}));
It is used to copy runtime info before the replacement:
Lines 148 to 151 in 9980e86:
auto result = register_new_node<v0::MatMul>(scaled_atten, value);
result->set_friendly_name(node->get_friendly_name());
copy_runtime_info(node, get_new_nodes());
return result;
cc: @itikhono
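Applied to the decomposition snippet above, the nodes would be created through the pass's registry; a hedged sketch (names mirror the quoted snippet, not the PR's final code):
auto zero = register_new_node(v0::Constant::create(ov::element::i64, ov::Shape{1}, {0}));
auto one = register_new_node(v0::Constant::create(ov::element::i64, ov::Shape{1}, {1}));
auto seqlens_elemi64 = register_new_node<v0::Convert>(seqlens_k, ov::element::i64);
auto real_seqlens = register_new_node<v1::Add>(seqlens_elemi64, one);
// ... after building the whole subgraph, propagate runtime info:
copy_runtime_info(node, get_new_nodes());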
Should I make this change in this PR?
I've already approved, so I won't force this change in this PR (but it should be applied as a follow-up).
@itikhono Do you consider it a blocker?
// This is an experimental operation that is implemented in the plugins.
class OPENVINO_API GroupQueryAttention : public Op {
Is there any plugin able to support this GroupQueryAttention class right now, or is the decomposition to ScaledDotProductAttention always needed and applied?
AFAIK no plugin has GQA kernels, so the decomposition is always needed. @sgbihu
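That means any backend consuming the op runs the decomposition in its transformation pipeline; a minimal sketch of the usual registration (assuming the pass lives in ov::pass, matching the header discussed above):
ov::pass::Manager manager;
manager.register_pass<ov::pass::GroupQueryAttentionDecomposition>();
manager.run_passes(model);  // GQA is rewritten into a ScaledDotProductAttention-based subgraph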
Details:
Test scripts
Tickets: