Utilize ext data location to reduce qd matmul memory usage #21451
Conversation
What scenario requires this vs. say optimizing the model to level 2 offline for now? We should be really clear on whether it's required before checking in a hack. It feels a little intrusive, so I wonder if we're maybe solving the wrong problem.

This comes up semi-regularly, so the need to hack around the current setup may be a good forcing function for making a change to when we convert from a model-storage-specific data type (TensorProto) to ORT data types (OrtValue). i.e. it may be better to create OrtValue instances for initializers during model loading. Doing so potentially enables additional optimizations like doing an in-place transpose, so we only need a temporary buffer for the transpose before storing the result in the original memory allocation.

Not trivial though. To change from TensorProto to OrtValue during model loading would mean all the optimizers need to be updated. We could maybe use an adapter class in the optimizers and other ORT code that expects protobuf data types to pretend an OrtValue is a TensorProto, so we can migrate code gradually. And if we're going to break the connection on protobuf data types during model load, the changes required also extend to any other usage of protobuf types in the code, like AttributeProto in the onnxruntime::Node class.
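For context, the "optimize to level 2 offline" alternative mentioned above can be done with the public C++ API roughly as follows. This is a minimal sketch (model paths are placeholders), not part of this PR:

```cpp
// Sketch: save a model optimized to the extended (level 2) set offline,
// so fusions such as DQ + MatMul -> MatMulNBits are baked into the saved
// model and do not have to run on every session creation.
#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env{ORT_LOGGING_LEVEL_WARNING, "offline_opt"};
  Ort::SessionOptions so;
  so.SetGraphOptimizationLevel(ORT_ENABLE_EXTENDED);         // level 2 optimizations
  so.SetOptimizedModelFilePath(ORT_TSTR("model.opt.onnx"));  // write optimized model to disk
  Ort::Session session{env, ORT_TSTR("model.qdq.onnx"), so}; // placeholder paths
  return 0;
}
```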
This is going to be a good discussion. These changes are for QDQ models. If every time a user wants to run a large QDQ model in ORT they first need to optimize the model offline, that will probably drive the user away from ORT. I second "it may be better to create OrtValue instances for initializers during model loading". Currently I don't have a good idea of how to do it. TensorProto is final, and how to make the code accept both TensorProto and OrtValue is not clear to me. I would greatly appreciate it if you could share ideas. We can start from TensorProto, which is the real headache, then extend the approach to other protobuf types.

In reply to: 2244451018
@skottmckay, this happens in the fusion of DQ + MatMul to MatMulNBits, and it is for the case where users need a model with only standard ONNX operators. The alternative proposal (converting TensorProto to OrtValue) needs a total re-design of the workflow. In addition to what you describe, the Graph is also built on top of protobuf, like Graph.Resolve. We would need to re-design the Graph on top of OrtValue too.
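To make the gradual-migration idea from the discussion above concrete, here is a rough, hypothetical sketch of the adapter approach. The `ConstInitializerView` name and interface are invented for illustration and do not exist in ORT or in this PR; the headers and accessors are assumed ORT-internal APIs:

```cpp
// Hypothetical adapter: optimizer code is written against this view, so an
// initializer can be backed by either a TensorProto (today) or an OrtValue
// (after migration) without touching every optimizer at once.
#include <cstdint>
#include <vector>

#include "core/framework/ort_value.h"  // assumed ORT-internal headers
#include "core/framework/tensor.h"
#include "onnx/onnx_pb.h"

class ConstInitializerView {  // invented name, for illustration only
 public:
  explicit ConstInitializerView(const ONNX_NAMESPACE::TensorProto& proto) : proto_{&proto} {}
  explicit ConstInitializerView(const OrtValue& value) : value_{&value} {}

  // Common queries the optimizers need; each forwards to whichever backing
  // representation is present.
  const void* Data() const {
    if (value_) return value_->Get<onnxruntime::Tensor>().DataRaw();
    return proto_->raw_data().data();  // simplification: assumes raw_data is populated
  }

  std::vector<int64_t> Shape() const {
    if (value_) {
      const auto& dims = value_->Get<onnxruntime::Tensor>().Shape().GetDims();
      return {dims.begin(), dims.end()};
    }
    return {proto_->dims().begin(), proto_->dims().end()};
  }

 private:
  const ONNX_NAMESPACE::TensorProto* proto_{nullptr};
  const OrtValue* value_{nullptr};
};
```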
/azp run Windows GPU CI Pipeline,Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline
Command 'Windows' is not supported by Azure Pipelines. Supported commands are listed in the additional documentation.
Azure Pipelines successfully started running 2 pipeline(s).
/azp run Windows GPU CI Pipeline
No pipelines are associated with this pull request.
Windows GPU CI Pipeline is deleted. You can ignore it.
Description
When the graph is quantized to QDQ format, DQ + MatMul is transformed to MatMulNBits by the level 2 optimizer when the model is initialized in an inference session.
In the transformation step, tensors are transposed and new tensor protos are created. Instead of using protobuf arena-allocated memory, this PR sets the tensor proto to use an external buffer and points the external data location to the memory that contains the tensor buffer allocated on the CPU.
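As a rough illustration of pointing a TensorProto at an externally owned buffer, the generated ONNX protobuf API can be used as below. This is a sketch only; the "location"/"offset"/"length" keys follow the ONNX external-data convention, and the exact in-memory location convention ORT uses internally may differ:

```cpp
// Sketch: mark a TensorProto as using external data instead of copying the
// bytes into protobuf-owned (arena) storage.
#include <cstdint>
#include <string>

#include "onnx/onnx_pb.h"

void PointAtExternalBuffer(ONNX_NAMESPACE::TensorProto& proto,
                           const std::string& location,  // e.g. a file path or an internal in-memory tag
                           int64_t offset, int64_t length) {
  proto.set_data_location(ONNX_NAMESPACE::TensorProto_DataLocation_EXTERNAL);
  proto.clear_raw_data();  // ensure no copy of the bytes lives inside the proto

  auto add_entry = [&proto](const std::string& key, const std::string& value) {
    auto* entry = proto.add_external_data();
    entry->set_key(key);
    entry->set_value(value);
  };
  add_entry("location", location);
  add_entry("offset", std::to_string(offset));
  add_entry("length", std::to_string(length));
}
```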
Then, in the step that creates an OrtValue from the tensor proto, the memory buffers referenced by the tensor proto are assigned directly to the tensors, which previously would have been allocated by the ORT arena.
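The internal mechanism is ORT-specific, but the general "wrap an existing buffer without copying" pattern can be illustrated with the public C++ API (a sketch; the buffer and shape values are made up):

```cpp
// Sketch: create an Ort::Value that views a caller-owned CPU buffer instead
// of allocating and copying. The buffer must outlive the Ort::Value.
#include <array>
#include <cstdint>
#include <vector>

#include <onnxruntime_cxx_api.h>

int main() {
  std::vector<float> buffer(2 * 3, 0.0f);  // caller-owned memory
  std::array<int64_t, 2> shape{2, 3};      // made-up shape

  Ort::MemoryInfo mem_info =
      Ort::MemoryInfo::CreateCpu(OrtDeviceAllocator, OrtMemTypeDefault);

  // No copy: the tensor references `buffer` directly.
  Ort::Value tensor = Ort::Value::CreateTensor<float>(
      mem_info, buffer.data(), buffer.size(), shape.data(), shape.size());
  return 0;
}
```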
With these two steps, the peak memory usage of the QDQ-format model is the same as that of the QOperator model, and the model initialization time is significantly reduced. Take [Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) for example:

|  | QOperator Model (MatMulNBits) | QDQ Model (DQ + MatMul, original code) | QDQ Model (this PR) |
|---|---|---|---|
| peak memory consumption | 2.8 GB | ~4.8 GB | 2.8 GB |
| initialization time | 3 sec | 9 sec | 5 sec |
Motivation and Context
When the graph is quantized to QDQ format, DQ + MatMul is converted to MatMulNBits by the level 2 optimizer.
Originally, the newly created tensor protos use memory allocated by the protobuf arena. This memory cannot be fully released when the tensor protos are deleted.
Then, in the tensor-proto-to-OrtValue step, tensors are created using the ORT arena. Later, in the pre-pack step for MatMulNBits, new OrtValues are created, and the tensors in the ORT arena are not fully released either.
The two arena memory allocation steps in the DQ + MatMul -> MatMulNBits transformation result in almost 2x memory consumption during model initialization.