Utilize ext data location to reduce qd matmul memory usage #21451
Conversation
What scenario requires this vs. say optimizing the model to level 2 offline for now? We should be really clear on whether it's required before checking in a hack. It feels a little intrusive, so I wonder if we're maybe solving the wrong problem.

This comes up semi-regularly, so the need to hack around the current setup may be a good forcing function for making a change to when we convert from a model-storage-specific data type (TensorProto) to ORT data types (OrtValue). i.e. it may be better to create OrtValue instances for initializers during model loading. Doing so potentially enables additional optimizations like doing an in-place transpose, so we only need a temporary buffer for the transpose before storing the result in the original memory allocation.

Not trivial though. To change from TensorProto to OrtValue during model loading would mean all the optimizers need to be updated. We could maybe use an adapter class in the optimizers and other ORT code that expects protobuf data types to pretend an OrtValue is a TensorProto, so we can migrate code gradually. And if we're going to break the connection on protobuf data types during model load, the changes required also extend to any other usage of protobuf types in the code, like AttributeProto in the onnxruntime::Node class.
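For context, the "optimize to level 2 offline" alternative mentioned above can be done with the public C++ API roughly as follows. This is a minimal sketch (model paths are placeholders), not part of this PR:

```cpp
// Sketch: save a model optimized to the extended (level 2) set offline,
// so fusions such as DQ + MatMul -> MatMulNBits are baked into the saved
// model and do not have to run on every session creation.
#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env{ORT_LOGGING_LEVEL_WARNING, "offline_opt"};
  Ort::SessionOptions so;
  so.SetGraphOptimizationLevel(ORT_ENABLE_EXTENDED);         // level 2 optimizations
  so.SetOptimizedModelFilePath(ORT_TSTR("model.opt.onnx"));  // write optimized model to disk
  Ort::Session session{env, ORT_TSTR("model.qdq.onnx"), so}; // placeholder paths
  return 0;
}
```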
This is going to be a good discussion. These changes are for QDQ models. If every time a user wants to run a large QDQ model in ORT they first need to optimize the model offline, that will probably drive the user away from ORT. I second "it may be better to create OrtValue instances for initializers during model loading". Currently I don't have a good idea of how to do it. TensorProto is final, and how to make the code accept both TensorProto and OrtValue is not clear to me. I would greatly appreciate it if you could share ideas. We can start from TensorProto, which is the real headache, then extend the approach to other protobuf types.

In reply to: 2244451018
@skottmckay, this happens in the fusion of DQ + MatMul to MatMulNBits, and it is for the case where users need a model with only standard ONNX operators. The alternative proposal (converting TensorProto to OrtValue) needs a total re-design of the workflow. In addition to what you describe, the Graph is also built on top of protobuf, like Graph.Resolve. We would need to re-design the Graph on top of OrtValue too.
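To make the gradual-migration idea from the discussion above concrete, here is a rough, hypothetical sketch of the adapter approach. The `ConstInitializerView` name and interface are invented for illustration and do not exist in ORT or in this PR; the headers and accessors are assumed ORT-internal APIs:

```cpp
// Hypothetical adapter: optimizer code is written against this view, so an
// initializer can be backed by either a TensorProto (today) or an OrtValue
// (after migration) without touching every optimizer at once.
#include <cstdint>
#include <vector>

#include "core/framework/ort_value.h"  // assumed ORT-internal headers
#include "core/framework/tensor.h"
#include "onnx/onnx_pb.h"

class ConstInitializerView {  // invented name, for illustration only
 public:
  explicit ConstInitializerView(const ONNX_NAMESPACE::TensorProto& proto) : proto_{&proto} {}
  explicit ConstInitializerView(const OrtValue& value) : value_{&value} {}

  // Common queries the optimizers need; each forwards to whichever backing
  // representation is present.
  const void* Data() const {
    if (value_) return value_->Get<onnxruntime::Tensor>().DataRaw();
    return proto_->raw_data().data();  // simplification: assumes raw_data is populated
  }

  std::vector<int64_t> Shape() const {
    if (value_) {
      const auto& dims = value_->Get<onnxruntime::Tensor>().Shape().GetDims();
      return {dims.begin(), dims.end()};
    }
    return {proto_->dims().begin(), proto_->dims().end()};
  }

 private:
  const ONNX_NAMESPACE::TensorProto* proto_{nullptr};
  const OrtValue* value_{nullptr};
};
```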
/azp run Windows GPU CI Pipeline,Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline
Command 'Windows' is not supported by Azure Pipelines. Supported commands are listed in the additional documentation.
Azure Pipelines successfully started running 2 pipeline(s).
/azp run Windows GPU CI Pipeline
No pipelines are associated with this pull request.
Windows GPU CI Pipeline is deleted. You can ignore it.
Description
When the graph is quantized to QDQ format, DQ + MatMul is transformed to MatMulNBits by the level 2 optimizer when the model is initialized in an inference session.
In the transformation step, tensors are transposed and new tensor protos are created. Instead of using protobuf arena-allocated memory, this PR sets the tensor proto to use an external buffer and points the external data location to the memory that contains the tensor buffer allocated on the CPU.
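As a rough illustration of pointing a TensorProto at an externally owned buffer, the generated ONNX protobuf API can be used as below. This is a sketch only; the "location"/"offset"/"length" keys follow the ONNX external-data convention, and the exact in-memory location convention ORT uses internally may differ:

```cpp
// Sketch: mark a TensorProto as using external data instead of copying the
// bytes into protobuf-owned (arena) storage.
#include <cstdint>
#include <string>

#include "onnx/onnx_pb.h"

void PointAtExternalBuffer(ONNX_NAMESPACE::TensorProto& proto,
                           const std::string& location,  // e.g. a file path or an internal in-memory tag
                           int64_t offset, int64_t length) {
  proto.set_data_location(ONNX_NAMESPACE::TensorProto_DataLocation_EXTERNAL);
  proto.clear_raw_data();  // ensure no copy of the bytes lives inside the proto

  auto add_entry = [&proto](const std::string& key, const std::string& value) {
    auto* entry = proto.add_external_data();
    entry->set_key(key);
    entry->set_value(value);
  };
  add_entry("location", location);
  add_entry("offset", std::to_string(offset));
  add_entry("length", std::to_string(length));
}
```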
Then, in the step that creates an OrtValue from the tensor proto, the memory buffers referenced by the tensor proto are assigned directly to the tensors, which previously would have been allocated by the ORT arena.
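The internal mechanism is ORT-specific, but the general "wrap an existing buffer without copying" pattern can be illustrated with the public C++ API (a sketch; the buffer and shape values are made up):

```cpp
// Sketch: create an Ort::Value that views a caller-owned CPU buffer instead
// of allocating and copying. The buffer must outlive the Ort::Value.
#include <array>
#include <cstdint>
#include <vector>

#include <onnxruntime_cxx_api.h>

int main() {
  std::vector<float> buffer(2 * 3, 0.0f);  // caller-owned memory
  std::array<int64_t, 2> shape{2, 3};      // made-up shape

  Ort::MemoryInfo mem_info =
      Ort::MemoryInfo::CreateCpu(OrtDeviceAllocator, OrtMemTypeDefault);

  // No copy: the tensor references `buffer` directly.
  Ort::Value tensor = Ort::Value::CreateTensor<float>(
      mem_info, buffer.data(), buffer.size(), shape.data(), shape.size());
  return 0;
}
```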
With these two steps, the peak memory usage of the QDQ-format model is the same as that of the QOperator model, and the model initialization time is significantly reduced. Take [Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) for example:

|  | QOperator Model (MatMulNBits) | QDQ Model (DQ + MatMul, original code) | QDQ Model (this PR) |
|---|---|---|---|
| peak memory consumption | 2.8 GB | ~4.8 GB | 2.8 GB |
| initialization time | 3 sec | 9 sec | 5 sec |
Motivation and Context
When the graph is quantized to QDQ format, DQ + MatMul is converted to MatMulNBits by the level 2 optimizer.
Originally, the newly created tensor protos use memory allocated by the protobuf arena. This memory cannot be fully released when the tensor protos are deleted.
Then, in the tensor-proto-to-OrtValue step, tensors are created using the ORT arena. Later, in the pre-pack step for MatMulNBits, new OrtValues are created, and the tensors in the ORT arena are not fully released either.
The two arena memory allocation steps in the DQ + MatMul -> MatMulNBits transformation result in almost 2x memory consumption during model initialization.