
Utilize ext data location to reduce qd matmul memory usage #21451

Merged · 10 commits merged into main from fajin/dqmatmulmemoryusage · Jul 30, 2024

Conversation

@fajin-corp (Contributor) commented Jul 23, 2024

Description

When the graph is quantized to QDQ format, DQ + MatMul is transformed to MatMulNBits by the level 2 optimizer when the model is initialized in an inference session.

In the transformation step, tensors are transposed and new tensor protos are created. Instead of using protobuf arena-allocated memory, this PR sets the tensor proto to use an external buffer and points the external location at the memory that holds the tensor buffer allocated on the CPU.

Then, in the step that creates an OrtValue from the tensor proto, the memory buffers referenced by the tensor proto are assigned directly to the tensors, which previously would have been allocated from the ORT arena.
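To make the first step concrete, below is a minimal sketch of how a tensor proto can be pointed at an existing CPU buffer through the standard `external_data` / `data_location` fields of `onnx.TensorProto`. The helper name `PointTensorProtoAtBuffer` and the `kInMemoryLocationTag` marker are placeholders for illustration; they are not the exact constants or helpers used by this PR.

```cpp
// Hypothetical sketch: mark a TensorProto as externally located and make the
// "external" location describe a buffer that already lives in CPU memory,
// instead of letting protobuf copy the bytes into its arena.
#include <cstddef>
#include <cstdint>
#include <string>

#include "onnx/onnx_pb.h"  // generated onnx::TensorProto

// Placeholder for whatever marker the runtime uses to recognize that this
// "external" location is really a live in-memory buffer, not a file path.
constexpr const char* kInMemoryLocationTag = "<in-memory-buffer>";

void PointTensorProtoAtBuffer(onnx::TensorProto& proto,
                              const void* buffer, size_t nbytes) {
  proto.set_data_location(onnx::TensorProto::EXTERNAL);
  proto.clear_external_data();

  auto* location = proto.add_external_data();
  location->set_key("location");
  location->set_value(kInMemoryLocationTag);

  auto* offset = proto.add_external_data();
  offset->set_key("offset");
  // Encoding the raw pointer value in the "offset" entry is one way an
  // in-memory address can be carried through the external-data convention.
  offset->set_value(std::to_string(reinterpret_cast<uintptr_t>(buffer)));

  auto* length = proto.add_external_data();
  length->set_key("length");
  length->set_value(std::to_string(nbytes));
}
```

The second step is then the mirror image: when the OrtValue is created, the loader recognizes the in-memory location and adopts the buffer instead of copying it into a fresh arena allocation.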

With these two steps, the peak memory usage of the QDQ format model is the same as that of the QOperator model. In addition, the model initialization time is significantly reduced. Take [Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) for example:

| | QOperator Model (MatMulNBits) | QDQ Model (DQ + MatMul, original code) | QDQ Model (this PR) |
|---|---|---|---|
| peak memory consumption | 2.8 GB | ~4.8 GB | 2.8 GB |
| initialization time | 3 sec | 9 sec | 5 sec |

Motivation and Context

When the graph is quantized to QDQ format, DQ + MatMul is converted to MatMulNBits by the level 2 optimizer.

Originally, the newly created tensor protos used memory allocated by the protobuf arena. That memory cannot be fully released when the tensor protos are deleted.
Then, in the tensor-proto-to-OrtValue step, tensors are created using the ORT arena. Later, in the pre-pack step for MatMulNBits, new OrtValues are created, and the tensors in the ORT arena are likewise not fully released.

These two arena allocation steps in the DQ + MatMul -> MatMulNBits transformation result in almost 2x memory consumption during model initialization.

@fajin-corp fajin-corp requested review from skottmckay and snnn July 23, 2024 01:13
@skottmckay skottmckay requested a review from pranavsharma July 23, 2024 07:21
@skottmckay (Contributor)

What scenario requires this vs. say optimizing the model to level 2 offline for now? We should be really clear on whether it's required before checking in a hack.

It feels a little intrusive so I wonder if we're maybe solving the wrong problem.

This comes up semi-regularly, so the need to hack around the current setup may be a good forcing function for making a change to when we convert from a model storage specific data type (TensorProto) to ORT data types (OrtValue). i.e. it may be better to create OrtValue instances for initializers during model loading. Doing so potentially enables additional optimizations like doing an in-place transpose so we only need a temporary buffer for the transpose before storing the result in the original memory allocation.

Not trivial though. To change from TensorProto to OrtValue during model loading would mean all the optimizers need to be updated. We could maybe use an adapter class in the optimizers and other ORT code that expects protobuf data types to pretend an OrtValue is a TensorProto so we can migrate code gradually.
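For the adapter idea, a purely hypothetical sketch of what such a shim could look like is below. All type names are invented for illustration; the read interface simply mirrors the `TensorProto` accessor style that optimizer code already calls, while the bytes live in an OrtValue-style tensor.

```cpp
// Illustrative adapter sketch (hypothetical names, not ORT APIs): give
// optimizer code a TensorProto-like read-only view over tensor data that is
// actually owned by an OrtValue-style object.
#include <cstdint>
#include <vector>

struct TensorData {                 // stand-in for the tensor held by an OrtValue
  std::vector<int64_t> dims;
  int32_t data_type;                // same enum values as onnx::TensorProto::DataType
  const void* raw_data;
  size_t raw_size;
};

class TensorProtoView {             // what optimizers would consume instead of TensorProto
 public:
  explicit TensorProtoView(const TensorData& t) : t_(t) {}

  // Accessors named after the protobuf-generated TensorProto interface so
  // existing call sites could migrate with minimal churn.
  int dims_size() const { return static_cast<int>(t_.dims.size()); }
  int64_t dims(int i) const { return t_.dims[static_cast<size_t>(i)]; }
  int32_t data_type() const { return t_.data_type; }
  const void* raw_data() const { return t_.raw_data; }
  size_t raw_data_size() const { return t_.raw_size; }

 private:
  const TensorData& t_;
};
```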

And if we're going to break the connection on protobuf data types during model load the changes required also extend to any other usage of protobuf types in the code like AttributeProto in the onnxruntime::Node class.

@fajin-corp (Contributor, Author)

This is going to be a good discussion.

These changes are for QDQ models. If users have to optimize a model offline every time they want to run a large QDQ model in ORT, that will probably drive them away from ORT.

I second "it may be better to create OrtValue instances for initializers during model loading". Currently I don't have a good idea of how to do it. TensorProto is final, and how to make code accept both TensorProto and OrtValue is not clear to me. I would greatly appreciate it if you can share ideas. We can start from TensorProto, which is the real headache, then extend the approach to other protobuf types.


In reply to: 2244451018

@fajin-corp fajin-corp force-pushed the fajin/dqmatmulmemoryusage branch from dbc2903 to 4ac79aa Compare July 25, 2024 23:31
@yufenglee (Member)

> What scenario requires this vs. say optimizing the model to level 2 offline for now? We should be really clear on whether it's required before checking in a hack.

@skottmckay, this happens in the fusion of DQ + MatMul into MatMulNBits, and it is for the case where users need a model with only standard ONNX operators.

As for the alternative proposal (converting TensorProto to OrtValue), it needs a total redesign of the workflow. In addition to what you describe, the Graph is also built on top of protobuf, e.g. Graph.Resolve. We would need to redesign the Graph on top of OrtValue too.

yufenglee
yufenglee previously approved these changes Jul 30, 2024
@fajin-corp fajin-corp changed the title from "hack ext data location to reduce qd matmul memory usage" to "Utilize ext data location to reduce qd matmul memory usage" Jul 30, 2024
@fajin-corp (Contributor, Author) commented Jul 30, 2024

/azp run Windows GPU CI Pipeline,Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline


Command 'Windows' is not supported by Azure Pipelines.

Supported commands
  • help:
    • Get descriptions, examples and documentation about supported commands
    • Example: help "command_name"
  • list:
    • List all pipelines for this repository using a comment.
    • Example: "list"
  • run:
    • Run all pipelines or specific pipelines for this repository using a comment. Use this command by itself to trigger all related pipelines, or specify specific pipelines to run.
    • Example: "run" or "run pipeline_name, pipeline_name, pipeline_name"
  • where:
    • Report back the Azure DevOps orgs that are related to this repository and org
    • Example: "where"

See additional documentation.


Azure Pipelines successfully started running 2 pipeline(s).

@fajin-corp (Contributor, Author)

/azp run Windows GPU CI Pipeline


No pipelines are associated with this pull request.

@snnn (Member) commented Jul 30, 2024

Windows GPU CI Pipeline is deleted. You can ignore it.

@fajin-corp fajin-corp merged commit e7aa116 into main Jul 30, 2024
94 of 98 checks passed
@fajin-corp fajin-corp deleted the fajin/dqmatmulmemoryusage branch July 30, 2024 22:22
@fajin-corp fajin-corp added the release:1.19.0 (Cherry pick to ORT 1.19) label Jul 31, 2024
prathikr pushed a commit that referenced this pull request Aug 3, 2024
prathikr pushed a commit that referenced this pull request Aug 5, 2024
@prathikr prathikr added the cherry-picked (Cherry-picked for a cherrypicks branch) label Aug 6, 2024
Labels: cherry-picked (Cherry-picked for a cherrypicks branch), release:1.19.0 (Cherry pick to ORT 1.19)
5 participants