Qualcomm AI Engine Direct - Support kv_cached stories 110M llama2 #4142

Closed
wants to merge 1 commit

Conversation

chunit-quic
Collaborator

  • Add custom memory descriptor
  • Add e2e example script verified with stories 110M in 8a8w, 16a4w
  • Add qnn_llama_runner to run static LLAMA.
  • Add readme
  • Add slice op test
  • Change RemoveClone to RemoveRedundancy
  • Change SimpleADB parameter artifact to build_path and update related code
  • Change multi-head attention to multiple single-head attentions (see the sketch after this list).
  • Move sort inputs from execute to init
  • Remove split op
  • Support u16 and u8 mixed-precision quantization.
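
A minimal, hypothetical sketch of the "multi-head attention to multiple single-head attentions" item above (not the code in this PR): it only illustrates that computing each head independently and concatenating the results is numerically equivalent to one batched multi-head attention, which is the kind of structure that tends to map better onto HTP. All names and shapes below are illustrative assumptions.

import torch
import torch.nn.functional as F

def multi_head(q, k, v, n_heads):
    # Reference: one batched attention over all heads at once.
    b, s, d = q.shape
    hd = d // n_heads
    def split(x):
        return x.view(b, s, n_heads, hd).transpose(1, 2)  # (b, h, s, hd)
    out = F.scaled_dot_product_attention(split(q), split(k), split(v))
    return out.transpose(1, 2).reshape(b, s, d)

def single_heads(q, k, v, n_heads):
    # Same computation expressed as n_heads independent single-head attentions.
    b, s, d = q.shape
    hd = d // n_heads
    outs = []
    for h in range(n_heads):
        sl = slice(h * hd, (h + 1) * hd)
        outs.append(F.scaled_dot_product_attention(q[..., sl], k[..., sl], v[..., sl]))
    return torch.cat(outs, dim=-1)

q, k, v = (torch.randn(1, 4, 64) for _ in range(3))
assert torch.allclose(multi_head(q, k, v, 8), single_heads(q, k, v, 8), atol=1e-4)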


pytorch-bot bot commented Jul 3, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/4142

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ You can merge normally! (2 Unrelated Failures)

As of commit 14859bf with merge base f32d707:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 3, 2024
@chunit-quic
Collaborator Author

chunit-quic commented Jul 3, 2024

@cccclai

This PR contains our own static LLAMA and the corresponding runner. We use stories 110M for verification. Based on this PR, two more PRs will be submitted soon:

  1. Unified model
  2. LLAMA 7B from AI Hub
    For the second PR, we leverage the pre-compiled context binary from AI Hub and export it to a .pte file. More details will be included in the upcoming PR.

By the way, we noticed that some of our test cases may encounter errors if we change capture_pre_autograd_graph to torch.export. Taking a test case in test_qnn_delegate.py as an example (TestQNNQuantizedModel.test_qnn_backend_pixel_unshuffle_math_equivalent), it fails at prepare_pt2e. Could you kindly give us some recommendations?

Thank you very much!

@chiwwang
Collaborator

chiwwang commented Jul 3, 2024

This PR is a base for QC llama2 tasks.
The core part of this PR is the changes to the builders and the quantizer.
Besides, we provide an example llama2 script, examples/qualcomm/llama2/llama.py.
It has a structure friendlier to HTP.

Nonetheless, we think it's better to use the same Meta llama model.
So we continue our work in another PR, #4030.

Future work will be in #4030, but this PR is a base for that.

@manuelcandales manuelcandales added the partner: qualcomm For backend delegation, kernels, demo, etc. from the 3rd-party partner, Qualcomm label Jul 3, 2024
@cccclai
Contributor

cccclai commented Jul 3, 2024

This is a great milestone! Thank you for putting it up. Regarding this

By the way, we noticed that some of our test cases may encounter errors if we change capture_pre_autograd_graph to torch.export. Taking a test case in test_qnn_delegate.py as an example (TestQNNQuantizedModel.test_qnn_backend_pixel_unshuffle_math_equivalent), it fails at prepare_pt2e. Could you kindly give us some recommendations?

I don't see the change from capture_pre_autograd_graph to torch.export in this PR. I assume the test still passes in this PR?

@facebook-github-bot
Contributor

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@cccclai
Contributor

cccclai commented Jul 3, 2024

Another thing is - is there a way to run the test on the host machine instead of on device, like a simulator?

@@ -27,4 +27,3 @@ set(_common_include_directories ${EXECUTORCH_ROOT}/..)

add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/hifi/operators)
add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/hifi/kernels)
Contributor

I think this is fixed. Mind rebasing again?

Collaborator Author

@chunit-quic chunit-quic Jul 4, 2024

Done. Thank you



@register_node_visitor
class Cast(NodeVisitor):
Contributor

What was the reason to remove it?

Collaborator Author

Pardon for the misleading change. We renamed this file to op_to.py.

The reason is that we need to convert the torch to_copy op to one of two backend ops, Cast or Convert, depending on a condition. Therefore we made a new file for it.
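
As an aside, here is a small hypothetical sketch of that conditional dispatch. It is not the actual op_to.py builder; the dtype-based condition and the helper name are assumptions for illustration only, since the thread does not state the exact condition used.

import torch

# Assumption: fixed-point dtypes that carry quantization encodings in this backend.
QUANTIZED_DTYPES = {torch.int8, torch.uint8, torch.int16, torch.int32}

def pick_backend_op(input_dtype, output_dtype):
    # If both sides are quantized types, the op is a requantization -> Convert.
    # Otherwise (e.g. float <-> fixed-point), it is a plain data-type change -> Cast.
    both_quantized = input_dtype in QUANTIZED_DTYPES and output_dtype in QUANTIZED_DTYPES
    return "Convert" if both_quantized else "Cast"

print(pick_backend_op(torch.uint8, torch.int32))    # Convert
print(pick_backend_op(torch.float32, torch.uint8))  # Cast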

call_delegate = [
    node
    for node in graph_module.graph.nodes
    if node.op == "call_function" and node.name == "executorch_call_delegate"
]
Contributor

Did you expect to run this pass after to_backend? Was it because of some issue you ran into?

Collaborator Author

Thank you for pointing it out.
Yes, we intentionally invoke this pass after to_backend to change the IO data type to a non-FP type.

Contributor

I'm trying to understand the idea. It seems like the fold-QDQ pass will remove all the q/dq nodes, and the insert-IO-QDQ pass will insert q/dq nodes for the subgraph inside the QNN backend again. In this case, why would the IO data type become a non-FP type?

Collaborator Author

Right, the flow you mentioned above is exactly what we did before. Yet in LLAMA we specifically deal with its KV IOs. Quantizing and dequantizing are meaningless for these IOs, so we keep them as a quantized type all the time.

Here is what we do for this kind of tensor (a rough sketch follows below):
1. Tag IO nodes as a quantized type in examples/qualcomm/llama2/llama.py:340
2. Skip inserting quantize/dequantize nodes for them in insert_io_qdq
3. Set the correct quantized type on their spec in examples/qualcomm/llama2/llama.py:316

Feel free to let me know if anything is unclear to you. Thank you. :D
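
To make the steps above concrete, here is a rough, hypothetical sketch of step 2 (not the actual insert_io_qdq pass): IO nodes that were tagged as already quantized are simply skipped when inserting q/dq ops. The meta key name and the overall pass shape are assumptions for illustration.

import torch
from torch import fx

QUANTIZED_IO_TAG = "quantized_io"  # assumed tag name, set for KV IOs in step 1

def insert_io_qdq_sketch(graph_module: fx.GraphModule) -> fx.GraphModule:
    for node in graph_module.graph.nodes:
        if node.op != "placeholder":
            continue  # only graph inputs are considered in this simplified sketch
        if node.meta.get(QUANTIZED_IO_TAG, False):
            # KV-cache IOs stay in their quantized dtype end to end, so no
            # quantize/dequantize node is inserted for them.
            continue
        # ... insert a quantize node after this input here (elided) ...
    graph_module.recompile()
    return graph_module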

@chunit-quic
Collaborator Author

chunit-quic commented Jul 4, 2024

This is a great milestone! Thank you for putting it up. Regarding this

Thank you for the prompt reply! Really appreciate it. :D

I don't see the change from capture_pre_autograd_graph to torch.export in this PR. I assume the test still passes in this PR?

Yes, we are still using capture_pre_autograd_graph, and this PR works fine. The following is the error log if we change it to torch.export:

diff --git a/backends/qualcomm/tests/utils.py b/backends/qualcomm/tests/utils.py
index 295033e5..0e615dee 100644
--- a/backends/qualcomm/tests/utils.py
+++ b/backends/qualcomm/tests/utils.py
@@ -230,7 +230,7 @@ class TestQNN(unittest.TestCase):
         custom_quant_annotations: Tuple[Callable] = (),
         quant_dtype: QuantDtype = QuantDtype.use_8a8w,
     ) -> torch.fx.GraphModule:
-        m = torch._export.capture_pre_autograd_graph(module, inputs)
+        m = torch.export.export(module, inputs).module()

         quantizer = QnnQuantizer()
         quantizer.add_custom_quant_annotations(custom_quant_annotations)

======================================================================
ERROR: test_qnn_backend_pixel_unshuffle_math_equivalent (__main__.TestQNNQuantizedModel)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "${executorch}/backends/qualcomm/tests/test_qnn_delegate.py", line 1107, in test_qnn_backend_pixel_unshuffle_math_equivalent
    module = self.get_qdq_module(module, sample_input)
  File "${executorch}/backends/qualcomm/tests/utils.py", line 253, in get_qdq_module
    prepared(*inputs)
  File "${python3.10}/site-packages/torch/fx/graph_module.py", line 738, in call_wrapped
    return self._wrapped_call(self, *args, **kwargs)
  File "${python3.10}/site-packages/torch/fx/graph_module.py", line 316, in __call__
    raise e
  File "${python3.10}/site-packages/torch/fx/graph_module.py", line 303, in __call__
    return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
  File "${python3.10}/site-packages/torch/nn/modules/module.py", line 1657, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "${python3.10}/site-packages/torch/nn/modules/module.py", line 1675, in _call_impl
    return forward_call(*args, **kwargs)
  File "<eval_with_key>.35", line 11, in forward
    _unsafe_view = torch.ops.aten._unsafe_view.default(activation_post_process_2, [2, 8, 3, 3]);  activation_post_process_2 = None
  File "${python3.10}/site-packages/torch/_ops.py", line 670, in __call__
    return self_._op(*args, **kwargs)
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

----------------------------------------------------------------------
Ran 1 test in 0.111s

@chunit-quic
Collaborator Author

Another thing is - is there a way to run the test on the host machine instead of on device, like a simulator?

Should be possible. We could link the HTP library to x86_64-linux-clang/libQnnHtp.so and try to run on x86. Yet currently we don't have enough resources. We will add it to our TODO list.

@chiwwang
Collaborator

chiwwang commented Jul 4, 2024

Another thing is - is there a way to run the test on the host machine instead of on device, like a simulator?

Should be possible. We could link the HTP library to x86_64-linux-clang/libQnnHtp.so and try to run on x86. Yet currently we don't have enough resources. We will add it to our TODO list.

Actually it has been in our TODO list for a while..... 😢

@facebook-github-bot
Contributor

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@cccclai
Contributor

cccclai commented Jul 4, 2024

Another note: I noticed there are many passes running before to_backend for recomposing ops, and the capture_program function controls the decomp table a bit.

There is an experimental API, _to_edge_transform_and_lower, that landed recently to stop decomposing some ops, in case you find it helpful (a usage sketch follows below).
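
For reference, a hedged usage sketch of that experimental API. The exact name (_to_edge_transform_and_lower vs. to_edge_transform_and_lower), import path, and arguments are assumptions based on this comment and may differ in the tree at this point in time; the XNNPACK partitioner is used only to keep the sketch self-contained, while the QNN partitioner would additionally take compiler specs.

import torch
from executorch.exir import to_edge_transform_and_lower  # was _to_edge_transform_and_lower when experimental
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

class TinyLinear(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 4)

    def forward(self, x):
        return self.linear(x)

example_inputs = (torch.randn(2, 8),)
ep = torch.export.export(TinyLinear(), example_inputs)

# Partitioners report which ops they want kept un-decomposed; the API uses that
# information to skip decomposing those ops before lowering to the backend.
edge = to_edge_transform_and_lower(ep, partitioner=[XnnpackPartitioner()])
exec_prog = edge.to_executorch()
with open("tiny_linear.pte", "wb") as f:
    f.write(exec_prog.buffer)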

@chunit-quic
Collaborator Author

Another note: I noticed there are many passes running before to_backend for recomposing ops, and the capture_program function controls the decomp table a bit.

There is an experimental API, _to_edge_transform_and_lower, that landed recently to stop decomposing some ops, in case you find it helpful.

Thank you for pointing it out! We will find some time to try this function in our lowering flow.

@cccclai
Contributor

cccclai commented Jul 5, 2024

Sure, and no pressure at all. It's an experimental API and there is no plan to replace it with the existing API. Just wanted to share it in case it's helpful (as we're also looking for feedback to improve the DevX for backend authors).

@facebook-github-bot
Contributor

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@cccclai merged this pull request in 5584b9e.

@cccclai
Contributor

cccclai commented Jul 8, 2024

So maybe I also need stack_trace to identify which node we want. Is it stable?

We can give it a shot. If it's problematic, we can work with the Compiler team to figure out a better solution.

Do you mean we need to handle the life cycle of processed in the custom op?
Originally, we load the QNN context binary in the init function of QnnBackend and unload it in the destroy function of QnnBackend, because the life cycle of the QNN context binary is decided by processed, which is kept by the ExecuTorch runtime framework. Is this understanding correct?

From what I understand, every shard is a QNN context binary and it takes almost all the memory on the DSP. With 4 shards, we'd need to load and unload for every inference. In the meantime, we'll use the spill buffer to optimize memory among shards.

If we use the custom op to shard the graph, I imagine we'd need to destroy the previous context binary and load the next one?
