
[WIP]backend: Integrating QNN (Qualcomm AI Engine Direct) as a dedicated backend for Qualcomm NPUs #12063

Draft
wants to merge 171 commits into
base: master

Conversation

@chraac chraac commented Feb 25, 2025

Warning: This is an early draft of my fork and will continue to be updated to meet the requirements in the contributing guidelines.

Summary

This fork is based on zhouwg's initial PR and performs further refactoring and improvements to introduce support for the Qualcomm QNN backend to GGML.

This backend is organized into three distinct integration layers:

graph TB
    subgraph GGML Adaptation Layer
        A1[Graph Caching, Mapping, and Execution]
        A2[Tensor Binding and Execution Flow]
    end

    subgraph QNN Object Layer
        B1[QNN System and Instance Management]
        B2[Dynamic Resource Handling]
    end

    subgraph Utility Layer
        C1[Dynamic Library Loading & Search Path Management]
        C2[General Utilities]
    end

    %% Relations to illustrate stack dependency
    A1 -->|Uses| B1
    A2 -->|Uses| B1
    B1 -->|Relies on| C1
  1. GGML Adaptation Layer

    • Graph Caching, Mapping, and Execution:

      • Provides a robust mechanism to map a GGML computation graph into a corresponding QNN graph, allowing efficient offloading of operations to the QNN accelerator.
      • Implements graph caching strategies (in backend-ops.cpp) to minimize redundant graph creation and boost execution performance.
      • Seamlessly translates GGML operations into corresponding QNN op objects using specialized op constructors and configuration functions (configured in op-config-caps.cpp and op-config-impl.cpp).
    • Tensor Binding and Execution Flow:

      • Adapts GGML tensor objects to the QNN backend (see tensor.hpp and graph.hpp), managing both host and RPC memory via buffer interfaces like qnn_buffer_interface.
      • Ensures proper data flow between GGML graphs and QNN execution contexts through carefully handled tensor binding/unbinding procedures.
  2. QNN Object Layer

    • QNN System and Instance Management:

      • Encapsulates the QNN system via the qnn_system_interface class, originally derived from executorch, to create and free the QNN system context.
      • Manages QNN instance creation and initialization via the qnn_instance class
      • Implements backend loading routines (e.g., load_backend() and load_system()) that retrieve provider lists and choose valid QNN interfaces based on API version checks.
      • Uses caching mechanisms for loaded backends and tracks library handles to guarantee proper cleanup during finalization.
    • Dynamic Resource Handling:

      • Integrates fallback mechanisms in load_lib_with_fallback() to reliably load both the system and RPC libraries.
      • Manages RPC memory allocation and deallocation via function pointer resolution from the loaded RPC library.
  3. Utility Layer

    • Dynamic Library Loading & Search Path Management:

      • Implements functions in qnn-lib.cpp to manage dynamic library loading with fallbacks.
      • Uses helper routines such as insert_path() and set_qnn_lib_search_path() to configure environment variables (like LD_LIBRARY_PATH on Linux and ADSP_LIBRARY_PATH on Android) based on a custom library search path (a minimal sketch follows this list).
    • General Utilities:

      • Provides detailed error and debug logging through QNN logging macros.
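To make the search-path handling concrete, below is a minimal, hypothetical sketch of what insert_path()/set_qnn_lib_search_path()-style helpers can look like; the actual implementations in qnn-lib.cpp may differ in signatures, Windows handling, and error reporting.

```cpp
// Hypothetical sketch (POSIX/Android only); the real helpers live in qnn-lib.cpp.
#include <cstdlib>
#include <string>

namespace qnn_sketch {

// Prepend `path` to the environment variable `env_name`, keeping any existing value.
static bool insert_path(const char * env_name, const std::string & path) {
    std::string  value     = path;
    const char * old_value = std::getenv(env_name);
    if (old_value && *old_value) {
        value += ":";
        value += old_value;
    }
    return setenv(env_name, value.c_str(), /*overwrite=*/1) == 0;
}

// Point the dynamic loader (and, on Android, the DSP loader) at a custom directory so
// that the QNN runtime libraries can be located at load time.
static bool set_qnn_lib_search_path(const std::string & custom_lib_dir) {
    bool ok = insert_path("LD_LIBRARY_PATH", custom_lib_dir);
#if defined(__ANDROID__)
    ok = insert_path("ADSP_LIBRARY_PATH", custom_lib_dir) && ok;
#endif
    return ok;
}

} // namespace qnn_sketch
```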

Key Features and Improvements

  • Graph Mapping Mechanism:

    • Efficient mapping of GGML graphs into QNN graphs is a standout feature, enabling the offloading and execution of computation graphs on hardware accelerators (see graph.hpp and backend-ops.cpp).
    • Graph caching strategies help reuse QNN graphs to reduce redundancy and enhance performance (see the sketch after this list).
    • The translation of GGML operations into corresponding QNN ops supports various data types and parameter configurations.
  • Backend Context and Device Management:

    • Comprehensive QNN instance initialization supports API negotiation, enhanced error handling, and detailed device property logging.
    • Detailed logs (chipset description, HTP architecture, VTCM memory size) facilitate debugging and performance tuning.
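To illustrate the caching idea (a simplified stand-in, not the actual code in backend-ops.cpp), a cache keyed by the generated graph key could be structured roughly as follows; qnn_graph and get_graph_key_from_cgraph are stubs standing in for the real declarations:

```cpp
#include <memory>
#include <string>
#include <unordered_map>

// Stand-ins for the real types/functions used by the backend.
struct ggml_cgraph;
struct qnn_graph {
    bool build_graph_from_ggml_graph(const ggml_cgraph * cgraph);  // defined elsewhere
};
void get_graph_key_from_cgraph(const ggml_cgraph * cgraph, std::string & output);  // defined elsewhere

using qnn_graph_cache_t = std::unordered_map<std::string, std::shared_ptr<qnn_graph>>;

// Look up a cached QNN graph by key; build and cache it on a miss.
std::shared_ptr<qnn_graph> get_or_build_graph(qnn_graph_cache_t & cache, const ggml_cgraph * cgraph) {
    std::string key;
    get_graph_key_from_cgraph(cgraph, key);  // e.g. "MUL_MATf32_256x16x10f32_256x1x10f32#..."

    auto it = cache.find(key);
    if (it != cache.end()) {
        return it->second;  // cache hit: reuse the already built QNN graph
    }

    auto graph = std::make_shared<qnn_graph>();
    if (!graph->build_graph_from_ggml_graph(cgraph)) {
        return nullptr;  // unsupported graph: caller falls back to another backend
    }
    cache.emplace(key, graph);
    return graph;
}
```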

Build

For build instructions, please refer to this page.

Testing

  • Basic functionality of the QNN backend has been verified on Android, Linux, and Windows platforms using test-backend-ops; these runs are integrated into the pipeline for each commit on the dev-refactoring branch.

    | Platform | test-backend-ops | Full console output |
    | --- | --- | --- |
    | Android | (screenshot) | test-backend-ops_all_android_ff033e1.log |
    | Linux | (screenshot) | test-backend-ops_all_linux_ff033e1.log |
  • Proper graph creation and execution paths are confirmed through detailed log messages.

  • Memory registration and cleanup within tensor binding functions have been thoroughly checked.

  • The table below shows GIFs of the QNN backend running on different platforms.

    | Platform | SoC | Model | GIF | Original video |
    | --- | --- | --- | --- | --- |
    | Android | 8 Gen 2 | llama-3-8B-Instruct-Q4_K_M | Recording_Muted_hevc_14_126_640 (GIF) | Recording_Muted_hevc.mp4 |
    | Windows | To be filled | | | |

Current state

  • The test-backend-ops suite passes on all platforms, including support for both qnn-npu and qnn-gpu devices.
  • Testing with llama3.2-1b/3b-f16/32 models yields expected results.
  • Quantized matrix multiplication is under development; for quantized modules, the CPU backend may be used as a fallback.

Future development

  • Further feature support and device-specific optimizations are planned (see also the project backlog).
  • Future iterations will add support for quantization data types, with efforts underway to map GGML's block quantization structure into QNN.

zhouwg and others added 30 commits April 24, 2024 16:28
* move qnn_instance function implementation into cpp

* wip

* wip

* move dl related function into separated file

* use cast op for gpu

* Revert "use cast op for gpu"

This reverts commit 05df736.

* Reapply "use cast op for gpu"

This reverts commit 2520e59.

* fix compiling error in win

* fix align_alloc in win

* fix compiling error

* add get sys free/total mem for win

* wip

* suppress warning in win

* add missing chrono header

* set the correct qnn lib name for windows

* add flag to control cpu backend

* wip

* wip

* Revert "Reapply "use cast op for gpu""

This reverts commit f56519c.

* fix compiling error for linux build

* fix cdsprpc dynamic library name

* wip

* skip rpc load fail

* fix page_align_alloc

* suppress some warning in gcc

* wip

* reuse align to function

* more log

* add log and fix warning

* wip

* fix asan errors and memory leaks

* fix the get_io_tensors_from_graph

* improve comment

* print GGML_QNN_DEFAULT_LIB_SEARCH_PATH

* revert some unused changes

* move library search path setter into qnn module

* fix android library loading

* skip qnn_device_get_platform_info for npu emulator
@chraac chraac marked this pull request as draft February 25, 2025 07:20
@github-actions github-actions bot added the build (Compilation issues) and ggml (changes relating to the ggml tensor library for machine learning) labels Feb 25, 2025
@chraac chraac changed the title [WIP][QNN] Integrating QNN (Qualcomm AI Engine Direct) as a dedicated backend for Qualcomm NPUs [WIP]backend: Integrating QNN (Qualcomm AI Engine Direct) as a dedicated backend for Qualcomm NPUs Feb 25, 2025

zhouwg commented Feb 25, 2025

[The original version, 02/26/2025] I don't know this Chinese programmer and I'm not a member of his team. Thanks.

[The updated version, 03/01/2025] This CN programmer chraac forcefully added me to this PR's loop, so I have to make the following statement:

  1. I don't know this Chinese programmer chraac and I'm not a member of his team.
  2. I didn't provide any tech or other support to this Chinese programmer or his team.
  3. I made several cooperation invitations to this CN programmer and got no response from him.
  4. I have really had too many unpleasant engagement experiences with this CN programmer in my first, second, and third PRs.
  5. My third PR in this community can be found at: PR: Refine ggml-qnn backend (QNN, Qualcomm Neural Network, aka Qualcomm AI Engine Direct) for latest ggml, whisper.cpp, llama.cpp #12049, which I hope can get positive feedback even if it is rejected in the end. Unfortunately, this CN programmer suddenly appeared in my third PR again, fired the conflict first there, and broke my bottom line.
  6. I have no intention of getting involved in a meaningless competition between me and this CN programmer in this non-CN tech community.
  7. I admit that this CN programmer has a broad tech skillset and made good progress on this topic with a beautiful PPT, although most of the core ideas or so-called beautiful English words here come entirely from my first PR. I think I'm an open-minded programmer and I'd like to see his or his team's success in this great tech community. To avoid misunderstanding: all the original tech comes from Qualcomm; Qualcomm provides the fundamental mechanism, and we programmers use it, regardless of C style or C++ style.
  8. I personally think this CN programmer's behavior is a massive hurt to this great tech community, which is outside mainland China, because this skilled CN programmer brought many typical Chinese habits into this great tech community.
  9. I have no intention of heightening tension between me and this CN programmer; I hope he can delete his third and fourth inappropriate comments in my third PR, and then I will react accordingly.
  10. I never drop such a comment in others' PRs because I always think it's not correct, although I know how to do it. This is my first time making such inappropriate comments in someone else's PR in this great tech community, which is outside mainland China. Sorry to waste resources and time in the public community, thanks.

@chraac
Copy link
Author

chraac commented Feb 25, 2025

> I don't know this Chinese programmer and I'm not a member of his team and I'd like to see his team's success in this great community. thanks.

Yeah, just to clarify, @zhouwg is not affiliated with us, but we appreciate his support! Anyone interested in discussing QNN-related topics is very welcome to join the conversation.

}

bool qnn_graph::build_graph_from_ggml_graph(const ggml_cgraph *cgraph) {
QNN_LOG_DEBUG("[%s][%s]build start", get_backend_name(_device), _graph_name.c_str());

Here's how we map a ggml_cgraph into a QNN graph.
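Roughly speaking, the mapping walks the ggml_cgraph node by node, translates each supported op into a QNN op config, and finalizes the resulting QNN graph once. A hypothetical outline (append_qnn_node and finalize_qnn_graph are illustrative names, not the actual identifiers):

```cpp
#include "ggml.h"

// Illustrative helpers; the real logic lives in the backend's graph / op-config sources.
bool append_qnn_node(const ggml_tensor * node);   // translate one GGML op into a QNN node
bool finalize_qnn_graph();                        // finalize the QNN graph for execution

bool build_graph_sketch(ggml_cgraph * cgraph) {
    const int n_nodes = ggml_graph_n_nodes(cgraph);
    for (int i = 0; i < n_nodes; ++i) {
        const ggml_tensor * node = ggml_graph_node(cgraph, i);
        if (node->op == GGML_OP_NONE || node->op == GGML_OP_VIEW ||
            node->op == GGML_OP_RESHAPE || node->op == GGML_OP_PERMUTE) {
            continue;  // pure view/layout ops need no QNN counterpart
        }
        if (!append_qnn_node(node)) {
            return false;  // unsupported op: the whole graph falls back to another backend
        }
    }
    return finalize_qnn_graph();
}
```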

return reinterpret_cast<Fn>(dl_sym(handle, function_name));
}

} // namespace qnn

TODO: this dl_loader can be removed if upstream provides a unified dynamic loading mechanism

static dl_handle * dl_load_library(const std::wstring & path) {
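For context, a loader like this is usually a thin wrapper over dlopen/LoadLibraryW. A minimal sketch (the fork's dl_handle type and return conventions may differ):

```cpp
#include <string>

#if defined(_WIN32)
#    include <windows.h>
using dl_handle = HMODULE;

static dl_handle dl_load_library_sketch(const std::wstring & path) {
    // Returns NULL on failure; callers should check and report GetLastError().
    return LoadLibraryW(path.c_str());
}

static void * dl_sym_sketch(dl_handle handle, const char * name) {
    return reinterpret_cast<void *>(GetProcAddress(handle, name));
}
#else
#    include <dlfcn.h>
using dl_handle = void *;

static dl_handle dl_load_library_sketch(const std::string & path) {
    // RTLD_NOW resolves all symbols immediately so load errors surface early.
    return dlopen(path.c_str(), RTLD_NOW | RTLD_LOCAL);
}

static void * dl_sym_sketch(dl_handle handle, const char * name) {
    return dlsym(handle, name);
}
#endif
```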

@chraac
Copy link
Author

chraac commented Feb 25, 2025

> I didn't provide any support to @chraac and his team. as I said before: I don't know this guy and his team and I'd like to see their success in this community. thanks so much.

I'd like to rephrase my previous statement. I appreciate your earlier work, as my fork is based on your initial PR.

}

if (_rpc_buffer) {
memcpy(_rpc_buffer->get_buffer(), _buffer->get_buffer(), _buffer->get_size());


Great effort! According to the QNN Shared Memory doc, the _rpc_buffer in HTP can be directly accessed by the CPU. Maybe there can be a zero-copy implementation.

@chraac chraac Feb 25, 2025

Yeah, thank you for the reminder! Currently the RPC buffer is disabled:

    bool should_use_mem_handle() const {
        // TODO: figure out how to set rpc mem to multiple tensor
        return false;
    }

We thought we could reuse the RPC buffer for backing ggml tensors in the future, but for now it is disabled by default.
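For reference, the usual zero-copy pattern with QNN shared memory is to allocate the tensor storage through the FastRPC rpcmem API (resolved from libcdsprpc at runtime) and register it with the QNN context, so HTP and CPU see the same buffer. A rough sketch; register_mem_handle is a hypothetical stand-in for the QNN memory-registration call, whose descriptor layout depends on the SDK version:

```cpp
#include <cstddef>
#include <cstdint>

// Function-pointer types for the FastRPC shared-memory API exposed by libcdsprpc;
// the backend resolves these with dl_sym() after loading the library.
using rpcmem_alloc_fn = void * (*)(int heap_id, uint32_t flags, int size);
using rpcmem_to_fd_fn = int (*)(void * buf);

// Hypothetical zero-copy flow: allocate tensor data in shared memory, then register it
// with the QNN context so the HTP can access it without the memcpy shown above.
void * alloc_shared_tensor_buffer(rpcmem_alloc_fn rpcmem_alloc,
                                  rpcmem_to_fd_fn rpcmem_to_fd,
                                  size_t size, int heap_id, uint32_t flags,
                                  bool (*register_mem_handle)(int fd, void * addr, size_t size)) {
    void * buf = rpcmem_alloc(heap_id, flags, static_cast<int>(size));
    if (!buf) {
        return nullptr;  // allocation failed: fall back to a plain host buffer + memcpy
    }
    const int fd = rpcmem_to_fd(buf);
    if (fd < 0 || !register_mem_handle(fd, buf, size)) {
        return nullptr;  // registration failed: caller should rpcmem_free(buf) and fall back
    }
    // From here on, GGML writes into `buf` directly and QNN reads it via the registered
    // mem handle, avoiding the copy between _buffer and _rpc_buffer.
    return buf;
}
```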

@chraac chraac requested a review from oreomaker February 25, 2025 10:27
return true;
}

bool ggml_qnn_matmul_op_config::create_mat_mul_nodes(QNNBackend device, Qnn_GraphHandle_t graph_handle, const int rank,
@chraac chraac Feb 25, 2025

Here's how we create the corresponding mat_mul op; the resulting op structure looks like this:
(diagram of the generated MatMul nodes)

which follows ggml's guideline:
https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md
(excerpt from CONTRIBUTING.md describing the ggml_mul_mat convention)
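For reference, the convention from the linked CONTRIBUTING.md that the transposition handling here has to respect is:

$$C = \mathrm{ggml\_mul\_mat}(A, B) \iff C^{T} = A B^{T} \iff C = B A^{T}$$

So when lowering to QNN, one operand effectively has to be consumed in transposed form (via the MatMul op's transpose parameter or an explicit transpose node).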

output += ')';
}

void get_graph_key_from_cgraph(const ggml_cgraph *cgraph, std::string &output) {

Generates a unique key for a given ggml_cgraph. The key is constructed by concatenating the descriptions of the operations and their associated tensor dimensions within the graph.

Example key format: MUL_MATf32_256x16x10f32_256x1x10f32#LOG#ADD#ADDf32_16x1x10f32

This may need some refactoring to handle more complex graph structures and edge cases.
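A simplified sketch of how such a key can be assembled (hypothetical; the real get_graph_key_from_cgraph also handles extra cases and may format dimensions differently):

```cpp
#include <string>

#include "ggml.h"

// Append "<type>_<ne0>x<ne1>x..." for a tensor so that graphs with the same ops and
// shapes map to the same cached QNN graph.
static void append_tensor_desc(const ggml_tensor * t, std::string & out) {
    out += ggml_type_name(t->type);
    const int n_dims = ggml_n_dims(t);
    for (int i = 0; i < n_dims; ++i) {
        out += (i == 0) ? "_" : "x";
        out += std::to_string(t->ne[i]);
    }
}

static void get_graph_key_sketch(ggml_cgraph * cgraph, std::string & output) {
    const int n_nodes = ggml_graph_n_nodes(cgraph);
    for (int i = 0; i < n_nodes; ++i) {
        const ggml_tensor * node = ggml_graph_node(cgraph, i);
        if (i > 0) {
            output += "#";  // node separator, as in the example key above
        }
        output += ggml_op_name(node->op);  // e.g. "MUL_MAT"
        for (int j = 0; j < GGML_MAX_SRC && node->src[j]; ++j) {
            append_tensor_desc(node->src[j], output);
        }
    }
}
```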

* fix warning

* wip

* add todo for graph key generate

* rename some file to meet upstream guideline

* remove local .clang-format

* expend supported/unsupported counter to all ops

* append device name to log

* port to ggml logger

* fix warning after adapt to ggml logger

* append \n to all log

* use case op instead of convert

* Revert "use case op instead of convert"

This reverts commit e662fc2.

* fix op that needs same shape

* opt kQnnOpsTable

* refresh params name field when getting op config

* opt npu log print

* remove unused functions
* debug

* disable reshape

* make sure single node op have same type

* fix warning at the logger

* Revert "disable reshape"

This reverts commit 5aeca4b.
* print build type

* wip

* print compiling flags

* wip

* wip

chraac commented Mar 1, 2025

> I never drop such a comment in others' PRs; this is my first time in this great tech community, which is outside mainland China. Sorry to waste resources and time in the public community, thanks.

I notice you've edited your original post with additional information. I'd like to clarify that my intent was to address specific technical issues that have existed throughout your PR series: without implementing correct matrix transposition, the mul_mat operation cannot function properly.

And to reiterate: please focus on improving your codebase in an objective manner without making assumptions about or judging others' work.

If you have any thoughts on my source code implementation, would be very welcome! I'm open to discussion about the design, implementation details, or any other technical aspects of the code.

Collaborative feedback helps us all build better software. By sharing insights about implementation approaches, performance considerations, and edge cases, we collectively create more reliable and efficient code than any individual contributor could achieve independently. (Not gonna lie - it can be tough sometimes, but I'm all about keeping an open mind and hearing different viewpoints. Just trying my best here!)


zhouwg commented Mar 1, 2025

> I never drop such a comment in others' PRs; this is my first time in this great tech community, which is outside mainland China. Sorry to waste resources and time in the public community, thanks.

> I notice you've edited your original post with additional information. I'd like to clarify that my intent was to address specific technical issues that have existed throughout your PR series: without implementing correct matrix transposition, the mul_mat operation cannot function properly.

Similar comments have appeared again and again in my first, second, and third PRs. This is a typical Chinese PUA strategy: post false information on someone else's PR, anger the PR's author, and thereby achieve your purpose.

Btw, everyone in this community can see what happened in my first, second, and third PRs and what this CN programmer did there. I personally think such behavior is a huge hurt to this pure tech community, even though I admit this CN programmer has good tech skills.

> And to reiterate: please focus on improving your codebase in an objective manner without making assumptions about or judging others' work.

> If you have any thoughts on my source code implementation, would be very welcome! I'm open to discussion about the design, implementation details, or any other technical aspects of the code.

> Collaborative feedback helps us all build better software. By sharing insights about implementation approaches, performance considerations, and edge cases, we collectively create more reliable and efficient code than any individual contributor could achieve independently.

Such similar beautiful comments, or xxx-style propaganda (very grand and beautiful words, but behavior that is exactly the opposite), can be seen from CN media in the western world, or in my first and third PRs: beautiful and grand words, but action......

I was already blocked in this community before 02/16/2025 because of my stupid mistake last year, which was partly caused by this CN programmer in my first PR, though the main reason was my own mistake. This CN programmer has already intended to use the maintainers' hands to block me again in my third PR, so that his voice and misinformation can be seen by everyone in this tech community.

QNN_LOG_DEBUG("[%s][%s]op was unsupported, support/unsupported: %d/%d\n", qnn::get_backend_name(ctx->device),
ggml_op_name(op->op), ctx->supported_op_count.load(), ctx->unsupported_op_count.load());
}
#endif
@chraac chraac Mar 1, 2025

In our recent PR, we added a counter to track which operations are successfully offloaded to the QNN backend. While testing with the llama-3-8B-Instruct-Q4_K_M model, we found an interesting result:

(screenshot: supported/unsupported op counters)

Current Status

  • Even though quantized tensor support isn't implemented yet, many operations are still being processed by the qnn backend since they operate on F32 data
  • As shown in the screenshot, we're seeing significant operation offloading opportunities
  • However, no MUL_MAT ops are currently being offloaded to QNN, and these are critical for performance

Next Steps

Based on this analysis, I'm shifting focus a bit to implement support for additional operation types that can be offloaded from the CPU to QNN; this will provide immediate performance benefits while running models on device.
Simultaneously, I will continue investigating how to port GGML's quantization scheme to QNN; this remains a core objective for our long-term performance goals, especially for quantized models like the one used in this test.

Test method and Resources

  1. Push the LLM model to the Android device folder /data/local/tmp
  2. Run scripts/run_device_model.sh --verbose --model-name 'meta-llama_Meta-Llama-3-8B-Instruct-Q4_K_M.gguf'; run_device_model.sh can be found here

Full running log:
run_model.8b.q4.debug.log


chraac commented Mar 2, 2025

> I was already blocked in this community before 02/16/2025 because of my stupid mistake last year, which was partly caused by this CN programmer in my first PR, though the main reason was my own mistake. This CN programmer has already intended to use the maintainers' hands to block me again in my third PR, so that his voice and misinformation can be seen by everyone in this tech community.

Let's see what @slaren said in your PR:

> Hi @zhouwg. I want to clarify that the comments made by @chraac in your previous PR had no influence whatsoever in the decision to block you from participating in this repository. Technical feedback and code reviews are always welcome and even encouraged. However, you were blocked due to a consistent pattern of comments that incited personal conflict, often in response to legitimate technical feedback. The comments linked by @chraac (now removed) are an example of this behavior.

I'm focused on improving the QNN backend support and welcome technical discussions on this topic. As the maintainer noted, provoking personal conflict isn't encouraged. Comments that stray from technical feedback will not receive a response from now on.
