
Releases: NVIDIA/cudnn-frontend

cudnn FE 1.7.0 Release

23 Sep 20:53
de355c7

cudnn FE 1.7.0 Release notes:

New API

  • Kernel cache support for dynamic graphs: added new APIs to enable kernel cache support for graphs with dynamic shapes. Please refer to the documentation for API details; a usage sketch follows below.

Added examples Convolution fprop dynamic shape, CSBR Graph dynamic shape, Matmul dynamic shape, and Bias + Matmul dynamic shape to showcase the use of dynamic shapes and the kernel cache.
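
A minimal sketch of the expected flow, assuming the kernel-cache object and graph properties follow the pattern used in the dynamic-shape samples (the helper name and method names here are an assumption, not a verbatim excerpt of the release API):

#include <memory>
#include <cudnn_frontend.h>

namespace fe = cudnn_frontend;

// Hypothetical helper: share one kernel cache across graphs that differ only
// in their tensor shapes, so compiled kernels can be reused.
void enable_dynamic_shapes(fe::graph::Graph &graph,
                           std::shared_ptr<fe::KernelCache> const &kernel_cache) {
    graph.set_dynamic_shape_enabled(true)  // mark the graph as dynamically shaped
         .set_kernel_cache(kernel_cache);  // reuse compiled kernels across shapes
}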

  • Introduced two new APIs that describe a plan in terms of its engine number and knobs.
error_t
get_plan_name(std::string &name) const;

error_t
get_plan_name_at_index(int64_t plan_index, std::string &name) const;

Note:
This name can later be passed to deselect_plan_by_name if the plan runs into any errors.
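
A short usage sketch (assuming a graph that has already built its execution plan candidates; the helper name is illustrative and error handling is elided):

#include <cstdint>
#include <string>
#include <cudnn_frontend.h>

namespace fe = cudnn_frontend;

// Sketch: read back the name of the candidate plan at a given index and
// deselect it by that name if it later turns out to be problematic.
void drop_plan(fe::graph::Graph &graph, int64_t plan_index) {
    std::string plan_name;
    if (graph.get_plan_name_at_index(plan_index, plan_name).is_good()) {
        graph.deselect_plan_by_name(plan_name);  // skip this plan on the next build
    }
}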

  • Added an API to query a tensor's attributes from its UID in a graph: query_tensor_with_uid(int64_t const uid, Tensor_attributes &tensor) const;
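
A short usage sketch (the graph, UID, and helper name are illustrative; the UID is assumed to come from earlier graph construction):

#include <cstdint>
#include <cudnn_frontend.h>

namespace fe = cudnn_frontend;

// Sketch: look up the attributes (dims, strides, data type, ...) of the tensor
// that was registered under the given UID when the graph was built.
void inspect_tensor(fe::graph::Graph const &graph, int64_t uid) {
    fe::graph::Tensor_attributes tensor;
    if (graph.query_tensor_with_uid(uid, tensor).is_good()) {
        auto dims = tensor.get_dim();  // inspect the recovered attributes
    }
}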

Improvements

  • sdpa fp16 bprop node can now compute dbias when padding mask is enabled (requires cudnn 9.4.0 and above).

  • sdpa fp8 (forward and bprop) nodes now support optional bias, dropout, and padding mask (requires cudnn 9.4.0 and above).

  • Matmul fp8 node can now accept M,N,K overrides.

  • Added new python notebooks for implementing BatchNorm and BatchNorm bprop using cuDNN.

  • Updated benchmark numbers with cudnn 9.4.0 for fp16 and fp8 datatypes.

  • Fixed compilation issues when NV_CUDNN_DISABLE_EXCEPTION is enabled.

Bug fixes

  • Fixed a crash when the output dimension of dgrad node is not specified. This now returns an error message instead.

  • Fixed incorrect SDPA stats stride inferencing.

  • Fixed a bug in the sdpa test when sliding window attention is enabled and the query sequence length (s_q) is greater than the key/value sequence length (s_kv). This case is now reported as not supported.

cudnn FE 1.6.1 release

20 Aug 04:14
2533f5e

Bug fix

  • Fixed an issue where the custom dropout mask was not correctly applied.
  • Added -fvisibility=hidden for the generated pip wheels to avoid symbol conflicts with other modules that use cudnn frontend.
  • Fixed an issue in the sdpa operation that led to numerical mismatches after deserialization.
  • Fixed an issue in the sdpa fp8 fprop operation (in inference mode).

Samples

  • Added a new sample to showcase how a custom dropout mask can be applied to a sdpa operation.
  • Added a sample to showcase convolutions on large tensors (c * d * h * w > 2**31).

v1.6.0 release

12 Aug 23:17
23511ba

Release notes:

New API

  • Graph Slice Operation: Introduced the graph.slice operation for slicing input tensors. Refer to docs/operations/Slice.md for detailed documentation and samples/cpp/misc/slice.cpp for a C++ sample. Pybinds for this operation have also been added.
  • SM Carveout Feature: Added the set_sm_count(int32_t type) graph property to support the SM Carveout feature introduced in Ampere and Hopper GPUs. Engines that do not support SM_COUNT will return NOT_SUPPORTED.
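
A minimal sketch of setting the SM carveout on a graph (the helper name and the SM count value are illustrative):

#include <cudnn_frontend.h>

namespace fe = cudnn_frontend;

// Sketch: limit how many SMs the selected engines may use. Engines that do not
// support SM_COUNT will return NOT_SUPPORTED for such a graph.
void limit_sm_usage(fe::graph::Graph &graph) {
    graph.set_sm_count(64);  // illustrative value; tune per workload and GPU
}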

Bug Fixes

  • Convolution Mode Attribute: Added the missing set_convolution_mode attribute to convolution attributes in forward propagation (fprop), data gradient (dgrad), and weight gradient (wgrad). Previously, this was hardcoded to CUDNN_CROSS_CORRELATION in the 1.x API.
  • SDPA FP8 Backward Node: Fixed an issue with the deserialization of the sdpa_fp8_backward node.

Enhancements

  • Graph Execution Overhead: Reduced the overhead of graph.execute() by optimizing sub-node tree traversal, collected UIDs, workspace modifications, and workspace size.
  • Graph Validation Performance: Significantly improved (~10x) the performance of graph.validate() by deferring graph expansion to a later stage (build_operation_graph).
  • Optional Running Stats for BatchNorm: Made the running statistics for the batch normalization operation optional, supported by cuDNN backend version 9.3.0 and later.
  • Shape and Stride Inferencing: Enhanced shape and stride inferencing to preserve the stride order of the input.
  • Diagnostic Error Message: Added a diagnostic error message to create_execution_plans if called without the preceding build_operation_graph.
  • JSON Schema and Deserialization: Improved the JSON schema and deserialization logic with additional checks.
  • Logging Overhead: Reduced logging overhead, resulting in faster graph.build() calls.
  • CMake Integration: Replaced CMAKE_SOURCE_DIR with PROJECT_SOURCE_DIR in CMake files for better integration. See the relevant pull request for more details.

Samples

  • Jupyter Notebooks: Added Jupyter notebooks for RMSNorm, InstanceNorm, and LayerNorm. Refer to the samples/python folder for more information.

v1.5.2 release

25 Jun 23:18
98ca4e1

[Enhancement] Allow a stride value of 0 to indicate that the tensor is repeated (broadcast) along that dimension.
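
A minimal sketch, assuming a bias tensor whose values should be repeated along the N, H, and W dimensions (the helper name, graph object, and dimensions are illustrative):

#include <cstdint>
#include <cudnn_frontend.h>

namespace fe = cudnn_frontend;

// Sketch: a bias with logical shape [N, C, H, W] backed by only C distinct
// values; a stride of 0 in N, H, and W tells cudnn to repeat the data along
// those dimensions.
auto make_broadcast_bias(fe::graph::Graph &graph,
                         int64_t N, int64_t C, int64_t H, int64_t W) {
    return graph.tensor(fe::graph::Tensor_attributes()
                            .set_name("bias")
                            .set_dim({N, C, H, W})
                            .set_stride({0, 1, 0, 0})  // 0-stride = repeat along N, H, W
                            .set_data_type(fe::DataType_t::HALF));
}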

v1.5.1 release

18 Jun 01:48
aa3abd4


[Bug fix] Fixed an issue where cudnn-frontend 1.5.0, when built with cudnn version 9.1.1 or below, ran into issues when run with cudnn 9.2.0 and above.

v1.5.0 release

13 Jun 08:48
47d800c

[New feature] With cudnn backend 9.2.0 and above, Graph::check_support can determine support for runtime-compiled engines without invoking the nvrtc compiler. This allows users to check the support surface of cudnn without paying the cost of nvrtc compilation.
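
A short sketch of the call this applies to (handle creation and graph construction are assumed to have happened earlier; the helper name is illustrative):

#include <cudnn.h>
#include <cudnn_frontend.h>

namespace fe = cudnn_frontend;

// Sketch: with backend 9.2.0+, this check can reject unsupported runtime
// engines without first compiling them through nvrtc.
bool is_supported(fe::graph::Graph &graph, cudnnHandle_t handle) {
    return graph.check_support(handle).is_good();
}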

[New feature] Python pip wheel now contains the necessary c++ development headers.

[New feature] Sliding window attention is now supported as an attribute to the sdpa forward and bprop node. Usage:
sdpa_attributes.set_sliding_window_length(window_length)

[New feature] Bottom right aligned causal masking is now supported as an attribute to the sdpa forward and bprop node. Usage: sdpa_attributes.use_causal_mask_bottom_right(true)

[New feature] SDPA bprop attributes can choose deterministic algorithm using the use_deterministic_algorithm API.
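
A combined sketch of the three new SDPA attribute calls (the helper name, attribute objects, and window length are assumed to come from the surrounding graph setup; whether a given combination is supported still depends on check_support):

#include <cstdint>
#include <cudnn_frontend.h>

namespace fe = cudnn_frontend;

// Sketch: apply the new SDPA options named in this release.
void configure_sdpa(fe::graph::SDPA_attributes          &fwd_attributes,
                    fe::graph::SDPA_backward_attributes &bwd_attributes,
                    int64_t window_length) {
    fwd_attributes.set_sliding_window_length(window_length);  // sliding window attention
    fwd_attributes.use_causal_mask_bottom_right(true);        // bottom-right aligned causal mask
    bwd_attributes.use_deterministic_algorithm(true);         // deterministic bprop algorithm
}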

[New feature] Allow users to filter a graph's candidate execution plans by their shared memory usage, in cudnn 9.2.0 and later.

[Bug fix] Fixed a runtime error that occurred when the chosen execution plan candidate was incorrectly set in the backend. This would happen when check_support did not correctly filter by workspace size.

[Bug fix] Selecting/deselecting engines by behavior and numerical notes has been fixed and now works as intended.

[Debugging] A new tool for easily reproducing a failure from the json representation of a graph has been added.

[Samples] Restructured the cpp samples into categories for easier navigation.

[Samples] Added a sample to showcase how different plans can be built in parallel in separate threads.

[Compilation enhancement] Added a new macro, CUDNN_FRONTEND_SKIP_NLOHMANN_JSON, as a compilation flag to remove the nlohmann::json compilation dependency. Users lose access to certain API functions, such as print, key, serialize, and deserialize, that depend on the library.
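
A minimal illustration of applying the flag; defining the macro before the header is included (typically via -DCUDNN_FRONTEND_SKIP_NLOHMANN_JSON on the compile line) is an assumption about the usual pattern for such flags:

// Build without the nlohmann::json dependency. With this macro defined, the
// JSON-based helpers (print, key, serialize, deserialize) are unavailable.
#define CUDNN_FRONTEND_SKIP_NLOHMANN_JSON
#include <cudnn_frontend.h>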

[Enhancement] Serialization of resample operation is now supported.

[Enhancement] A bug template has been added for new GitHub issues.

v1.4.0 release

07 May 16:54
b740542

[New] Added a benchmark folder which contains a sample Dockerfile to compare the cudnn implementation of sdpa with the PyTorch implementation.

[Enhancement] Once an engine is de-selected by name, it will not be built as part of check support.

[Enhancement] The cudnn backend search order for the wheels is as follows: (a) dlopen libcudnn.so.MAJOR_VERSION from the site packages; (b) try to dlopen the unversioned libcudnn.so. This way the PyPI cudnn package nvidia-cudnn-cu* gets priority over the default search path.

[Enhancement] Allow an embedding dimension of up to 256 (previously limited to 128) in the sdpa fprop operation.

[Bug fix] Updated the scale and bias shapes in the batch norm sample.

v1.3.0 release

10 Apr 17:51
1b0b5ea

[New API] Added new operations sdpa_fp8_forward and sdpa_fp8_backward to perform scaled dot product attention on fp8 tensors. See more details in docs/operations/Attention.md and the cpp sample in samples/cpp/mha.cpp. Pybinds for the fp8 nodes are also added.

[New API] Added a new operation for forward resampling. Added a new sample, samples/cpp/resample.cpp, to show its usage.

[New API] Added a new API deselect_engines(std::vector<std::string> const &engine_names) which blocks certain engine configs from running.

[New API] Added new APIs select_numeric_notes and select_behavior_notes to allow users to select engine configs that have the specified numeric and behavior notes, respectively.
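
A short sketch of combining these filters; the helper name and engine-config string are placeholders, and the TENSOR_CORE note value is an assumption about the frontend's numerical-note enum:

#include <string>
#include <vector>
#include <cudnn_frontend.h>

namespace fe = cudnn_frontend;

// Sketch: restrict which engine configs are considered before plans are built.
void filter_engines(fe::graph::Graph &graph) {
    graph.deselect_engines({std::string("eng0_k1=0_k2=1")});         // placeholder engine name
    graph.select_numeric_notes({fe::NumericalNote_t::TENSOR_CORE});  // assumed note enum value
}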

[Python API] Added a custom exception cudnnGraphNotSupportedException to the python API to distinguish graphs that are genuinely not supported from programming errors.

[Python API] Added a new backend_version_string which returns the backend version in canonical form (e.g. 9.1.0) instead of a version number.

[Bug Fix] Fixed issues with compilation on clang19 and c++20 standard.

[Bug Fix] Updated the workspace computation for the sdpa fprop node. Previously, workspace was calculated for alibi slopes irrespective of whether the alibi mask was enabled.

[Bug Fix] Fixed deserialization of fused scalars.

v1.2.1

20 Mar 03:04
e5fb0ed

v1.2.1 release:

[Bug Fix] cudnn-frontend pip wheels will now dlopen the fully versioned library (libcudnn.so.8 or libcudnn.so.9) before trying to load libcudnn.so. This means the pip wheels in the RUN_PATH will be prioritized over system paths (the default behavior of dlopen). This can be overridden by setting LD_LIBRARY_PATH. Source installation will now automatically look for cudnn in site packages before the system path.

[Documentation] Fixed the google-colab links in the jupyter notebooks.

[Documentation] Added a Jupyter notebook sample, 00_introduction.ipynb, that goes over the basics of the cudnn FE graph API.

v1.2.0

12 Mar 19:26
b780db8

[New artifacts] Pre-built (alpha version) pip-installable wheels for Linux will be made available as part of this release. The pip wheels are compatible with Python 3.8 through 3.12. Source builds will continue to work as expected.

[Documentation] We are updating our contribution policy and will be accepting small PRs targeting improvements to the cudnn-frontend. For the full contribution guide, refer to our contribution policy.

[API updates] [Python] The graph.execute function in python now takes an optional handle. This lets users provide a custom handle to the execute function (and achieves parity with the C++ API).

[API updates] Pointwise ops can now take scalars directly as an argument. This simplifies the graph creation process in general. For example:

auto C = graph.pointwise(A,
                         graph.tensor(5.0f),
                         fe::graph::Pointwise_attributes()
                             .set_mode(fe::PointwiseMode_t::ADD)
                             .set_compute_data_type(fe::DataType_t::FLOAT));

[Installation] Addresses RFE #64 to provide installation as cmake install

[Installation] Addresses RFE #63 to provide custom installation of catch2. If catch2 is not found, cudnn frontend fetches it automatically from the upstream GitHub repository.

[Logging] Improved logging to print legible tensor names. We will be working on further improvements in future releases to make the logging more streamlined.

[Samples] Added a sample showcasing auto-tuning to select the best plan among those returned from heuristics.

[Samples] As part of the v1.2 release, we have created new Jupyter notebooks showcasing the python API usage. At this point, these will work on A100 and H100 cards only, as mentioned in the notebooks. With future releases, we plan to simplify the installation process and elaborate on the API usage. Please refer to the samples/python directory.

[Bug fixes] Fixed an auto-tuning issue where plan 0 was always executed even though a different plan had been chosen as the best candidate.

[Unit Tests] We are adding unit tests which will provide a way for developers to test parts of their code before submitting pull requests. It is highly encouraged to add unit tests and samples before submitting a pull request.

Note on source installation of python bindings:
On Ubuntu 22.04 and other Debian-based systems, when installing outside a virtual environment, set DEB_PYTHON_INSTALL_LAYOUT=deb_system. See the related issue.