Release notes for cudnn-frontend 1.5.0: (#81)

[New feature] With cudnn backend 9.2.0 and above, `Graph::check_support`
can determine support for runtime engines without invoking the nvrtc
compiler. This allows users to check the support surface of cudnn
without triggering nvrtc compilation.

[New feature] The Python pip wheel now contains the necessary C++
development headers.

[New feature] Sliding window attention is now supported as an attribute
on the sdpa forward and bprop nodes. Usage:
`sdpa_attributes.set_sliding_window_length(window_length)`

[New feature] Bottom-right aligned causal masking is now supported as an
attribute on the sdpa forward and bprop nodes. Usage:
`sdpa_attributes.use_causal_mask_bottom_right(true)`

[New feature] SDPA bprop attributes can opt into a deterministic
algorithm via the `use_deterministic_algorithm` API.

[New feature] Users can now filter a graph's candidate execution plans
by their shared memory usage with cudnn 9.2.0 and later.

[Bug fix] Fixed a runtime error that occurred when the chosen execution
plan candidate was incorrectly set in the backend. This happened when
`check_support` did not correctly filter by workspace size.

[Bug fix] Selecting/deselecting plans by behavior and numerical notes
has been fixed and now works as intended.

[Debugging] A new tool for easily reproducing a failure from the json
representation of a graph can be found [here](tools/json_reproducer).

[Samples] Restructured the cpp samples into categories for easier
navigation.

[Samples] Added a sample to showcase how different plans can be built in
parallel in separate threads.
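
As a hedged sketch of that idea (this is not the shipped sample; it assumes the graph has already passed `check_support` and that sharing one handle across building threads is acceptable, with the sample as the authoritative reference):

```
// Sketch: build every candidate execution plan in its own thread.
#include <cudnn_frontend.h>
#include <thread>
#include <vector>

void build_all_plans_in_parallel(cudnn_frontend::graph::Graph& graph, cudnnHandle_t handle) {
    std::vector<std::thread> workers;
    for (int64_t i = 0; i < graph.get_execution_plan_count(); ++i) {
        workers.emplace_back([&graph, handle, i]() {
            // Errors are ignored in this sketch; a real caller should inspect the returned error_t.
            (void)graph.build_plan_at_index(handle, i);
        });
    }
    for (auto& w : workers) {
        w.join();
    }
}
```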

[Compilation enhancement] Added a new macro
`CUDNN_FRONTEND_SKIP_JSON_LIB` as a compilation flag to drop the
nlohmann::json compilation dependency. Users lose access to certain
API functions such as `print`, `key`, `serialize`, and `deserialize` that
depend on the library.
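
For example, a project that sets this flag can fence off its own json-dependent helpers with the same macro (a sketch; only `print` is shown, and the macro is assumed to be defined project-wide through the CMake option):

```
// Sketch: keep json-dependent helpers out of builds that skip nlohmann::json.
#ifndef CUDNN_FRONTEND_SKIP_JSON_LIB
void dump_graph(cudnn_frontend::graph::Graph& graph) {
    graph.print();  // depends on nlohmann::json
}
#endif
```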

[Enhancement] Serialization of resample operation is now supported.

[Enhancement] A bug template has been added for new GitHub issues.

Anerudhan authored Jun 13, 2024
1 parent d7ccb5b commit 47d800c
Showing 112 changed files with 5,033 additions and 2,443 deletions.
6 changes: 3 additions & 3 deletions CMakeLists.txt
@@ -1,8 +1,8 @@
cmake_minimum_required(VERSION 3.17)

project(cudnn_frontend VERSION 1.4.0)
project(cudnn_frontend VERSION 1.5.0)

option(CUDNN_FRONTEND_SKIP_NLOHMANN_JSON "Defines whether FE should not include nlohmann/json.hpp." OFF)
option(CUDNN_FRONTEND_SKIP_JSON_LIB "Defines whether FE should not include nlohmann/json.hpp." OFF)
option(CUDNN_FRONTEND_BUILD_SAMPLES "Defines if samples are built or not." ON)
option(CUDNN_FRONTEND_BUILD_UNIT_TESTS "Defines if unittests are built or not." ON)

@@ -18,7 +18,7 @@ add_library(cudnn_frontend INTERFACE)

target_compile_definitions(
cudnn_frontend INTERFACE
$<$<BOOL:${CUDNN_FRONTEND_SKIP_NLOHMANN_JSON}>:CUDNN_FRONTEND_SKIP_NLOHMANN_JSON>
$<$<BOOL:${CUDNN_FRONTEND_SKIP_JSON_LIB}>:CUDNN_FRONTEND_SKIP_JSON_LIB>
)

target_include_directories(
40 changes: 21 additions & 19 deletions README.FE.1.0.md
@@ -12,6 +12,11 @@
The FE v1.0 API aims to extend the functionality and usage exposed by the [cuDNN C backend API](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnn-backend-api). Both C++ and Python APIs are provided, and the two have functional parity.
For a general introduction to FE, please start with README.md.

In the frontend v1 API, you can describe multiple operations that form subgraphs through a persistent cudnn_frontend::graph::Graph object. Unlike the frontend v0.x API, you don't have to worry about specifying shapes and sizes of the intermediate virtual tensors. The frontend v1 API extends the groundwork of earlier versions and introduces a new set of APIs to further simplify the workflow.

Additionally, the frontend v1 API provides Python bindings for the entire API. Refer to samples/cpp and samples/python for more details on its usage.
With the release of v1, we are bumping up the minimum supported cuDNN version to 8.5.0.

## Workflow
The steps involved in building and running a cudnn graph are as follows:
1. Create a cudnn graph and specify the global properties. The global properties like compute precision and input/output data type help infer properties that are not explicitly mentioned.
@@ -20,10 +25,10 @@ The steps involved in building and running a cudnn graph are as follows:
4. Validate the operation graph. This step makes sure the graph is well built and does not have hanging tensors or nodes.
5. Build the cudnn operation graph. This step lowers the graph into cudnn dialect.
6. Create the execution plan, based on the heuristics type of your choice.
7. [Optional] Check support of the operation graph.
7. Check support of the operation graph.
8. [Optional] Filter out the plans by your custom criteria.
9. Build (one or all) the execution plans.
10. [Optional] Run autotuning on the filter plan (Optional).
10. [Optional] Run autotuning on the filtered plan.
11. Execute the graph with the relevant data pointers.
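
The same steps, condensed into one C++ sketch (a pointwise add stands in for a real subgraph; the shapes, heuristic mode, and reduced error handling are illustrative assumptions, not the only way to drive the API):

```
#include <cudnn_frontend.h>
#include <memory>
#include <unordered_map>

namespace fe = cudnn_frontend;

bool run_single_op_graph(cudnnHandle_t handle, void* a_ptr, void* b_ptr, void* c_ptr, void* workspace) {
    // Steps 1-3: create the graph, set global properties, define tensors and an operation.
    fe::graph::Graph graph;
    graph.set_io_data_type(fe::DataType_t::HALF)
        .set_intermediate_data_type(fe::DataType_t::FLOAT)
        .set_compute_data_type(fe::DataType_t::FLOAT);

    auto A = graph.tensor(fe::graph::Tensor_attributes().set_name("A").set_dim({8, 32, 32}).set_stride({32 * 32, 32, 1}));
    auto B = graph.tensor(fe::graph::Tensor_attributes().set_name("B").set_dim({8, 32, 32}).set_stride({32 * 32, 32, 1}));
    auto C = graph.pointwise(A, B, fe::graph::Pointwise_attributes().set_mode(fe::PointwiseMode_t::ADD));
    C->set_output(true);

    // Steps 4-9: validate, lower into the cudnn dialect, query heuristics, check support, build plans.
    if (!graph.validate().is_good()) return false;
    if (!graph.build_operation_graph(handle).is_good()) return false;
    if (!graph.create_execution_plans({fe::HeurMode_t::A}).is_good()) return false;
    if (!graph.check_support(handle).is_good()) return false;
    if (!graph.build_plans(handle).is_good()) return false;

    // Step 11: execute with device pointers for each non-virtual tensor; `workspace` must hold
    // at least graph.get_workspace_size() bytes.
    std::unordered_map<std::shared_ptr<fe::graph::Tensor_attributes>, void*> variant_pack = {
        {A, a_ptr}, {B, b_ptr}, {C, c_ptr}};
    return graph.execute(handle, variant_pack, workspace).is_good();
}
```

Steps 8 and 10 (plan filtering and autotuning) are optional and are described in the sections below.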

## APIs
@@ -48,7 +53,7 @@ FE v1.0 API follows a functional style of building a graph. Operations take in i
| [Scale dot product attention FP8](docs/operations/Attention.md) | sdpa_fp8<br> SDPA_fp8_attributes | sdpa_fp8 |
| [Scale dot product attention backward FP8](docs/operations/Attention.md) | sdpa_fp8_backward<br> SDPA_fp8_backward_attributes | sdpa_fp8_backward |

### Create Graph
### Creating the Graph
Instantiate an object of class `cudnn_frontend::graph::Graph` which will house tensors and operations.

Optional graph level attributes can be set on the object:
@@ -71,53 +76,53 @@ Tensor attributes is a lightweight structure with setters for each attribute.
- `cudnn_frontend::graph::Tensor_attributes& set_reordering_type(cudnn_frontend::TensorReordering_t)`
- `cudnn_frontend::graph::Tensor_attributes& set_name(std::string&)`
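
For instance, the setters chain directly on a temporary `Tensor_attributes` object that is then registered on the graph (a sketch that assumes an existing `graph`; the shape and strides are placeholders):

```
namespace fe = cudnn_frontend;

// Sketch: describe a half-precision 4-D tensor purely through attribute setters.
auto X = graph.tensor(fe::graph::Tensor_attributes()
                          .set_name("X")
                          .set_data_type(fe::DataType_t::HALF)
                          .set_dim({8, 64, 56, 56})
                          .set_stride({64 * 56 * 56, 1, 64 * 56, 64}));
```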

### Define Operations
### Defining Operations
Operations take in mandatory input tensors via positional arguments. Optional input tensors are provided using the corresponding setters in the operation attributes.

Operations return an ordered array of output tensors. Any optional outputs, if not present, will have their shared pointers set to `nullptr`.

Please look at the [operations](#Operations) section for more details.

### Validate graph
### Validating the Graph
The validate API ensures the API usage is sound and checks against dangling tensors, etc.
Internally, any unspecified properties like dimensions, strides, etc. are inferred.

```
cudnn_frontend::error_t cudnn_frontend::graph::Graph::validate()
```

### Build cudnn backend graph
### Building the Backend Graph
This method creates cudnn backend descriptors for all constituents of the graph.

```
cudnn_frontend::error_t cudnn_frontend::graph::Graph::build_operation_graph(cudnnHandle_t handle)
```

### Create Execution plans
### Creating the Execution Plan
This method internally queries the heuristics for engine configs for the given heuristics modes.

```
cudnn_frontend::error_t cudnn_frontend::graph::Graph::get_execution_plans(std::vector<heur_mode_t>)
```

### Get execution plan count
### Getting the Execution Plan Count
This method returns the number of execution plans returned by cudnn heuristics. Each plan gets an index from 0 to #plans-1, with 0 having top priority.

```
cudnn_frontend::int64_t
cudnn_frontend::Graph::get_execution_plan_count() const;
```

### Check graph support
### Checking Graph Support
This method guarantees that executing the graph using plans queried will succeed.

```
cudnn_frontend::error_t cudnn_frontend::graph::Graph::check_support(cudnnHandle_t h);
```

### Build plans
### Building the Execution Plan

This function builds execution plans queried with `create_execution_plan(...)`` API.
This function builds execution plans queried with `create_execution_plan(...)` API.

There are two flavours of this API:

@@ -140,10 +145,7 @@ cudnn_frontend::Graph::build_plan_at_index(
int64_t plan_index
);
```



### Filter plans (optional)
### Filtering Plans (Optional)
Users can filter plans by their numerical or behavioral notes, or deselect plans that do not provide the desired functional correctness.

```
@@ -155,15 +157,15 @@ cudnn_frontend::graph::Graph& cudnn_frontend::graph::Plans::deselect_behavior_no
cudnn_frontend::graph::Graph& cudnn_frontend::graph::Plans::deselect_workspace_greater_than(int64_t const workspace);
```
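
For example, filters can be chained on the graph before plans are built (a sketch; `BehaviorNote_t::RUNTIME_COMPILATION` and the exact argument types are assumed spellings):

```
// Sketch: skip engines that need runtime compilation and any plan whose
// workspace exceeds 64 MiB, then call build_plans(...) on what remains.
graph.deselect_behavior_notes({cudnn_frontend::BehaviorNote_t::RUNTIME_COMPILATION})
     .deselect_workspace_greater_than(64 * 1024 * 1024);
```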

### Autotune
### Autotuning

Autotuning provides a way to execute different execution plans for a given graph and measure their relative performance under runtime conditions.
This generally helps validate and improve upon the results provided by the heuristics. Please refer to the [autotuning sample](samples/cpp/autotuning.cpp).
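
A sketch of the basic loop follows (CUDA-event timing; `execute_plan_at_index` and the default-stream assumption are not confirmed here, so treat the linked sample as authoritative):

```
// Sketch: time each built plan and return the index of the fastest one.
#include <cudnn_frontend.h>
#include <cuda_runtime.h>
#include <limits>
#include <memory>
#include <unordered_map>

using VariantPack = std::unordered_map<std::shared_ptr<cudnn_frontend::graph::Tensor_attributes>, void*>;

int64_t autotune(cudnn_frontend::graph::Graph& graph, cudnnHandle_t handle,
                 VariantPack& variant_pack, void* workspace) {
    int64_t best_index = -1;
    float best_ms = std::numeric_limits<float>::max();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int64_t i = 0; i < graph.get_execution_plan_count(); ++i) {
        if (!graph.build_plan_at_index(handle, i).is_good()) continue;

        cudaEventRecord(start);  // assumes the handle submits work to the default stream
        // Assumption: executing a specific plan by its index.
        bool const ok = graph.execute_plan_at_index(handle, variant_pack, workspace, i).is_good();
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        if (!ok) continue;

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < best_ms) {
            best_ms = ms;
            best_index = i;
        }
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return best_index;
}
```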

### Execute
Executing graph requires device pointers to all input output tensors and a user allocated device workspace pointer.
### Executing the Graph
Executing the graph requires device pointers to all input/output tensors and a user-allocated device workspace pointer.

Two flavours of execute exists, corresponding to `build_plans(...)`` API.
Two flavours of execute exist, corresponding to the `build_plans(...)` API.

This API already has a candidate execution plan set. The candidate execution plan gets set internally either:
- if build_policy_t::HEURISTIC_CHOICE is used, or
32 changes: 15 additions & 17 deletions README.md
@@ -15,7 +15,9 @@ In FE v1.0 API, users can describe multiple operations that form subgraph throug
Additionally, FE v1.0 API provides python bindings to all API through pybind11. It is recommended that new users of cuDNN start with the frontend v1.0 API. See `samples/cpp` and `samples/python` for more details on its usage.

## Usage
In order to include the entire library, include the cudnn_frontend header file `include/cudnn_frontend.h` into your compilation unit.
For c++ users, in order to include the entire library, include the cudnn_frontend header file `include/cudnn_frontend.h` into your compilation unit.

For Python users, run `import cudnn`

## Build:

@@ -31,33 +33,30 @@ cudnn can be installed from
The minimum Python version needed is 3.6.
Compiling the Python bindings requires the Python development package, which can be installed by running `apt-get install python-dev`.

To run the python samples, additionally, you will need the following python packages:
- pytest
- torch
- jupyter

To run the Python samples, you will need the dependencies listed in `requirements.txt`. These can be installed by running:
`pip install -r requirements.txt`

### Python API

#### pip wheel installation

Download the pip wheel corresponding to your python installation.

```
pip install nvidia_cudnn_frontend
```

#### Source installation:
Install FE python API by running:
```
pip install git+https://github.com/NVIDIA/cudnn-frontend.git
pip install -v git+https://github.com/NVIDIA/cudnn-frontend.git
```

The above command picks up CUDA and cuDNN from the default system paths.

To provide a custom CUDA installation path, use the environment variable `CUDAToolkit_ROOT`.
To provide a custom CUDNN installation path, use the environment variable `CUDNN_PATH`.

#### pip wheel installation

Download the pip wheel corresponding to your python installation.

```
pip install nvidia_cudnn_frontend-1.2.0-*.whl
```

#### Checking the installation
To test whether installation is successful, run:
```
@@ -66,15 +65,14 @@ pytest test/python_fe

NOTE: Only v1.0 API is exposed via python bindings.


### C++ API

The C++ API is a header-only library.

The root CMakeLists.txt can be used as reference to include the cudnn_frontend in your project's build system.

#### Building samples
The following compilation steps are only required for building the samples and/or python bindings.
The following compilation steps are only required for building the samples.

Provide CUDA installation path according to: https://cmake.org/cmake/help/latest/module/FindCUDAToolkit.html

