-
Notifications
You must be signed in to change notification settings - Fork 99
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use CRS matrix sort, instead of Kokkos::sort on each row #1553
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me
Status Flag 'Pre-Test Inspection' - Auto Inspected - Inspection Is Not Necessary for this Pull Request. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@brian-kelley Looks good to me. Thanks for fixing this.
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING (click to expand)Build InformationTest Name: KokkosKernels_PullRequest_GCC930_Light_Tpls_GCC930
Jenkins Parameters
Build InformationTest Name: KokkosKernels_PullRequest_CUDA11_CUDA11_LayoutRight
Jenkins Parameters
Build InformationTest Name: KokkosKernels_PullRequest_GCC1020
Jenkins Parameters
Build InformationTest Name: KokkosKernels_PullRequest_GCC1020_Light_LayoutRight
Jenkins Parameters
Build InformationTest Name: KokkosKernels_PullRequest_Tpls_GCC720
Jenkins Parameters
Build InformationTest Name: KokkosKernels_PullRequest_Tpls_INTEL19
Jenkins Parameters
Build InformationTest Name: KokkosKernels_PullRequest_CLANG1001
Jenkins Parameters
Build InformationTest Name: KokkosKernels_PullRequest_CLANG13CUDA10
Jenkins Parameters
Using Repos:
Pull Request Author: brian-kelley |
Status Flag 'Pull Request AutoTester' - Jenkins Testing: all Jobs PASSED Pull Request Auto Testing has PASSED (click to expand)Build InformationTest Name: KokkosKernels_PullRequest_GCC930_Light_Tpls_GCC930
Jenkins Parameters
Build InformationTest Name: KokkosKernels_PullRequest_CUDA11_CUDA11_LayoutRight
Jenkins Parameters
Build InformationTest Name: KokkosKernels_PullRequest_GCC1020
Jenkins Parameters
Build InformationTest Name: KokkosKernels_PullRequest_GCC1020_Light_LayoutRight
Jenkins Parameters
Build InformationTest Name: KokkosKernels_PullRequest_Tpls_GCC720
Jenkins Parameters
Build InformationTest Name: KokkosKernels_PullRequest_Tpls_INTEL19
Jenkins Parameters
Build InformationTest Name: KokkosKernels_PullRequest_CLANG1001
Jenkins Parameters
Build InformationTest Name: KokkosKernels_PullRequest_CLANG13CUDA10
Jenkins Parameters
|
Status Flag 'Pre-Merge Inspection' - SUCCESS: The last commit to this Pull Request has been INSPECTED AND APPROVED by [ vqd8a lucbv ]! |
Status Flag 'Pull Request AutoTester' - Pull Request MUST BE MERGED MANUALLY BY Project Team - This Repo does not support Automerge |
* origin/develop: (177 commits) cm_test_all_sandia: fix base module list cm_test_all_sandia: updates and cleanup Remove src subdir TeamGemvInternal: Update headers and namespace in KokkosBatched_Gemv_Decl.hpp apply clang-format change to device_type scripts/cm_test_all_sandia: Update openblas tpl sparse/unit_test: Disable spmv_mv_heavy for all A64FX builds Use CRS matrix sort, instead of Kokkos::sort on each row (kokkos#1553) Avoid the SIMD code branch if the batched size is not a multiple of the vector length SYCL: Fix linking with ze_loader in Trilinos fix the failing tests cm_test_all_sandia: testing updates Fix warnings Use static_cast<double> instead of 1. * Update the batched sparse kernels to add SIMD data types and options for the temporary variable allocations. Update default argument ROCm version 4.5 -> 5.2 in Docker image Fix HIP nightly build to satisfy new Kokkos minimum requirements blas: apply clang-format-8 ...
spiluk_symbolic needs to sort the rows of the L and U graphs. Instead of using Kokkos::sort on each row (since this is meant for device-wide sorts of large arrays), use the KokkosSparse::sort_crs_graph function executing on host.
This fixes trilinos/Trilinos#11041. The issue appeared to come from Kokkos because the default behavior of Kokkos::sort changed recently. In 3.6.1, given a UVM space view it dispatched to std::sort because it's host-accessible. In 3.7, In was dispatching to Kokkos bin sort on the GPU. This is definitely the way it should be in general given how Kokkos::sort is intended to be used, but it made this case of sorting many small arrays a lot slower. Especially when those small arrays live in UVM that's resident in host memory.
For biggest local matrix on abnormal energy surrogate problem (the one from Trilinos issue), here are the timings just for sorting L and U a few different ways:
Kokkos::sort each row, with force kokkos=true (current behavior): 169.452
My radix sort_crs_matrix, all on Serial/host (this PR): 3.42121
My bitonic sort_crs_matrix on V100, but not including the transfer back to host: 3.3849
std::sort each row in a normal for loop: 1.92318
This data suggests a follow-up change to sort_crs_graph, that uses std::sort instead of my radix sort where possible. That can come later though - this PR fixes the big slowdown.