[cudapoa] improving cudaPOA performance #552

r-mafi · 2020-09-01T22:49:33Z

A round of optimizations: reducing register usage, hiding global memory access latency where applicable, and reducing the number of iterations of the inner NW while loop, in an effort to improve compute time and SOL metrics.
In the cudapoa binary API, added a new option -s to manage the memory allocated for the adaptive score matrix.

r-mafi added 30 commits August 7, 2020 10:40

[cudapoa] moved initializing first column scores inside the main loop-rev 0

1068d16

[cudapoa] avoid recomputing pred_idx- rev1

d18be7d

[cudapoa-optimization] avoid computing first column score for nodes with 1 predecessor and band_start > 4

2dd0da8

[cudapoa-optimization] missed from previous commit!-rev2

af2f2c5

[cudapoa-optimization] using previous computed score for predecessor - rev3

03dabac

Revert "[cudapoa-optimization] using previous computed score for predecessor - rev3". The rev3 change did not optimize much and in some cases could even slow things down a bit, so it was reverted. This reverts commit 833ac3ad

10ba907

Merge branch 'dev-v0.5.0' of https://github.com/clara-parabricks/ClaraGenomicsAnalysis into cudapoa_optimization

2c3b651

[cudapoa-optimization] moving pred_node_id up to see if that will reduce long_score_board-rev 4

cfcba8b

[cudapoa-optimization] slight change in topsort

b46d7c0

[cudapoa] added a new option '-s' to determine size of adaptive score matrix allocation

21a17d5

[cudapoa] more work on new option '-s' to determine size of adaptive score matrix allocation

512dde2

[cudapoa] improved some cerr messages in cudapoa-bin and sample_cudapoa.

2c3abfc

[cudapoa] revised binning strategy. This change has no effect on cases where POA groups have the same number of reads- rev 4b

03dfe39

[cudapoa] reorder updating scores in NW while loop to reduce while loop iterations- rev 5

041e465

[cudapoa] added __align__ to custom data type Score4; for some reason adding __align__ to Seq4T fails though

43f3456

[cudapoa] prefetch node_id in backtracking loop- rev 6

9a0d94b

[cudapoa-optimization] minor changes, with no impact on perf.

5483750

[cudapoa-optimization] minor changes, with no impact on perf.

36c205e

[cudapoa-optimiztion] made MSA template arg

cec037e

[cudapoa-optimiztion] made BANDED template arg, reg count down to 84

300c563

[cudapoa-optimiztion] a small change, reducing registers by 1! :)

f6ac922

[cudapoa-optimiztion] made banding mode template, now the path of adaptive, static and full alignments are separate, reg count for static down to 71 from 83; rev 7

68b3dc3

[cudapoa-optimiztion] started minimizing register count for adaptive, by changing banded_score_matrix_size from int64_t to float, reduced 1 register! :) (from 79 to 78)

522a825

[cudapoa-optimiztion] similar to changes in nw_banded, reduced the single-thread work in updating vertical scores, removed set_and_get_first_column_score().

6494668

[cudapoa-optimiztion] reordered updating thread cells in the NW while loop. This reduced register count from 78 to 75.

d4bd68c

[cudapoa-optimiztion] prefetch node_id in backtracking phase

0f46113

[cudapoa-optimiztion] in nw_banded, changed annoying exception of column == 0 to -1, to get rid of it is a better solution! also replaced get_score() with get_score_adaptive() in nw_adaptive, better solution is to unify similar kernels

ac7e42f

[cudapoa-optimiztion] minor fix

ffffeea

[cudapoa-optimiztion] added launch_bounds to generatePOAKernel() to convince compiler finding a way to minimize register usage down to 72. It worked without any register spills. rev 8

e16de9c

[cudapoa] fixed misaligned address bug for Score4T when ScoreT is 32-bit

d2305b7
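
Several commits in this batch (cec037e, 300c563, 68b3dc3) turn runtime options into template arguments so that the adaptive, static and full alignment paths are compiled separately. The idea can be sketched in host-side C++ as below. This is only an illustration: `BandMode`, `RowWidth` and `row_width_for` are hypothetical names, and the per-mode widths are placeholders rather than cudapoa's actual sizing rules. In a CUDA kernel, each instantiation is compiled on its own, so the compiler can allocate registers for one path without carrying state for the others.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical banding modes; names are illustrative, not the cudapoa API.
enum class BandMode
{
    full,
    static_band,
    adaptive_band
};

// The primary template is declared only; each mode gets its own
// specialization, so code for one mode never appears in another.
template <BandMode Mode>
struct RowWidth;

template <>
struct RowWidth<BandMode::full>
{
    // Full alignment keeps every column plus one boundary column.
    static int32_t get(int32_t read_length, int32_t /*band_width*/)
    {
        return read_length + 1;
    }
};

template <>
struct RowWidth<BandMode::static_band>
{
    // Placeholder rule: a static band keeps a fixed number of columns.
    static int32_t get(int32_t /*read_length*/, int32_t band_width)
    {
        return band_width;
    }
};

template <>
struct RowWidth<BandMode::adaptive_band>
{
    // Placeholder rule (assumption): an adaptive band may grow to twice its width.
    static int32_t get(int32_t /*read_length*/, int32_t band_width)
    {
        return 2 * band_width;
    }
};

// The runtime choice is made once, outside the hot loop; the selected
// specialization itself contains no branch on the mode.
inline int32_t row_width_for(BandMode mode, int32_t read_length, int32_t band_width)
{
    assert(read_length >= 0 && band_width > 0);
    switch (mode)
    {
    case BandMode::full:
        return RowWidth<BandMode::full>::get(read_length, band_width);
    case BandMode::static_band:
        return RowWidth<BandMode::static_band>::get(read_length, band_width);
    default:
        return RowWidth<BandMode::adaptive_band>::get(read_length, band_width);
    }
}
```

Calling `row_width_for(BandMode::full, 100, 16)` selects the full-alignment specialization; the same switch-then-instantiate pattern is how a host launcher would typically pick one kernel instantiation per mode.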

r-mafi added 5 commits August 31, 2020 18:52

[cudapoa-optimization] removed __align__(16) ScoreT4<int32_t>, as in some cases (random) it can result in misaligned address error

ab8116f

[cudapoa-optimization] reverse order of updating score matrix cells in while loop in cudapoa-full alignment - rev9

eed240e

Merge branch 'dev-v0.6.0' of https://github.com/clara-parabricks/ClaraGenomicsAnalysis into cudapoa_optimization_v2

9019e6e

[cudapoa-optimization] removed ToDo item in TopSort

10d1d7b

Merge branch 'dev-v0.6.0' of https://github.com/clara-parabricks/ClaraGenomicsAnalysis into cudapoa_optimization_v2

fdcea82
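
Commits 43f3456, d2305b7 and ab8116f above deal with `__align__` on the 4-wide score type and a resulting misaligned address error. The source does not state the root cause, so the following is only a sketch of one common way such an error arises: a type declared 16-byte aligned is loaded from an address that is a multiple of 4 but not of 16. `Score4`, `is_aligned` and `can_load_score4` are hypothetical names, and standard C++ `alignas(16)` stands in for CUDA's `__align__(16)`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a 4-wide score type with 32-bit scores.
// alignas(16) plays the role of CUDA's __align__(16) here.
struct alignas(16) Score4
{
    int32_t s0, s1, s2, s3;
};

// Returns true when the address is a multiple of `alignment` bytes.
inline bool is_aligned(const void* p, std::size_t alignment)
{
    assert(alignment > 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// A 4-wide load starting at scores[index] is only safe as a Score4 access
// when that address meets Score4's alignment requirement.
inline bool can_load_score4(const int32_t* scores, std::size_t index)
{
    return is_aligned(scores + index, alignof(Score4));
}
```

With a 16-byte-aligned buffer of `int32_t`, indices 0 and 4 pass the check while index 1 does not, which is the situation in which a vectorized load of four 32-bit scores would fault on the GPU.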

r-mafi added the enhancement (New feature or request) and cudapoa (GPU-based partial order alignment) labels Sep 1, 2020

r-mafi self-assigned this Sep 1, 2020

r-mafi linked an issue Sep 1, 2020 that may be closed by this pull request

[cudapoa] reduce register count in cudapoa kernels #547

Closed

r-mafi requested a review from tijyojwad September 1, 2020 22:54

r-mafi added 6 commits September 10, 2020 11:06

[cudapoa-optimization] changed uint16_t for loop counters in CUDA, this helps compiler perform more aggressive optimizations. When using unsigned type compiler can't perform such optimizations due to overflow check semantics.

3badd60

[cudapoa-optimization] added __forceinline__ to NW device kernels

a5db579

[cudapoa] removed unused buffers outgoing_edge_weights and node_distance_to_head, although the latter did not waste any memory.

3a437cf

[cudapoa-optimization] minor cleanup, removing some unused args

d2e5b68

[cudapoa-optimization] changing SizeT registers in nw_banded to int32_t

f9867f5

[cudapoa-optimization] changing most of ScoreT registers in nw_banded to int32_t

961f243
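
Commit 3badd60 above replaces `uint16_t` loop counters. The reasoning in the commit message follows from the language rules: unsigned arithmetic is defined to wrap around, so the compiler must preserve that behaviour, whereas signed overflow is undefined and the compiler may assume it never happens. A minimal sketch of the two loop styles, with hypothetical function names that are not part of cudapoa, is:

```cpp
#include <cassert>
#include <cstdint>

// Narrow unsigned counter: `i` is promoted to int for each comparison and
// increment, then narrowed back to 16 bits with defined wrap-around semantics.
inline int32_t sum_scores_u16(const int16_t* scores, uint16_t n)
{
    int32_t total = 0;
    for (uint16_t i = 0; i < n; ++i)
    {
        total += scores[i];
    }
    return total;
}

// Signed 32-bit counter: overflow is undefined behaviour, so the compiler may
// assume `i` never wraps when it unrolls or strength-reduces the loop.
inline int32_t sum_scores_i32(const int16_t* scores, int32_t n)
{
    assert(n >= 0);
    int32_t total = 0;
    for (int32_t i = 0; i < n; ++i)
    {
        total += scores[i];
    }
    return total;
}
```

Both functions return the same result for valid inputs; the difference lies only in what the compiler is allowed to assume about the counter, and whether that changes register usage has to be checked per kernel.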

tijyojwad suggested changes Sep 10, 2020

cudapoa/src/cudapoa_kernels.cuh (outdated, resolved)

cudapoa/src/cudapoa_nw_banded.cuh (resolved)

r-mafi added 10 commits September 10, 2020 18:58

[cudapoa-optimization] changing gap, match, mismatch score types from ScoreT to int32_t, reduced reg count down to 64

312dd5f

[cudapoa-optimization] changing SizeT registers to int32_t in nw-adaptive-banded

95065f8

[cudapoa-optimization] changing ScoreT registers to int32_t in nw-adaptive-banded

9fae1e1

[cudapoa-optimization] changing ScoreT and SizeT registers to int32_t in nw-full to be consistent with the changes in banded versions, although register count remained the same, 64

5f4af12

[cudapoa-optimization] more (minor) changes affecting register count (to see the actual number of registers, launchbounds() should be commented)

7886ff3

[cudapoa-optimization] addressed PR comments, also changed GW_POA_KERNELS_MAX_THREADS_PER_BLOCK to 1024, to enforce 64 register count

730dda1

Merge branch 'dev-v0.6.0' of https://github.com/clara-parabricks/Clar…

670b464

…aGenomicsAnalysis into cudapoa_optimization_v2

[cudapoa] removed stream arg from CudaPoaBatch in a couple of cases in python

c8a0c17

[cudapoa] minor fix

c2b82a6

[cudapoa-optimization] fixed a bug in banded NW introduced in previous commits

7d14cbe
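
Commits e16de9c and 730dda1 above use launch bounds to cap register usage, pairing 1024 threads per block with a 64-register count. The arithmetic behind that pairing can be sketched as follows, assuming a limit of 65536 32-bit registers per block (an assumption; the actual value for a device is reported by `cudaDeviceProp::regsPerBlock`). The function name is hypothetical, and the sketch ignores register allocation granularity and the hardware cap on registers per thread.

```cpp
#include <cassert>
#include <cstdint>

// Upper bound on registers per thread if a whole block of
// `threads_per_block` threads must fit into `registers_per_block` registers.
// Hypothetical helper for illustration; not part of cudapoa or the CUDA API.
inline int32_t max_registers_per_thread(int32_t registers_per_block, int32_t threads_per_block)
{
    assert(registers_per_block > 0 && threads_per_block > 0);
    return registers_per_block / threads_per_block;
}
```

Under the stated assumption, `max_registers_per_thread(65536, 1024)` gives 64, which matches the register count named in the commit message. Declaring a kernel with `__launch_bounds__(1024)` tells the compiler the kernel may be launched with up to 1024 threads per block, so it has to keep per-thread register usage within that budget.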

r-mafi requested a review from tijyojwad September 11, 2020 19:17

tijyojwad approved these changes Sep 14, 2020

tijyojwad merged commit 0e9a6f3 into NVIDIA-Genomics-Research:dev-v0.6.0 Sep 14, 2020
