Tensor cores used only for fp16 in interleaved multihead attention #17994
Conversation
Hey @blchu, thanks for submitting the PR.
CI supported jobs: [centos-cpu, unix-cpu, edge, centos-gpu, miscellaneous, windows-gpu, clang, sanity, unix-gpu, website, windows-cpu]
Is this tested on CUDA architectures < 5?
I've tested on a CUDA architecture < 5 (specifically, a K80); there's no issue running the operator.
@mxnet-bot run ci [all]
@ChaiBapchya is something happening to CI right now? The bot did not trigger the CI. Thanks!
@mxnet-bot run ci [all]
@blchu can you rebase on master and force push?
Force-pushed from 8862691 to 9a99b87
@mxnet-bot run ci [unix-cpu, unix-gpu]
Jenkins CI successfully triggered: [unix-gpu, unix-cpu]
@mxnet-bot run ci [unix-cpu] (download of cifar failed in the Perl test)
Jenkins CI successfully triggered: [unix-cpu]
…iction (apache#17994) (cherry picked from commit afae030)
Description
Fixed an issue where fp32 inputs used tensor cores in the interleaved multihead attention operators, resulting in lower-precision calculations and a potential reduction in accuracy. Tensor cores are now used only for fp16 inputs.
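The fix amounts to gating the tensor-core code path on the input dtype. A minimal Python sketch of that guard, under stated assumptions: the helper name `select_gemm_algo` and the string constants are illustrative only (the real change selects cuBLAS algorithm enums in MXNet's C++ operator code), but the dispatch logic mirrors the behavior described above.

```python
# Hypothetical sketch of the dtype guard: request tensor-op GEMM math
# only for fp16 inputs, so fp32 inputs keep full-precision arithmetic.
# The constant names echo cuBLAS enums but are plain strings here.

CUBLAS_GEMM_DEFAULT = "CUBLAS_GEMM_DEFAULT"                      # default fp32 math path
CUBLAS_GEMM_DEFAULT_TENSOR_OP = "CUBLAS_GEMM_DEFAULT_TENSOR_OP"  # tensor-core path

def select_gemm_algo(dtype: str) -> str:
    """Return the GEMM algorithm flag appropriate for the input dtype."""
    if dtype == "float16":
        # fp16 inputs: tensor cores give a large speedup with no extra
        # precision loss, since the inputs are already half precision.
        return CUBLAS_GEMM_DEFAULT_TENSOR_OP
    # fp32 (and any other) inputs: stay on the default math path so the
    # multiplication is not silently performed at reduced precision.
    return CUBLAS_GEMM_DEFAULT
```

With this guard in place, only `float16` inputs take the tensor-core path; `float32` inputs fall through to the default algorithm, which is the behavior change this PR describes.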
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
Comments