Fp16 nchw for cudnn-fp16 backend (support GTX 16xx GPUs) #849
Conversation
use bestmove_is_sent_ for Search::IsSearchActive() (LeelaChessZero#502)
get latest
get latest
get latest
get latest
Get latest
- Replace all cudaMemcpyAsync calls used for loading weights with cudaMemcpy, as the source (in CPU memory) could be deleted before the async version of the function actually performs the copy (see the sketch below).
- Minor naming/style changes.
- Add a comment explaining what the policy map layer does and how the layout conversion from CHW to HWC works.
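To illustrate the first point, a minimal sketch of the hazard (the helper name is hypothetical, not the PR's exact code): cudaMemcpyAsync only enqueues the copy on a stream, so if the host-side weight buffer is freed before the stream reaches it, the copy can read freed memory. The blocking cudaMemcpy avoids this.

```cpp
#include <cuda_runtime.h>
#include <vector>

// Hypothetical weight-upload helper illustrating the bullet above.
void UploadWeights(float* device_weights,
                   const std::vector<float>& host_weights,
                   cudaStream_t stream) {
  const size_t bytes = host_weights.size() * sizeof(float);

  // Hazardous: cudaMemcpyAsync returns immediately. If the caller frees
  // host_weights before the stream executes the copy, the GPU may read
  // freed (pageable) host memory.
  // cudaMemcpyAsync(device_weights, host_weights.data(), bytes,
  //                 cudaMemcpyHostToDevice, stream);

  // Safe: cudaMemcpy blocks until the host buffer has been consumed,
  // so the caller may free it as soon as this returns.
  cudaMemcpy(device_weights, host_weights.data(), bytes,
             cudaMemcpyHostToDevice);
}
```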
get latest
get latest
- Try the NCHW layout and Winograd algorithm for convolutions (same as what we use for fp32).
- It's slower than NHWC/fp16 on GPUs with tensor cores, but should give some speedup on GP100 and TU11x GPUs.
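A rough sketch of what that configuration looks like with the cuDNN API (error handling omitted; this is an illustration, not the PR's exact code):

```cpp
#include <cudnn.h>

// Configure an fp16 convolution in NCHW layout, as described above.
void ConfigureFp16Nchw(cudnnTensorDescriptor_t in_desc,
                       cudnnConvolutionDescriptor_t conv_desc,
                       int n, int c, int h, int w) {
  // Half-precision tensor in NCHW layout (same layout as the fp32 path).
  cudnnSetTensor4dDescriptor(in_desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_HALF,
                             n, c, h, w);
  // Plain fp16 math, no tensor ops (see the math-type note below).
  cudnnSetConvolutionMathType(conv_desc, CUDNN_DEFAULT_MATH);
  // The Winograd algorithm is then requested when running the convolution,
  // e.g. CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED passed to
  // cudnnConvolutionForward().
}
```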
Some benchmarks on GTX 1650
It's surprising that the fp16/NHWC path even works on the GTX 1650. Maybe cudnn/cublas is just emulating it, and that's why it's so slow. Even with the NCHW path, if the TENSOR_OP_MATH flag is enabled it's still very slow (again, likely because tensor cores have to be emulated somehow). The good news is that the fp16/NCHW layout without TENSOR_OP_MATH is almost 2x faster than fp32.
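The flag in question is set per convolution descriptor in cuDNN; a sketch of the switch, assuming a has_tensor_cores capability check:

```cpp
#include <cudnn.h>

// Request tensor-op math only where real tensor cores exist; on GTX 16xx
// the TENSOR_OP_MATH path is very slow (apparently emulated), so fall
// back to default fp16 math there.
void SetConvMathType(cudnnConvolutionDescriptor_t conv_desc,
                     bool has_tensor_cores) {
  cudnnSetConvolutionMathType(
      conv_desc,
      has_tensor_cores ? CUDNN_TENSOR_OP_MATH : CUDNN_DEFAULT_MATH);
}
```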
- Not sure why Visual C++ works fine!
- GP100 (SM 6.0)
- GTX 16xx GPUs (unfortunately they report the same SM 7.5 as RTX cards, so a string compare on the device name is needed)

Default is auto-select (-1); see the detection sketch below.
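A sketch of what the auto-selection could look like (helper name hypothetical; the "GTX 16" string compare is the workaround mentioned above, since TU11x reports the same SM 7.5 as the RTX 20xx cards):

```cpp
#include <cuda_runtime.h>
#include <cstring>

// Decide whether to prefer the fp16/NCHW path for a device (sketch).
bool PreferFp16Nchw(int device_id) {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, device_id);
  if (prop.major == 6 && prop.minor == 0) return true;  // GP100
  // GTX 16xx (TU11x) has no tensor cores but shares SM 7.5 with RTX
  // cards, so the device name has to be inspected.
  if (prop.major == 7 && prop.minor == 5 &&
      std::strstr(prop.name, "GTX 16") != nullptr) {
    return true;
  }
  return false;
}
```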
Use a bool option instead of an int, and use the IsDefault mechanism to check whether the option was forced or not.
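A sketch of the idea (the OptionsDict method names and header path are assumed from lc0's options code, not taken verbatim from this PR):

```cpp
#include "utils/optionsdict.h"  // lc0's options dictionary (assumed path)

// If the user never touched the "nhwc" flag, use auto-detection;
// otherwise honor the explicitly set value.
bool DecideNhwc(const lczero::OptionsDict& options, bool auto_detected) {
  if (options.IsDefault<bool>("nhwc")) return auto_detected;
  return options.Get<bool>("nhwc");
}
```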
@@ -0,0 +1,2 @@
layers.cc
lc0@exe.vcxproj -> C:\Ankan\git\ankan\lc0\build\.\lc0.exe
What is this file? :)
Sorry. Likely some intermediate build file that accidentally got submitted. Will remove it.
Does the merged work apply only to the cards using GP100, i.e., the Quadro GP100 and the Tesla P100? Can similar techniques apply to other Pascal chips, in particular the GP102 in the Titan X (Pascal) and Titan Xp? NVIDIA advertises some level of fp16 acceleration on those, but I don't know enough of the implementation to know the differences. If there's a path to accelerating performance on GP102 using similar techniques, please let me know and I'll open a feature request issue.
Unfortunately no; other Pascal chips (GP102/GP104/GP106, etc.) don't have support for fp16 math. They do support higher-throughput int8 math, but right now we don't have support in lc0 for int8 precision.
SM 5.3 support for fp16 is still missing for Jetson. I tested just adding it, and it seems to double the speed.
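For reference, a sketch of the kind of capability check being extended (assumed helper, not the PR's exact code):

```cpp
#include <cuda_runtime.h>

// fp16 arithmetic support by compute capability (sketch).
bool SupportsFp16Math(const cudaDeviceProp& prop) {
  if (prop.major == 5 && prop.minor == 3) return true;  // Tegra X1 / Jetson
  if (prop.major == 6 && prop.minor == 0) return true;  // GP100
  return prop.major >= 7;  // Volta, Turing (incl. TU11x), and newer
}
```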
Do you know whether using fp16 with CC 6.2 (Jetson TX2) also gives a performance gain?
For the cudnn-fp16 backend: support GPUs without tensor cores (e.g. GP100 and the GTX 16xx series).
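For reference, the backend is selected with lc0's standard flag, e.g. `./lc0 benchmark --backend=cudnn-fp16`; the auto-selection described in this PR then decides between the NCHW and NHWC fp16 paths.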
TODO: