I am working on some new features, and when running the R unit tests I noticed that the grad scaler unit test failed. Since my changes don't touch that part of the code, I checked whether the test passes on the unmodified source/package, and I get the same failure.
The value I get seems to depend on the type of GPU in use. Multiple NVIDIA A30s reproducibly return 1.006097, but an NVIDIA T4 returns 1.004341.
I'm wondering if this is specific to my systems or a known issue?
── Failed tests ─────────────────────────────────────────
Failure (test-autocast.R:236:3): grad scalers work correctly
sprintf("%1.6f", loss$item()) not equal to sprintf("%1.6f", 1.003786).
1/1 mismatches
x[1]: "1.006097"
y[1]: "1.003786"
[ FAIL 1 | WARN 0 | SKIP 16 | PASS 3003 ]
CUDA version 12.4
Driver Version: 565.57.01
I've tried both R 4.2.3 and R 4.4 (both with OpenBLAS).
sessionInfo()
R version 4.2.3 (2023-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.5 LTS
Ohh, weird. I'd need to investigate, but I wouldn't be surprised if values change slightly depending on GPU type, which can trigger different code paths. Especially with AMP, some GPUs implement support for it at the hardware level.
Testing on a fresh install with an RTX 3070 (Windows 11; CUDA 12.4; torch 0.14.2), I get yet another answer: "1.002941".
I take this to mean I shouldn't be concerned about this and can go ahead with some pull requests?
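For reference, the failing check compares the loss as a string formatted to six decimals, so any GPU-dependent rounding difference fails it. A minimal base-R sketch of how a tolerance-based comparison would treat the values reported in this thread is below. The `5e-3` tolerance and the variable names are assumptions chosen only to cover these numbers; this is not what the package's test currently does.

```r
# Loss values reported in this thread for the same test, by GPU.
observed <- c(A30 = 1.006097, T4 = 1.004341, RTX3070 = 1.002941)
expected <- 1.003786  # value hard-coded in test-autocast.R

# Current style of check: exact match after formatting to six decimals.
exact_match <- sprintf("%1.6f", observed) == sprintf("%1.6f", expected)

# Alternative: accept any value whose relative difference is within a tolerance.
# The tolerance is an assumption, picked only to cover the values above.
tol <- 5e-3
rel_diff <- abs(observed - expected) / abs(expected)
within_tol <- rel_diff < tol

print(data.frame(observed, exact_match, rel_diff, within_tol))
```

In a testthat test, the equivalent would be something like `expect_equal(loss$item(), 1.003786, tolerance = 5e-3)`. Whether a tolerance that loose is acceptable for this test is for the maintainers to decide.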