Grad scaler unit test fails for unmodified source #1296

CoreyGiles · 2025-03-19T06:42:47Z

I am working on some new features, and when running the R unit tests I noticed that the grad scaler unit test failed. Since my changes didn't touch that part of the code, I checked against the unmodified source/package and got the same failure.

The value I get seems to depend on the type of GPU in use. Multiple NVIDIA A30 GPUs reproducibly return 1.006097, but an NVIDIA T4 returns 1.004341.

I'm wondering if this is specific to my systems or a known issue?

── Failed tests ─────────────────────────────────────────
Failure (test-autocast.R:236:3): grad scalers work correctly
sprintf("%1.6f", loss$item()) not equal to sprintf("%1.6f", 1.003786).
1/1 mismatches
x[1]: "1.006097"
y[1]: "1.003786"

[ FAIL 1 | WARN 0 | SKIP 16 | PASS 3003 ]

CUDA version 12.4
Driver Version: 565.57.01
I've tried both R 4.2.3 and 4.4 (both with openblas).

sessionInfo()
R version 4.2.3 (2023-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.5 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] torch_0.14.2 testthat_3.2.1.1

loaded via a namespace (and not attached):
[1] remotes_2.5.0 rematch2_2.1.2 purrr_1.0.2
[4] diffobj_0.3.5 colorspace_2.1-0 vctrs_0.6.5
[7] miniUI_0.1.1.1 usethis_2.2.3 htmltools_0.5.8.1
[10] utf8_1.2.4 rlang_1.1.3 pkgbuild_1.4.4
[13] urlchecker_1.0.1 later_1.3.2 pillar_1.9.0
[16] glue_1.7.0 withr_3.0.0 waldo_0.5.2
[19] bit64_4.0.5 sessioninfo_1.2.2 lifecycle_1.0.4
[22] stringr_1.5.1 munsell_0.5.1 mvtnorm_1.2-4
[25] devtools_2.4.5 htmlwidgets_1.6.4 evaluate_0.23
[28] memoise_2.0.1 callr_3.7.6 fastmap_1.1.1
[31] httpuv_1.6.15 ps_1.7.6 safetensors_0.1.2
[34] fansi_1.0.6 Rcpp_1.0.12 xtable_1.8-4
[37] promises_1.3.0 scales_1.3.0 cachem_1.0.8
[40] coro_1.0.4 desc_1.4.3 pkgload_1.3.4
[43] jsonlite_1.8.8 mime_0.12 fs_1.6.4
[46] bit_4.0.5 brio_1.1.5 digest_0.6.35
[49] stringi_1.8.4 processx_3.8.4 shiny_1.8.1.1
[52] numDeriv_2016.8-1.1 rprojroot_2.0.4 cli_3.6.2
[55] tools_4.2.3 magrittr_2.0.3 tibble_3.2.1
[58] profvis_0.3.8 crayon_1.5.2 pkgconfig_2.0.3
[61] ellipsis_0.3.2 rstudioapi_0.16.0 R6_2.5.1
[64] compiler_4.2.3

dfalbel · 2025-03-19T11:01:20Z

Ohh weird. I'd need to investigate, but I wouldn't be surprised if values change slightly depending on GPU type, which can trigger different code paths. Especially with AMP, some GPUs implement support for it at the hardware level.
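
To make the size of the discrepancy concrete, here is a minimal sketch in base R that compares the loss values reported in this thread against the expected value, first with the exact six-decimal string comparison the failing expectation uses, then with an absolute tolerance. The tolerance of 5e-3 and the variable names are assumptions chosen for illustration; this is not a proposed fix for test-autocast.R.

```r
# Loss values reported in this thread, keyed by GPU (copied from the comments).
observed <- c(A30 = 1.006097, T4 = 1.004341, RTX3070 = 1.002941)

# Expected value from the failing expectation in test-autocast.R.
expected <- 1.003786

# Exact comparison at six decimals, mirroring the sprintf-based expectation.
exact_match <- sprintf("%1.6f", observed) == sprintf("%1.6f", expected)

# Tolerance-based comparison. The threshold is an illustrative assumption,
# not a value chosen by the torch maintainers.
tol <- 5e-3
within_tol <- abs(observed - expected) < tol

print(data.frame(observed, exact_match, within_tol))
```

None of the three values matches at six decimals, but all of them differ from the expected value by less than the assumed tolerance. Whether a looser comparison is appropriate for this test is a decision for the maintainers.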

CoreyGiles · 2025-03-20T11:29:15Z

Testing on a fresh install with an RTX 3070 (Windows 11; CUDA 12.4; Torch 0.14.2), I get yet another answer: "1.002941".
I take this to mean I shouldn't be concerned about this and can go ahead with some pull requests?

dfalbel · 2025-03-20T11:37:07Z

Yes, please ignore this for now :) We can treat this as a separate bug.
