Grad scaler unit test fails for unmodified source #1296

Open
CoreyGiles opened this issue Mar 19, 2025 · 3 comments

@CoreyGiles

I am working on some new features and, when running the R unit tests, I noticed that the grad scaler unit test fails. Since my changes don't touch that part of the code, I checked whether the test passes on the unmodified source/package, and I get the same failure.

The value I get seems to depend on the type of GPU in use. Multiple NVIDIA A30 GPUs reproducibly return 1.006097, but an NVIDIA T4 returns 1.004341.

Is this specific to my systems, or is it a known issue?

── Failed tests ─────────────────────────────────────────
Failure (test-autocast.R:236:3): grad scalers work correctly
sprintf("%1.6f", loss$item()) not equal to sprintf("%1.6f", 1.003786).
1/1 mismatches
x[1]: "1.006097"
y[1]: "1.003786"

[ FAIL 1 | WARN 0 | SKIP 16 | PASS 3003 ]

CUDA version 12.4
Driver Version: 565.57.01
I've tried both R 4.2.3 and 4.4 (both with openblas).

sessionInfo()
R version 4.2.3 (2023-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.5 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] torch_0.14.2 testthat_3.2.1.1

loaded via a namespace (and not attached):
[1] remotes_2.5.0 rematch2_2.1.2 purrr_1.0.2
[4] diffobj_0.3.5 colorspace_2.1-0 vctrs_0.6.5
[7] miniUI_0.1.1.1 usethis_2.2.3 htmltools_0.5.8.1
[10] utf8_1.2.4 rlang_1.1.3 pkgbuild_1.4.4
[13] urlchecker_1.0.1 later_1.3.2 pillar_1.9.0
[16] glue_1.7.0 withr_3.0.0 waldo_0.5.2
[19] bit64_4.0.5 sessioninfo_1.2.2 lifecycle_1.0.4
[22] stringr_1.5.1 munsell_0.5.1 mvtnorm_1.2-4
[25] devtools_2.4.5 htmlwidgets_1.6.4 evaluate_0.23
[28] memoise_2.0.1 callr_3.7.6 fastmap_1.1.1
[31] httpuv_1.6.15 ps_1.7.6 safetensors_0.1.2
[34] fansi_1.0.6 Rcpp_1.0.12 xtable_1.8-4
[37] promises_1.3.0 scales_1.3.0 cachem_1.0.8
[40] coro_1.0.4 desc_1.4.3 pkgload_1.3.4
[43] jsonlite_1.8.8 mime_0.12 fs_1.6.4
[46] bit_4.0.5 brio_1.1.5 digest_0.6.35
[49] stringi_1.8.4 processx_3.8.4 shiny_1.8.1.1
[52] numDeriv_2016.8-1.1 rprojroot_2.0.4 cli_3.6.2
[55] tools_4.2.3 magrittr_2.0.3 tibble_3.2.1
[58] profvis_0.3.8 crayon_1.5.2 pkgconfig_2.0.3
[61] ellipsis_0.3.2 rstudioapi_0.16.0 R6_2.5.1
[64] compiler_4.2.3

@dfalbel
Member

dfalbel commented Mar 19, 2025

Ohh weird. I'd need to investigate, but I wouldn't be surprised if values change slightly depending on GPU type, which can trigger different code paths. Especially with AMP, since some GPUs implement support for it at the hardware level.
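
To illustrate why small per-GPU differences trip this test, here is a minimal base-R sketch using only the loss values reported in this thread. It contrasts the exact 6-decimal string comparison from the failure message with a relative-error check. The threshold `rel_tol` is an assumption chosen for illustration, not a value taken from the torch test suite.

```r
# Loss values reported in this thread, keyed by GPU.
observed <- c(A30 = 1.006097, T4 = 1.004341, RTX3070 = 1.002941)
expected <- 1.003786  # reference value shown in the failure message

# Exact comparison of 6-decimal strings, as in the failing expectation.
exact_match <- sprintf("%1.6f", observed) == sprintf("%1.6f", expected)

# Relative-error comparison. rel_tol is an assumed, illustrative threshold.
rel_tol <- 5e-3
rel_err <- abs(observed - expected) / abs(expected)
within_tol <- rel_err < rel_tol

print(exact_match)        # FALSE for every GPU
print(round(rel_err, 4))  # relative deviation per GPU
print(within_tol)         # TRUE for every GPU at this tolerance
```

Inside the test suite, the same idea could probably be expressed with testthat's `expect_equal(loss$item(), 1.003786, tolerance = 5e-3)`. Whether a tolerance that loose is acceptable for this test is a question for the maintainers.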

@CoreyGiles
Author

Testing on a fresh install with an RTX 3070 (Windows 11; CUDA 12.4; Torch 0.14.2), I get yet another answer: "1.002941".
I take this to mean I shouldn't be concerned about this and can go ahead with some pull requests?

@dfalbel
Member

dfalbel commented Mar 20, 2025

Yes, please ignore this for now :) We can treat this as a separate bug.
