This repository has been archived by the owner on Mar 12, 2021. It is now read-only.

CURAND_STATUS_PREEXISTING_FAILURE with v2.0.1 but not v1.7.3 #682

Closed
marius311 opened this issue Apr 15, 2020 · 8 comments

@marius311
Contributor

I recently upgraded from 1.7.3 to 2.0.1 and started seeing this error sporadically in my code, sometimes only after about a minute of running, but always eventually hitting it. Unfortunately I'm unable to come up with an MWE, but I can reliably reproduce this on my system, including switching back and forth between 1.7.3 and 2.0.1 and seeing it appear / disappear.

The code is doing what I think are fairly standard manipulations of CuArrays (no custom kernels). I have Julia 1.4, NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2, and

  • Produces error: CuArrays v2.0.1, GPUArrays v3.1.0, CUDAapi v4.0.0, CUDAdrv v6.2.2, CUDAnative v3.0.4
  • No error: CuArrays v1.7.3, GPUArrays v2.0.1, CUDAapi v3.1.0, CUDAdrv v6.0.0, CUDAnative v2.10.2

and stacktrace (which I see maybe 80% of the time, the other 20% it just seems to hang):

CURANDError: preexisting failure on library entry (code 202, CURAND_STATUS_PREEXISTING_FAILURE)
throw_api_error at /global/homes/m/marius/.julia/packages/CuArrays/e8PLr/src/rand/error.jl:53
macro expansion at /global/homes/m/marius/.julia/packages/CuArrays/e8PLr/src/rand/error.jl:64 [inlined]
curandGenerateSeeds at /global/homes/m/marius/.julia/packages/CUDAapi/XuSHC/src/call.jl:93
seed! at /global/homes/m/marius/.julia/packages/CuArrays/e8PLr/src/rand/random.jl:46 [inlined]
seed! at /global/homes/m/marius/.julia/packages/CuArrays/e8PLr/src/rand/random.jl:44 [inlined]
#123 at /global/homes/m/marius/.julia/packages/CuArrays/e8PLr/src/rand/CURAND.jl:35 [inlined]
get! at ./abstractdict.jl:663
generator at /global/homes/m/marius/.julia/packages/CuArrays/e8PLr/src/rand/CURAND.jl:33
#randn!#102 at /global/homes/m/marius/.julia/packages/CuArrays/e8PLr/src/rand/random.jl:167
randn! at /global/homes/m/marius/.julia/packages/CuArrays/e8PLr/src/rand/random.jl:167 [inlined]
white_noise at /global/homes/m/marius/work/s4/dev/CMBLensing/src/flat_s0.jl:95 [inlined]
# everything below here my code

Any ideas what may have changed that could be causing this?

Would it be helpful if I bisect to the exact commit? Or maybe an expert can just guess what's going on from here?

@marius311 marius311 added the bug label Apr 15, 2020
@maleadt
Member

maleadt commented Apr 15, 2020

Could you add a call to CUDAdrv.synchronize() before the failing CURAND API call, e.g. in curandGenerateSeeds, to see if we can capture that preexisting failure?

@marius311
Contributor Author

I tried both

@checked function curandGenerateSeeds(generator)
    initialize_api()
    CUDAdrv.synchronize()
    @runtime_ccall((:curandGenerateSeeds, libcurand()), curandStatus_t,
                   (curandGenerator_t,),
                   generator)
end

and

@checked function curandGenerateSeeds(generator)
    CUDAdrv.synchronize()
    initialize_api()
    @runtime_ccall((:curandGenerateSeeds, libcurand()), curandStatus_t,
                   (curandGenerator_t,),
                   generator)
end

and I still get the same error / stack trace, although anecdotally it seems like it takes a little longer to trigger (might just be in my head). Is that what you meant?

@maleadt
Member

maleadt commented Apr 15, 2020

Yes, but sadly it doesn't catch anything. I wonder why CURAND thinks there's a preexisting failure then.

Bisecting would be useful. Due to the coupling between CuArrays/CUDAnative/GPUArrays you'll probably have to use the Manifest that's part of CuArrays (only a few commits don't work, you can bisect skip those).

@marius311
Contributor Author

Ok, bisected it to this being the first bad commit: 65a35b1

I checked a couple of times and I'm pretty sure this is it.

I'm using the Manifest like you suggested, so the breakdown is:

  • bad - CuArrays 65a35b1, CUDAapi v4.0.0, CUDAdrv v6.2.0, CUDAnative v3.0.0

  • good - CuArrays 138ece7, CUDAapi v4.0.0, CUDAdrv v6.2.0, CUDAnative v2.10.2 #58c6755

I notice that whenever I switch between these two commits I get

Building the CUDAnative run-time library for your sm_70 device, this might take a while...

which may be relevant.

@maleadt
Member

maleadt commented Apr 16, 2020

Hmm, that doesn't help much. Are you using multiple threads or tasks?

@marius311
Contributor Author

My code is single-threaded, and can run in a one-MPI-process-per-GPU configuration. I mentioned above that it sometimes hangs instead of giving me the CURAND_STATUS_PREEXISTING_FAILURE error, but based on your comment and that bisect I ran my code with a single MPI process, and it looks like in this case it's just always hanging. Maybe the CURAND_STATUS_PREEXISTING_FAILURE is a red herring / side effect of the real issue?

With a single process, I reproduced the hang about 5 times (with the "bad" versions from above), and each time I get this identical stack trace if I just kill it:

free at /global/u1/m/marius/work/s4/dev/CuArrays/src/memory/binned.jl:393
unknown function (ip: 0x2aac1ff19ad2)
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
macro expansion at /global/homes/m/marius/.julia/packages/TimerOutputs/7Id5J/src/TimerOutput.jl:228 [inlined]
macro expansion at /global/u1/m/marius/work/s4/dev/CuArrays/src/memory.jl:218 [inlined]
macro expansion at ./util.jl:234 [inlined]
free at /global/u1/m/marius/work/s4/dev/CuArrays/src/memory.jl:217 [inlined]
_unsafe_free! at /global/u1/m/marius/work/s4/dev/CuArrays/src/array.jl:51
unsafe_free! at /global/u1/m/marius/work/s4/dev/CuArrays/src/array.jl:40
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
jl_apply at /global/u1/m/marius/src/julia-1.4/src/julia.h:1692 [inlined]
run_finalizer at /global/u1/m/marius/src/julia-1.4/src/gc.c:277
jl_gc_run_finalizers_in_list at /global/u1/m/marius/src/julia-1.4/src/gc.c:363
run_finalizers at /global/u1/m/marius/src/julia-1.4/src/gc.c:391 [inlined]
run_finalizers at /global/u1/m/marius/src/julia-1.4/src/gc.c:370
jl_gc_collect at /global/u1/m/marius/src/julia-1.4/src/gc.c:3124
maybe_collect at /global/u1/m/marius/src/julia-1.4/src/gc.c:827 [inlined]
jl_gc_pool_alloc at /global/u1/m/marius/src/julia-1.4/src/gc.c:1142
jl_gc_alloc_ at /global/u1/m/marius/src/julia-1.4/src/julia_internal.h:246 [inlined]
_new_array_ at /global/u1/m/marius/src/julia-1.4/src/array.c:106 [inlined]
_new_array at /global/u1/m/marius/src/julia-1.4/src/array.c:162 [inlined]
jl_alloc_array_1d at /global/u1/m/marius/src/julia-1.4/src/array.c:433
Array at ./boot.jl:405 [inlined]
rehash! at ./dict.jl:193
_setindex! at ./dict.jl:367 [inlined]
setindex! at ./dict.jl:388
macro expansion at /global/u1/m/marius/work/s4/dev/CuArrays/src/memory/binned.jl:384 [inlined]
macro expansion at ./lock.jl:183 [inlined]
alloc at /global/u1/m/marius/work/s4/dev/CuArrays/src/memory/binned.jl:383
unknown function (ip: 0x2aac1fe7fcb5)
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
macro expansion at /global/homes/m/marius/.julia/packages/TimerOutputs/7Id5J/src/TimerOutput.jl:228 [inlined]
macro expansion at /global/u1/m/marius/work/s4/dev/CuArrays/src/memory.jl:180 [inlined]
macro expansion at ./util.jl:234 [inlined]
alloc at /global/u1/m/marius/work/s4/dev/CuArrays/src/memory.jl:179 [inlined]
CuArray at /global/u1/m/marius/work/s4/dev/CuArrays/src/array.jl:107
CuArray at /global/u1/m/marius/work/s4/dev/CuArrays/src/array.jl:115 [inlined]
similar at ./abstractarray.jl:671 [inlined]
similar at ./abstractarray.jl:670 [inlined]
similar at /global/u1/m/marius/work/s4/dev/CuArrays/src/broadcast.jl:11 [inlined]
copy at ./broadcast.jl:840
materialize at ./broadcast.jl:820
copy at /global/homes/m/marius/.julia/packages/GPUArrays/QDGmr/src/host/abstractarray.jl:173 [inlined]
unsafe_execute! at /global/u1/m/marius/work/s4/dev/CuArrays/src/fft/fft.jl:412 [inlined]
mul! at /global/u1/m/marius/work/s4/dev/CuArrays/src/fft/fft.jl:449 [inlined]
Fourier at /global/homes/m/marius/work/s4/dev/CMBLensing/src/flat_s0.jl:74 [inlined]
Basislike at /global/homes/m/marius/work/s4/dev/CMBLensing/src/generic.jl:56
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
Ð! at /global/homes/m/marius/work/s4/dev/CMBLensing/src/generic.jl:62
v! at /global/homes/m/marius/work/s4/dev/CMBLensing/src/lenseflow.jl:145
unknown function (ip: 0x2aac5fcebc0e)
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
RK4Solver at /global/homes/m/marius/work/s4/dev/CMBLensing/src/numerical_algorithms.jl:25
unknown function (ip: 0x2aac5fce911e)
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
odesolve at /global/homes/m/marius/work/s4/dev/CMBLensing/src/numerical_algorithms.jl:53
unknown function (ip: 0x2aac5fce85aa)
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
back at /global/homes/m/marius/work/s4/dev/CMBLensing/src/flowops.jl:40
#187#back at /global/homes/m/marius/.julia/packages/ZygoteRules/6nssF/src/adjoint.jl:49
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
lnP at /global/homes/m/marius/work/s4/dev/CMBLensing/src/posterior.jl:59 [inlined]
Pullback at /global/homes/m/marius/.julia/packages/Zygote/4tJp5/src/compiler/interface2.jl:0
#175 at /global/homes/m/marius/.julia/packages/Zygote/4tJp5/src/lib/lib.jl:170 [inlined]
#344#back at /global/homes/m/marius/.julia/packages/ZygoteRules/6nssF/src/adjoint.jl:49 [inlined]
lnP at /global/homes/m/marius/work/s4/dev/CMBLensing/src/posterior.jl:70 [inlined]
Pullback at /global/homes/m/marius/.julia/packages/Zygote/4tJp5/src/compiler/interface2.jl:0
lnP at /global/homes/m/marius/work/s4/dev/CMBLensing/src/posterior.jl:53 [inlined]
Pullback at /global/homes/m/marius/.julia/packages/Zygote/4tJp5/src/compiler/interface2.jl:0
unknown function (ip: 0x2aac5fcdc6e3)
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
#460 at /global/homes/m/marius/work/s4/dev/CMBLensing/src/sampling.jl:286 [inlined]
Pullback at /global/homes/m/marius/.julia/packages/Zygote/4tJp5/src/compiler/interface2.jl:0
#36 at /global/homes/m/marius/.julia/packages/Zygote/4tJp5/src/compiler/interface.jl:36
unknown function (ip: 0x2aac5fcdb043)
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
gradient at /global/homes/m/marius/.julia/packages/Zygote/4tJp5/src/compiler/interface.jl:45
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
jl_apply at /global/u1/m/marius/src/julia-1.4/src/julia.h:1692 [inlined]
do_apply at /global/u1/m/marius/src/julia-1.4/src/builtins.c:643
jl_f__apply_latest at /global/u1/m/marius/src/julia-1.4/src/builtins.c:693
#invokelatest#1 at ./essentials.jl:712 [inlined]
invokelatest at ./essentials.jl:711 [inlined]
#419#420 at /global/homes/m/marius/work/s4/dev/CMBLensing/src/util.jl:272 [inlined]
#419 at /global/homes/m/marius/work/s4/dev/CMBLensing/src/util.jl:272 [inlined]
#418 at /global/homes/m/marius/work/s4/dev/CMBLensing/src/sampling.jl:14
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
macro expansion at /global/homes/m/marius/work/s4/dev/CMBLensing/src/sampling.jl:25 [inlined]
macro expansion at /global/homes/m/marius/.julia/packages/ProgressMeter/g1lse/src/ProgressMeter.jl:717 [inlined]
#symplectic_integrate#414 at /global/homes/m/marius/work/s4/dev/CMBLensing/src/sampling.jl:23
unknown function (ip: 0x2aac5fc69baa)
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
symplectic_integrate##kw at /global/homes/m/marius/work/s4/dev/CMBLensing/src/sampling.jl:17
unknown function (ip: 0x2aac5fc69399)
symplectic_integrate##kw at /global/homes/m/marius/work/s4/dev/CMBLensing/src/sampling.jl:17
unknown function (ip: 0x2aac5fc69195)
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
macro expansion at /global/homes/m/marius/work/s4/dev/CMBLensing/src/sampling.jl:284 [inlined]
macro expansion at ./util.jl:234 [inlined]
#458 at /global/homes/m/marius/work/s4/dev/CMBLensing/src/sampling.jl:277
iterate at ./generator.jl:47 [inlined]
_collect at ./array.jl:678
collect_similar at ./array.jl:607
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
map at ./abstractarray.jl:2072
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
#sample_joint#449 at /global/homes/m/marius/work/s4/dev/CMBLensing/src/sampling.jl:249
unknown function (ip: 0x2aac5c096c3f)
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2158 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
sample_joint##kw at /global/homes/m/marius/work/s4/dev/CMBLensing/src/sampling.jl:176
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2158 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
jl_apply at /global/u1/m/marius/src/julia-1.4/src/julia.h:1692 [inlined]
do_call at /global/u1/m/marius/src/julia-1.4/src/interpreter.c:369
eval_value at /global/u1/m/marius/src/julia-1.4/src/interpreter.c:458
eval_stmt_value at /global/u1/m/marius/src/julia-1.4/src/interpreter.c:409 [inlined]
eval_body at /global/u1/m/marius/src/julia-1.4/src/interpreter.c:803
jl_interpret_toplevel_thunk at /global/u1/m/marius/src/julia-1.4/src/interpreter.c:911
jl_toplevel_eval_flex at /global/u1/m/marius/src/julia-1.4/src/toplevel.c:814
jl_toplevel_eval_flex at /global/u1/m/marius/src/julia-1.4/src/toplevel.c:764
slurmstepd: error: *** STEP 550402.8 ON cgpu03 CANCELLED AT 2020-04-16T14:18:58 ***
srun: Terminating job step 550402.8

@maleadt
Member

maleadt commented Apr 17, 2020

Ah, so even setindex can trigger the GC... That would explain this deadlock, which is a separate issue #685, but not the CURAND failures. I'll have a look at fixing the former, for which this backtrace is very helpful.
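
For readers following along, here is a minimal sketch of the hazard this backtrace suggests and one way to sidestep it. In the trace, alloc in binned.jl holds a lock while a Dict setindex! allocates, the GC then runs a finalizer, and that finalizer calls free in the same file, presumably waiting on a lock that is already held. The sketch below is not CuArrays code and not necessarily how the issue was fixed: Block, pool_lock, available and free_block are hypothetical names, and the approach (use trylock in the finalizer and re-register the finalizer when the lock is busy) is the delayed-finalizer pattern from the Julia manual.

```julia
# Hypothetical stand-ins; these names are illustrative and do not come from CuArrays.
mutable struct Block
    id::Int
    freed::Bool
end

# A non-reentrant lock guarding the pool's bookkeeping.
const pool_lock = Threads.SpinLock()
const available = Dict{Int,Block}()

# Finalizer: never blocks on pool_lock. If the lock is already held (for example
# by the allocation path that triggered this GC run), postpone the free by
# re-registering the finalizer instead of waiting.
function free_block(b::Block)
    b.freed && return nothing          # idempotent: may be called more than once
    if !trylock(pool_lock)
        finalizer(free_block, b)       # try again on a later GC run
        return nothing
    end
    try
        available[b.id] = b
        b.freed = true
    finally
        unlock(pool_lock)
    end
    return nothing
end

# Simulate a finalizer firing while the allocation path holds the lock.
held = Block(1, false)
lock(pool_lock)
free_block(held)                       # returns immediately instead of deadlocking
deferred = !held.freed
unlock(pool_lock)

free_block(held)                       # lock is free now, so the block is released
```

Calling free_block directly, rather than waiting for the GC, keeps the demonstration deterministic. With a blocking lock(pool_lock) in the finalizer, the first call would never return.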

@marius311
Contributor Author

This appears to be fixed for me on 2.2.0, presumably by the referenced issue above. Guessing the CURAND thing was just a random side-effect.
