Flaky test MKLDNN_BASE.MKLDNNSum #11998
Comments
@jinhuang415 please help take a look at this issue. |
@jinhuang415 - bounce |
We have reproduced this issue on a local cluster. It looks like it only happens very occasionally; we will update once there are further findings. |
@marcoabreu @zheng-da @azai91 @pengzhao-intel
It indicates that the related NDArray's shandle is nullptr while delay_alloc is set to false. Usually delay_alloc is only set to false after shandle is allocated, so most probably there is a race condition here (this issue is very hard to reproduce and only happens occasionally). I tried to print delay_alloc 2 times before
The detailed function calls are as below:
InitMKLDNNArray() runs in the test's main thread, while CopyFromToImpl() runs asynchronously in an Engine thread, so there may be a race condition in which InitMKLDNNArray() sets delay_alloc to false and causes the CHECK in CopyFromToImpl() to fail. A simple fix is to add a barrier that waits for the copy operation to finish before any further operations, which removes the race condition. The proposed change is as below:
|
In short, the C++ test case mixes MXNet ops with plain C/C++ functions: the MXNet ops are executed asynchronously, while the C/C++ functions run directly. In conclusion, this is NOT a bug in MKLDNN or the dependency engine. |
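For readers who want to see the pattern in isolation, here is a minimal sketch of the barrier idea using only the C++ standard library. It is not MXNet code and not the change that was merged: `ToyChunk`, `ToyCopy`, `ToyInit`, and `CopyThenInit` are hypothetical stand-ins for the NDArray chunk, the asynchronous copy, the direct C++ call, and the test body. In a real MXNet test the barrier would presumably be a call such as `NDArray::WaitToRead()` or `Engine::Get()->WaitForAll()`; which one the actual fix used is an assumption not confirmed by this thread.

```cpp
#include <future>
#include <memory>
#include <vector>

// Hypothetical stand-in for an NDArray chunk; not MXNet's real type.
struct ToyChunk {
  std::unique_ptr<std::vector<float>> shandle;  // storage, null until allocated
  bool delay_alloc = true;                      // true while allocation is deferred
};

// Stand-in for the asynchronous copy. It allocates storage lazily and then
// checks the invariant "delay_alloc == false implies storage exists".
bool ToyCopy(ToyChunk* dst, const std::vector<float>& src) {
  if (dst->delay_alloc) {
    dst->shandle.reset(new std::vector<float>(src.size()));
    dst->delay_alloc = false;
  }
  if (!dst->shandle) return false;  // analogous to the failing CHECK
  *dst->shandle = src;
  return true;
}

// Stand-in for the direct C++ call made from the test's main thread.
void ToyInit(ToyChunk* arr) {
  if (!arr->shandle) arr->shandle.reset(new std::vector<float>());
  arr->delay_alloc = false;
}

// Launches the copy on another thread, waits for it (the barrier), and only
// then touches the chunk from the calling thread.
bool CopyThenInit() {
  ToyChunk arr;
  const std::vector<float> src = {1.0f, 2.0f, 3.0f};

  std::future<bool> copied = std::async(
      std::launch::async, [&arr, &src] { return ToyCopy(&arr, src); });

  // Barrier: block until the asynchronous copy has finished.
  const bool ok = copied.get();

  // Safe now: no other thread is still reading or writing arr.
  ToyInit(&arr);
  return ok && arr.shandle && arr.shandle->size() == 3 &&
         (*arr.shandle)[2] == 3.0f;
}
```

Without the `copied.get()` call, `ToyInit` and `ToyCopy` could touch `delay_alloc` and `shandle` concurrently, which is the kind of interleaving described above. |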
Sorry, I'm currently swamped with high priority tasks and I don't have time to review your task or assist with inquiries. |
Fixed by #12080 |
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1319/pipeline
Cpp: MKLDNN+GPU