Job crash in ROOT 6.22 IBs #30359
Comments
A new Issue was created by @Dr15Jones Chris Jones. @Dr15Jones, @silviodonato, @dpiparo, @smuzaffar, @makortel can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
This issue is intended to follow any crashes that happen in ROOT 6.22 IBs |
assign core |
New categories assigned: core @Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks |
Is this reproducible? That is, can we get a more precise stack trace? So far I cannot infer where in this function it could crash (it is clearly in an inlined function; the question is which one). |
This may be related to the issue in #23715, which also involved crashes in MultipleScatteringParametrisation. That issue was closed without an entirely satisfactory resolution. |
Should have added that we saw this crash in the default slc7_amd64_gcc820 IBs, which is ROOT 6.20. Stack trace:
|
Another crash of this kind from the default IBs (CMSSW_11_2_X_2020-09-29-1100, slc7_amd64_gcc820, ROOT 6.20/09), workflow 1000.0 step 2
|
There is another failure in the workflow 1000.0 step 2 in default IB CMSSW_11_2_X_2020-09-30-1100 (the IB in between, CMSSW_11_2_X_2020-09-30-1100, the test succeeded). This time there were two threads calling
|
There is another failure in the workflow 1000.0 step 2 in default IB CMSSW_11_2_X_2020-10-12-1500. |
There is another failure in the workflow 1000.0 step 2 in default IB CMSSW_11_2_X_2020-10-15-1100. Also here there are two threads calling the
|
There is another failure in the workflow 1000.0 step 2 in the slc7_amd64_gcc900 IB CMSSW_11_2_X_2020-10-20-2300. |
There is another failure in the workflow 1000.0 step2 in the default IB CMSSW_11_2_X_2020-10-21-1100 |
There is another failure in the workflow 1000.0 step2 in the default IB CMSSW_11_2_X_2020-10-23-1100 |
There is another failure in the workflow 1000.0 step2 in the cc8_amd64_gcc8 IB CMSSW_11_2_X_2020-10-27-2300
|
There is another failure in the workflow 1000.0 step 2 in the slc7_amd64_gcc10 IB CMSSW_11_2_X_2020-10-29-2300
|
Do we have a debug version of this same set of libraries/tests running? If we do, does it also (sometimes) crash in a similar way? |
@pcanal yes. Please use CMSSW_11_2_ROOT622_X_2020-10-30-0800 release. |
@pcanal , please note that |
@makortel , ROOT622 IBs are in a good state now. We have not changed root but reverted a cmssw PR to get it into a green state. Do you see any issues with root 6.22 IBs now? |
The ROOT 6.22 IBs themselves indeed look good now. The sporadic issue with 1000.0 step2 (that this issue appears really to be about in practice) is presumably still there, but can occur in any IB. |
After staring at it for a while (and running valgrind, and running in a loop in gdb), I have failed to reproduce the problem, and I don't have a theory for why this only shows up in WF 1000.0. Still, I suspect that the crash may be related to the thread local storage introduced in #29561 (cmssw/MagneticField/VolumeBasedEngine/src/MagGeometry.cc, lines 23 to 30 in b9ea5aa).
In valgrind I can see that the TLS is getting allocated in
We could have a race condition if the TLS initialization modifies the list that I reviewed the discussion of that PR, and its predecessor, and was surprised that no mention was made of Is there a reason that wasn't considered? Could we switch the |
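For readers unfamiliar with the pattern under discussion, here is a minimal sketch of a per-thread lookup cache built on `thread_local`. It is not the code from MagGeometry.cc: `Volume`, `Geometry`, and `findVolume` below are hypothetical stand-ins, and the one-dimensional intervals are an assumption made only to keep the example short. It shows only the caching idea, not the suspected race in TLS initialization.

```cpp
#include <utility>
#include <vector>

// Hypothetical stand-in for a field volume: a half-open interval [lo, hi).
struct Volume {
  double lo;
  double hi;
  bool contains(double x) const { return x >= lo && x < hi; }
};

// Hypothetical geometry class; assumed immutable after construction so that
// concurrent const lookups never touch shared mutable state.
class Geometry {
public:
  explicit Geometry(std::vector<Volume> volumes) : volumes_(std::move(volumes)) {}

  // Returns the volume containing x, or nullptr if no volume contains it.
  const Volume* findVolume(double x) const {
    // Each thread gets its own copy of these two pointers. The slots are
    // shared by all Geometry instances on that thread, so the owner is
    // remembered as well to avoid returning another instance's volume.
    thread_local const Geometry* cachedOwner = nullptr;
    thread_local const Volume* cachedVolume = nullptr;

    // Fast path: the last volume found by this thread still matches.
    if (cachedOwner == this && cachedVolume != nullptr && cachedVolume->contains(x)) {
      return cachedVolume;
    }

    // Slow path: linear search, then remember the result for this thread.
    for (const Volume& v : volumes_) {
      if (v.contains(x)) {
        cachedOwner = this;
        cachedVolume = &v;
        return &v;
      }
    }
    return nullptr;
  }

private:
  std::vector<Volume> volumes_;
};
```

Because each thread only reads and writes its own cache slots, the lookup itself needs no locking. Whatever the runtime does to allocate the thread-local storage on first use in a thread happens outside this code, which is the part the comment above is concerned with.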
I suppose the |
There is another
|
There is another
|
I can run the 4.17 example based on @makortel recipe but it does not (yet?) crash for me. I will try with valgrind and try 1000.0 too. |
I can reproduce the problem (i.e. valgrind told me it was misbehaving even though it was not crashing). I am investigating. |
#32153 has some ASAN traces that might be related to the crash in |
Note that we have reverted root to the previously working commit (cms-sw/cmsdist#6430 plus 3 commits needed for DD4Hep). This means only the ROOT622 IBs are now using the latest changes from the root 6.22 patches branch. |
In ASAN IB, we see the following crash (workflow 4.17 step 5)
|
@smuzaffar Can you try out root-project/root#6873 ? It seems to have fixed the problem in CMSSW and is almost ready to merge (the failures in Monday's last run are either unrelated or need to have a related commit in the test suite reverted). |
@pcanal , testing it via cms-sw/root#147 PR now |
@smuzaffar can we also add root-project/root#6850 as a fix for the |
@dan131riley , root-project/root#6850 is for root master branch. Is there any for root v6.22 branch? |
That's root-project/root#6877 |
Thanks @Axel-Naumann , I am testing it via cms-sw/root#148 now |
Thanks. What do your IBs say? |
Ping ^ - unless you're off, Shahzad - then I'll just tag tomorrow without hearing back from you. |
@Axel-Naumann , things look much better now. We have had a couple of IBs with the latest root v6.22 changes (including cms-sw/root#147 and cms-sw/root#148). PR tests and comparisons also look good (cms-sw/cmsdist#6416). Our special ROOT622 IBs have not shown any random errors yet. So from my side all is good. |
I have not seen strange things in ROOT622 IBs. |
yes since the integration of root v6.22.06 in 11.2.X and 11.3.X things are in much better state. |
We have not seen any of these problems for a while now, so maybe we could close the issue? @smuzaffar @Dr15Jones would you agree? |
yes agree |
+1 |
I’m fine with closing it. |
This issue is fully signed and ready to be closed. |