Segmentation violation in RelVal wf 25234.911 step 2 #42470
Comments
A new Issue was created by @aandvalenzuela Andrea Valenzuela. @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign simulation,reconstruction,geometry,upgrade FYI @cms-sw/hgcal-dpg-l2 |
New categories assigned: geometry,upgrade,reconstruction,simulation @mdhildreth,@mdhildreth,@AdrianoDee,@mandrenguyen,@Dr15Jones,@clacaputo,@srimanob,@makortel,@bsunanda,@civanch,@civanch you have been requested to review this Pull request/Issue and eventually sign? Thanks |
To clarify: the issue itself is not in Geant4 but may depend on hits, which come from step1 with Geant4. |
@aandvalenzuela , are externals recompiled with TCMalloc? |
Externals and CMSSW packages are compiled independently of the allocator. The glibc/jemalloc/tcmalloc is loaded at runtime by |
Or maybe more precisely, the allocator visible at the compilation and static linking time is the glibc one, that is overridden at run time by |
Valgrind may be used to identify the root cause of this misbehavior. |
BTW: where those Eigen threads |
At least Tensorflow likes to start the Eigen threadpool, even if it doesn't use them (IIRC we didn't find a way to completely avoid the Eigen threadpool, although we probably have not checked again "recently"). |
@dan131riley reported a similar crash in #42669 in CMSSW_13_3_ROOT628_X_2023-08-27-2300. @aandvalenzuela Since the crash is seen in multiple IB flavors, could you remove the "[GEANT4]" from the issue title? |
done (@aandvalenzuela is away for a few weeks) |
Similar crash happened again in |
Occurred again in CMSSW_14_0_SKYLAKEAVX512_X_2024-01-25-2300 |
New occurrence in CMSSW_14_1_X_2024-04-01-2300 (slc7_amd64_gcc12). See full log. |
New occurrence in CMSSW_14_1_X_2024-04-15-2300 for el8_amd64_gcc12 link |
Hi, |
I just skimmed this quickly and I am not sure this is actually the problem causing the seg fault but I noticed the following in TICLLayerTileProducer. There is a choice made using the boolean doNose_. We always call consumes for both choices and create the output product for both choices. That seems unnecessary. Those could be executed only for the selected choice. Independent of that, if https://cmssdt.cern.ch/lxr/source/RecoHGCal/TICL/plugins/TICLLayerTileProducer.cc |
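The idea of declaring inputs and outputs only for the selected choice can be sketched with a small mock. This is not the CMSSW framework API and not the actual TICLLayerTileProducer code: `MockProducer`, its private `consumes`/`produces` helpers, and all labels are hypothetical stand-ins used only to show the branching.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a framework module; these are NOT the CMSSW classes.
class MockProducer {
public:
  explicit MockProducer(bool doNose) : doNose_(doNose) {
    // Declare only the input and output that the selected branch will use,
    // instead of registering both unconditionally.
    if (doNose_) {
      consumes("noseClusters");   // hypothetical label
      produces("noseTiles");      // hypothetical label
    } else {
      consumes("hgcalClusters");  // hypothetical label
      produces("hgcalTiles");     // hypothetical label
    }
  }

  const std::vector<std::string>& inputs() const { return inputs_; }
  const std::vector<std::string>& outputs() const { return outputs_; }

private:
  // Record a declared input or output; a real framework would return a token.
  void consumes(const std::string& label) { inputs_.push_back(label); }
  void produces(const std::string& label) { outputs_.push_back(label); }

  bool doNose_;
  std::vector<std::string> inputs_;
  std::vector<std::string> outputs_;
};
```

With this shape, a module configured for one choice never holds a handle to the other choice's data, so no code path can touch it by accident.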
In the default build LogDebug is disabled, but in the DBG IBs it is enabled. But in any case the issues you brought up should be addressed (@cms-sw/hgcal-dpg-l2) |
I'm going to go ahead and submit a PR with a fix for this shortly. Shouldn't take long to implement. Maybe the failures will go away and maybe not, but at least we can eliminate this as a possible source of the problem. I saw another failure today. If someone else has already submitted or is about to submit a fix for this, please let me know... |
I just submitted PR #44843 to fix the problem I mentioned above. I didn't verify that this is the source of the problem. We should monitor IBs to see if the problem recurs after this PR is merged. The argument against it being the problem is that the optimizer should entirely remove LogDebug lines during compilation. Then the problematic line would not be executed. We believe LogDebug was not enabled in the IBs. On the other hand, the arguments passed to the LogDebug input operator ( |
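Whether the operands of a disabled logging statement are still evaluated depends on how the logging macro is written, which is a general C++ point and independent of the optimizer. The sketch below is not the CMSSW `LogDebug` implementation: `NullSink`, `LOG_PLAIN`, `LOG_GUARDED`, and `sideEffect` are hypothetical names used to contrast a plain no-op sink with a conditional-operator guard.

```cpp
// Hypothetical no-op sink, standing in for a disabled debug logger.
struct NullSink {
  template <typename T>
  NullSink operator<<(const T&) const { return NullSink{}; }  // discard operand
};

static int g_calls = 0;

// Function with an observable side effect, so evaluation can be counted.
int sideEffect() {
  ++g_calls;
  return 42;
}

// Variant 1: the sink discards its operands, but they are still evaluated.
#define LOG_PLAIN NullSink()

// Variant 2: `<<` binds tighter than `?:`, so everything streamed after the
// macro lands in the never-taken third operand and is not evaluated.
#define LOG_GUARDED true ? NullSink() : NullSink()

int callsWithPlainSink() {
  g_calls = 0;
  LOG_PLAIN << "value " << sideEffect();
  return g_calls;  // 1: the operand expression ran
}

int callsWithGuardedSink() {
  g_calls = 0;
  LOG_GUARDED << "value " << sideEffect();
  return g_calls;  // 0: the operand expression was skipped
}
```

In the plain variant, an operand expression that dereferences an invalid handle would still fault even though nothing is logged; in the guarded variant it would never run.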
For the record, the latest failure was in the default production architecture for CMSSW_14_1 2024-04-24-1100. |
The PR with the change that might fix this went into CMSSW_14_1_X_2024-04-29-1100. If we see the problem again after that, then we'll know that was not the cause and have more work to do on this. |
@wddgit this failure happened again in CMSSW_14_1_GEANT4_X_2024-05-12-2300:
Full stack trace
|
New occurrence of this failure at CMSSW_14_1_CUDART_X_2024-05-15-2300 IBs:
|
New occurrence in CMSSW_14_0_X_2024-06-20-2300 IB
|
Hi @Dr15Jones |
Hello,
RelVal wf 25234.911 is failing due to a segmentation violation in GEANT4 IBs (CMSSW_13_3_GEANT4_X_2023-08-02-2300) with the following stacktrace:
I cannot reproduce the failure locally, but I am reporting it to keep track of the failures, especially since this failure has appeared as of the move to tcmalloc.
Thanks,
Andrea.