[ASAN] Relval failing with asan library load issue #43823

Closed
smuzaffar opened this issue Jan 31, 2024 · 13 comments · Fixed by #43826

Comments

@smuzaffar
Contributor

Since CMSSW_14_0_ASAN_X_2024-01-24-2300 we see many RelVals failing with error [a]. This normally happens when an executable that is not explicitly linked against libasan.so loads a shared library that is linked with libasan. I noticed that the failing RelVals all pass -s DIGI:pdigi_valid,L1TrackTrigger,L1,DIGI2RAW,HLT:@relval2026 (i.e. HLT:@relval2026) to cmsDriver. For example, for wf 24834.0 step2 (which fails) the cmsDriver command is

cmsDriver.py step2  -s DIGI:pdigi_valid,L1TrackTrigger,L1,DIGI2RAW,HLT:@relval2026 --conditions auto:phase2_realistic_T25 --datatier GEN-SIM-DIGI-RAW -n 10 --eventcontent FEVTDEBUGHLT --geometry Extended2026D98 --era Phase2C17I13M9  --customise Validation/Performance/TimeMemorySummary.customiseWithTimeMemorySummary --prefix 'python2 /data/cmsbld/jenkins/workspace/ib-run-relvals/cms-bot/monitor_workflow.py timeout --signal SIGTERM 14400 '  --filein filelist:step1_dasquery.log --fileout file:step2.root  --suffix "-j JobReport2.xml "  --nThreads 4

while for workflow 1.0 the step2 command

cmsDriver.py step2  -s DIGI,L1,DIGI2RAW,HLT:@fake --datatier GEN-SIM-RAW --eventcontent RAWSIM --conditions auto:run1_mc  --customise Validation/Performance/TimeMemorySummary.customiseWithTimeMemorySummary --prefix 'python2 /data/cmsbld/jenkins/workspace/ib-run-relvals/cms-bot/monitor_workflow.py timeout --signal SIGTERM 14400 '  -n 100  --filein  file:step1.root  --fileout file:step2.root  --suffix "-j JobReport2.xml "  --nThreads 4 

works. Changing HLT:@fake to HLT:@relval2026 also causes wf 1.0 step2 to fail with the same error [a].

CMSSW_14_0_ASAN_X_2024-01-22-2300...CMSSW_14_0_ASAN_X_2024-01-24-2300 are the changes after which we started seeing these failures.

Any idea what HLT:@relval2026 does? Does it run any external process which tries to load cmssw libs?

[a]
https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_amd64_gcc12/CMSSW_14_0_ASAN_X_2024-01-29-2300/pyRelValMatrixLogs/run/24834.0_TTbar_14TeV+2026D98/step2_TTbar_14TeV+2026D98.log#/

==2720140==ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.
@cmsbuild
Contributor

cmsbuild commented Jan 31, 2024

cms-bot internal usage

@cmsbuild
Contributor

A new Issue was created by @smuzaffar Malik Shahzad Muzaffar.

@sextonkennedy, @Dr15Jones, @makortel, @rappoccio, @smuzaffar, @antoniovilela can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@smuzaffar
Contributor Author

By the way, one option to avoid this error is to include verify_asan_link_order=0 in the ASAN_OPTIONS environment variable.
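
As a rough illustration of that option, the sketch below appends a setting to ASAN_OPTIONS without discarding whatever is already there. It assumes the usual colon-separated `key=value` form of ASAN_OPTIONS; the helper name `with_asan_option` is hypothetical and not part of cms-bot or CMSSW.

```python
import os


def with_asan_option(env, option):
    """Return a copy of env with `option` (key=value) set in ASAN_OPTIONS.

    Hypothetical helper for illustration; assumes colon-separated options.
    """
    new_env = dict(env)
    parts = [p for p in new_env.get("ASAN_OPTIONS", "").split(":") if p]
    key = option.split("=", 1)[0]
    # Drop any earlier setting of the same key so the new value wins.
    parts = [p for p in parts if p.split("=", 1)[0] != key]
    parts.append(option)
    new_env["ASAN_OPTIONS"] = ":".join(parts)
    return new_env


# Build an environment with the link-order check disabled.
relaxed_env = with_asan_option(os.environ, "verify_asan_link_order=0")
print(relaxed_env["ASAN_OPTIONS"])
```

The resulting dictionary could then be passed as `env=` to `subprocess.run` when launching the failing command. Note that this only silences the check; it does not change which process loads libasan late.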

@mmusich
Contributor

mmusich commented Jan 31, 2024

Any idea what HLT:@relval2026 does?

It runs a real (simplified) HLT menu for the Phase-2 upgrade detector. HLT:@fake doesn't run any reconstruction (it just passes the L1 bits through).

Does it run any external process which tries to load cmssw libs?

can you be more specific?

@mmusich
Contributor

mmusich commented Jan 31, 2024

@cms-sw/hlt-l2 @rovere @SohamBhattacharya FYI

@smuzaffar
Contributor Author

smuzaffar commented Jan 31, 2024

Any idea what HLT:@relval2026 does?

It runs a real (simplified) HLT menu for the Phase-2 upgrade detector. HLT:@fake doesn't run any reconstruction (it just passes the L1 bits through).

Does it run any external process which tries to load cmssw libs?

can you be more specific?

I mean: does it run any binary that is not part of CMSSW itself (i.e. one that was not built with explicit linking to libasan.so) and that then tries to dlopen cmssw libs? One thing I can think of is starting an external Python script which uses the ROOT Python interface.

@smuzaffar
Contributor Author

E.g. running the following in ASAN fails with the same error:

python3 -c 'import ROOT;print(ROOT.reco.GsfElectron.mvaPlaceholder)'
==27851==ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.

So if HLT:@relval2026 is running something which uses the ROOT Python interface, then that can trigger this error.

I would suggest that we enable verify_asan_link_order=0 so that asan does not complain about libasan load order.
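
To check whether a given executable (python3, for instance) is itself linked against the ASan runtime, one can look at its direct DT_NEEDED entries. The sketch below parses the text output of `readelf -d`; the function names are hypothetical, it assumes binutils is installed, and it only sees direct dependencies, not libraries pulled in later via dlopen.

```python
import re
import subprocess

# Matches lines such as:
#  0x0000000000000001 (NEEDED)   Shared library: [libc.so.6]
NEEDED_RE = re.compile(r"\(NEEDED\)\s+Shared library: \[([^\]]+)\]")


def needed_libraries(readelf_output):
    """Extract DT_NEEDED library names from `readelf -d` text output."""
    return NEEDED_RE.findall(readelf_output)


def links_asan(readelf_output):
    """True if libasan is among the direct DT_NEEDED entries."""
    return any(name.startswith("libasan.so")
               for name in needed_libraries(readelf_output))


def readelf_dynamic(path):
    """Run `readelf -d` on a binary and return its stdout (needs binutils)."""
    result = subprocess.run(["readelf", "-d", path],
                            capture_output=True, text=True, check=True)
    return result.stdout
```

For example, `links_asan(readelf_dynamic(sys.executable))` returning False for the Python interpreter, while the same check returns True for a CMSSW library, would be consistent with the load-order error above.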

@makortel
Contributor

makortel commented Jan 31, 2024

I looked into what happens in the cmsDriver.py program with strace, and that revealed that python loading L1Trigger.Phase2L1GT.l1tGTScales caused libL1TriggerPhase2L1GT.so to be loaded.

The l1tGTScales module indeed explicitly loads the library

from libL1TriggerPhase2L1GT import L1GTScales as CppScales

and uses the CppScales in
l1tGTScales = CppScales(*[param.value() for param in scale_parameter.parameters_().values()])

Searching with git grep how the l1tGTScales object is used, I found no uses for it. Every other python module in CMSSW imports only the scale_parameter PSet. Commenting out the lines 1 and 22 made the cmsDriver command succeed.

@cms-sw/l1-l2 What is the purpose of the L1GTScales as CppScales in the python configuration?

This construct was added in #41808
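
To look for similar constructs in other configuration files, something along these lines could flag Python modules that import a compiled library directly. This is only a sketch: the helper name `find_lib_imports` is hypothetical, and treating any top-level module whose name starts with `lib` as a compiled library is a heuristic assumption that can produce false positives.

```python
import ast


def find_lib_imports(source, prefix="lib"):
    """Return sorted (lineno, module) pairs for absolute imports whose
    top-level module name starts with `prefix` (heuristic, see above)."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            if name.split(".")[0].startswith(prefix):
                hits.append((node.lineno, name))
    return sorted(hits)


example = (
    "from libL1TriggerPhase2L1GT import L1GTScales as CppScales\n"
    "import FWCore.ParameterSet.Config as cms\n"
)
print(find_lib_imports(example))
```

The source is only parsed, never executed, so the scan itself does not load any shared library and can be run outside an ASAN environment.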

@makortel
Contributor

assign l1

@cmsbuild
Contributor

New categories assigned: l1

@epalencia,@aloeliger you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor

I would suggest that we enable verify_asan_link_order=0 so that asan does not complain about libasan load order.

Given the cause, I'd prefer to keep it :)

@makortel
Contributor

Commenting out the lines 1 and 22 made the cmsDriver command succeed.

I opened #43826 to remove those lines.

@makortel
Contributor

makortel commented Feb 7, 2024

@cms-sw/l1-l2 What is the purpose of the L1GTScales as CppScales in the python configuration?

@cms-sw/l1-l2 Even if #43826 is fully signed, I think it would be useful (for the longer term) to understand what is/was the purpose of using L1GTScales in the python configuration.
