-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[14_0_X backport] Get AXOL1TL model version from menu as of utm v0.12.0 #44397
[14_0_X backport] Get AXOL1TL model version from menu as of utm v0.12.0 #44397
Conversation
A new Pull Request was created by @quinnanm for CMSSW_14_0_X. It involves the following packages:
@mmusich, @perrotta, @aloeliger, @cmsbuild, @epalencia, @consuegs, @Martin-Grunewald, @saumyaphor4252 can you please review it and eventually sign? Thanks. cms-bot commands are listed here
|
cms-bot internal usage |
Pull request #44397 was updated. @Martin-Grunewald, @consuegs, @aloeliger, @perrotta, @epalencia, @mmusich, @cmsbuild, @saumyaphor4252 can you please check and sign again. |
test parameters: |
@cmsbuild, please test |
backport of #44054 |
backport of #44054 |
-1 Failed Tests: RelVals RelVals-INPUT RelVals
Expand to see more relval errors ...RelVals-INPUT |
Ok, this syntax is not correct. |
Is there a way to see the actual counts? Not the differences. My suspicion is that the AXO results in 14_1 are off due the issue @quinnanm described above. The AXO score in 14_1 is huge and therefore might fire most events. We are investigating if some upstream changes might have caused AXO scores change when moving from 14_0 to 14_1. |
For the record:
If you click on the test results you can click on the workflow of interest and then you will have several JSON files. For each trigger path you will have the number of counts for the baseline and for the PR.
the reference is the last IB for the corresponding cycle, it's actually written in the summary if you click e.g. here you can see that this was tested against CMSSW_14_0_X_2024-03-15-2300 |
Thanks @mmusich for the pointers.
See the code here: https://gist.github.com/artlbv/f82845607b5fe18c9e35c58502874cc5 And these are the results:
So as I expected above, in 14_1 we did not see any differences in the AXO results since ALL events were firing even in the baseline. |
To summarise for @perrotta @mmusich @Martin-Grunewald @missirol @eyigitba @slaurila @caruta @aloeliger @epalencia @quinnanm et al: IMO in 14_0 everything is fine with AXO, including this PR. These corrected AXO thresholds also correspond to those used in the HLT timing and rate studies in https://its.cern.ch/jira/browse/CMSHLT-3049 by @jpearkes -> which were signed off by TSG. Would this be enough to justify keeping AXO in for the first HLT menu? As for 14_1 we ought to debug this within L1, but given this release will not be built until August, I imagine this has another timeline. |
thanks for the detailed analysis. I tend to agree, as already reported in #44397 (comment).
hmm, no.
I am removing the hold as given the explanations provided I don't have reason to think that the AXO triggers produce irreproducibiities in the trigger results, but I am not going to sign off until the deployment strategy is clarified within TSG: the new 2024 GRun menu v1.0 was already delivered and handed off to TSG/FOG for deployment (based on CMSSW_14_0_2 without this PR and
I opened #44435 to track progress on that. |
unhold |
@artlbv, @quinnanm, @mmusich I'm catching up after other work yesterday afternoon. Melissa has seen that removing the CICADA emulator changes AXO's answers. The CICADA team (myself actually) caught this one earlier this week looking at CICADA firmware/emulator matching. We could run multiple CICADA emulators at the same time and get different answers from a model than we would running only one emulator. I couldn't figure it out and reverted to only doing one at a time because that's how the emulator works anyways and I suspected it was some strange internal CICADA version that could be sorted out at some point later. What I suspect now now is that this issue is fundamentally symbol collision. Strictly, there isn't anything wrong with CICADA or AXO, the issue is that they share c++ code/symbols that are not specific to themselves in the dynamic linking case used by the emulator technique. My theory on what is happening is this: n 14_1 we updated the CICADA model to 1.1.1. One of the layers of CICADA 1.1.1 is weights w2, seen here: https://github.com/cms-hls4ml/CICADA/blob/2baca92cc3f6041e98d43c7391b9e7eba6ed249a/CICADA_v1p1p1/cicada.cpp#L36 AXO, by extension, uses w2 (and b2) as well: https://github.com/cms-hls4ml/AXOL1TL/blob/f21d72bc08ab6e7a86292db9ff0637532a311c2c/AXOL1TL_v3/NN/GTADModel_v3.cpp#L65 https://github.com/cms-hls4ml/AXOL1TL/blob/f21d72bc08ab6e7a86292db9ff0637532a311c2c/AXOL1TL_v3/NN/weights.h#L5 (notably, they are also sized such that AXO could comfortably use CICADA's weights). These weights are defined globally for both models. As the CMSSW job runs, CICADA gets loaded and created, and these weights loaded into the symbol table here by the dlopen here: https://github.com/cms-hls4ml/hls4mlEmulatorExtras/blob/17790de8f2f2892dfd8ff20fead8eedd3cf59b49/src/hls4ml/emulator.cc#L21. Later, the AXO model gets created, and the load also attempted. I suspect that behind the scenes, dlopen sees that it would be loading things into the symbol table that are already defined, and quietly drops it in favor of things that are already defined. In essence, when we upgraded CICADA to 1.1.1, by no real fault of our own, we ended up accidentally switching AXO to CICADA's weights. I suspect a quick and dirty solution would be to do to CICADA what AXOL1TL has accidentally been doing (and we've all been asking AXO to stop ironically), which is create and destroy itself in one swoop, with no real persistence for model objects beyond creating a result. In the longer run, the usage of this emulator technique needs to be fixed or changed. |
By "in the longer run" you mean "before this can be run reliably within the L1T emulator running at HLT and in the online DQM", right ? |
@aloeliger |
code-checks |
-code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44397/39537 ERROR: Unable to merge PR. See log https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44397/39537/cms-checkout-topic.log |
-hlt
|
Hmm, I suppose that means we need to revert from v1_0_1 of the L1T menu back to v1_0_0 of the LT1 menu in the GTs in 14_0 and 14_1? @perrotta |
Yes, this was indeed the outcome of the JointOps meeting yesterday |
-alca |
This can be closed as superseded by #44510 |
@cmsbuild, please close |
PR description:
This is a backport of #44054 to update the AXOL1TL emulator to be consistent with new utm grammar v0_12_0 and menu as of L1Menu_Collisions2024_v1_0_1_xml. Changes are identical to 44054 See: https://cms-talk.web.cern.ch/t/update-of-the-2024-l1-menu-tag-l1menu-collisions2024-v1-0-1/36759.
PR validation:
This must be tested and merged together with cms-sw/cmsdist#9065 and upcoming backport of #44389.
Passes basic code formats and checks.
backport of 44054 to 14_0_X