Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[14_0_X backport] Get AXOL1TL model version from menu as of utm v0.12.0 #44397

Closed

Conversation

quinnanm
Copy link
Contributor

PR description:

This is a backport of #44054 to update the AXOL1TL emulator to be consistent with new utm grammar v0_12_0 and menu as of L1Menu_Collisions2024_v1_0_1_xml. Changes are identical to 44054 See: https://cms-talk.web.cern.ch/t/update-of-the-2024-l1-menu-tag-l1menu-collisions2024-v1-0-1/36759.

PR validation:

This must be tested and merged together with cms-sw/cmsdist#9065 and upcoming backport of #44389.

Passes basic code formats and checks.

backport of 44054 to 14_0_X

@cmsbuild
Copy link
Contributor

cmsbuild commented Mar 14, 2024

A new Pull Request was created by @quinnanm for CMSSW_14_0_X.

It involves the following packages:

  • Configuration/AlCa (alca)
  • Configuration/HLT (hlt)
  • HLTrigger/Configuration (hlt)
  • L1Trigger/L1TGlobal (l1)

@mmusich, @perrotta, @aloeliger, @cmsbuild, @epalencia, @consuegs, @Martin-Grunewald, @saumyaphor4252 can you please review it and eventually sign? Thanks.
@missirol, @mmusich, @rsreds, @silviodonato, @Martin-Grunewald, @tocheng, @yuanchao, @fabiocos this is something you requested to watch as well.
@rappoccio, @antoniovilela, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@cmsbuild
Copy link
Contributor

cmsbuild commented Mar 14, 2024

cms-bot internal usage

@quinnanm quinnanm marked this pull request as draft March 14, 2024 02:26
@cmsbuild
Copy link
Contributor

Pull request #44397 was updated. @Martin-Grunewald, @consuegs, @aloeliger, @perrotta, @epalencia, @mmusich, @cmsbuild, @saumyaphor4252 can you please check and sign again.

@Martin-Grunewald
Copy link
Contributor

@perrotta
Now that we have this L1T backprt PR, we'd also need the 14_0 backport of your #44389 .

@perrotta
Copy link
Contributor

@perrotta Now that we have this L1T backprt PR, we'd also need the 14_0 backport of your #44389 .

Here it is: #44398

@mmusich
Copy link
Contributor

mmusich commented Mar 14, 2024

test parameters:

@mmusich
Copy link
Contributor

mmusich commented Mar 14, 2024

@cmsbuild, please test

@perrotta
Copy link
Contributor

perrotta commented Mar 14, 2024

backport of #44054

@mmusich
Copy link
Contributor

mmusich commented Mar 14, 2024

backport of #44054

@cmsbuild
Copy link
Contributor

-1

Failed Tests: RelVals RelVals-INPUT
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-220365/38122/summary.html
COMMIT: 4c59bdc
CMSSW: CMSSW_14_0_X_2024-03-13-2300/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/44397/38122/install.sh to create a dev area with all the needed externals and cmssw changes.

RelVals

----- Begin Fatal Exception 14-Mar-2024 10:37:49 CET-----------------------
An exception of category 'ConditionsError' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 1 stream: 0
   [1] Running path 'HLTAnalyzerEndpath'
   [2] Prefetching for module L1TRawToDigi/'hltGtStage2Digis'
   [3] Prefetching for module RawDataCollectorByLabel/'rawDataCollector'
   [4] Prefetching for module L1TDigiToRaw/'gtStage2Raw'
   [5] Calling method for module L1TGlobalProducer/'simGtStage2Digis'
Exception Message:
 Error L1T menu version L1Menu_Collisions2024_v1_0_0 is unsupported due to incompatible utm grammar, please use L1Menu_Collisions2024_v1_0_1 or later
----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 14-Mar-2024 10:38:18 CET-----------------------
An exception of category 'ConditionsError' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 1 stream: 0
   [1] Running path 'HLTAnalyzerEndpath'
   [2] Prefetching for module L1TRawToDigi/'hltGtStage2Digis'
   [3] Prefetching for module RawDataCollectorByLabel/'rawDataCollector'
   [4] Prefetching for module L1TDigiToRaw/'gtStage2Raw'
   [5] Calling method for module L1TGlobalProducer/'simGtStage2Digis'
Exception Message:
 Error L1T menu version L1Menu_Collisions2024_v1_0_0 is unsupported due to incompatible utm grammar, please use L1Menu_Collisions2024_v1_0_1 or later
----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 14-Mar-2024 10:38:30 CET-----------------------
An exception of category 'ConditionsError' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 1 stream: 0
   [1] Running path 'HLTAnalyzerEndpath'
   [2] Prefetching for module L1TRawToDigi/'hltGtStage2Digis'
   [3] Prefetching for module RawDataCollectorByLabel/'rawDataCollector'
   [4] Prefetching for module L1TDigiToRaw/'gtStage2Raw'
   [5] Calling method for module L1TGlobalProducer/'simGtStage2Digis'
Exception Message:
 Error L1T menu version L1Menu_Collisions2024_v1_0_0 is unsupported due to incompatible utm grammar, please use L1Menu_Collisions2024_v1_0_1 or later
----- End Fatal Exception -------------------------------------------------
Expand to see more relval errors ...

RelVals-INPUT

  • 12834.012834.0_TTbar_14TeV+2024/step2_TTbar_14TeV+2024.log
  • 12861.012861.0_NuGun+2024/step2_NuGun+2024.log
  • 12846.012846.0_ZEE_14+2024/step2_ZEE_14+2024.log
Expand to see more relval errors ...

@mmusich
Copy link
Contributor

mmusich commented Mar 14, 2024

@mmusich
Copy link
Contributor

mmusich commented Mar 14, 2024

@artlbv
Copy link
Contributor

artlbv commented Mar 16, 2024

On the other hand it's not clear to me why in CMSSW_14_1_X instead we don't observe changes in the trigger results

Is there a way to see the actual counts? Not the differences.
What is actually the reference for these comparisons?

My suspicion is that the AXO results in 14_1 are off due the issue @quinnanm described above. The AXO score in 14_1 is huge and therefore might fire most events.
If that was also the case for the reference in 14_1, then it would explain why changing the thresholds in the L1 Menu had no effect on the trigger results.

We are investigating if some upstream changes might have caused AXO scores change when moving from 14_0 to 14_1.

@mmusich
Copy link
Contributor

mmusich commented Mar 16, 2024

For the record:

Is there a way to see the actual counts? Not the differences.

If you click on the test results you can click on the workflow of interest and then you will have several JSON files. For each trigger path you will have the number of counts for the baseline and for the PR.

What is actually the reference for these comparisons?

the reference is the last IB for the corresponding cycle, it's actually written in the summary if you click e.g. here you can see that this was tested against CMSSW_14_0_X_2024-03-15-2300

Screenshot from 2024-03-16 12-27-05

@artlbv
Copy link
Contributor

artlbv commented Mar 16, 2024

Thanks @mmusich for the pointers.
I now took the trigger test result for the workflow with discrepancies 141.044 for both PRs:

## 14_0 Backport results: 
## JSON https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_14_0_X_2024-03-15-2300+220365/61714/triggerResults/141.044_RunJetMET2023D/reHLT.json
## from tests https://github.com/cms-sw/cmssw/pull/44397#issuecomment-2001893129 
## "reHLT_140_backport.json"

## 14_1 PR results:
## JSON https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_14_1_X_2024-03-13-1100+5aa5c5/61657/triggerResults/141.044_RunJetMET2023D/reHLT.json
## From comment https://github.com/cms-sw/cmssw/pull/44054#issuecomment-1996266443
# fname = "reHLT_141.json"

See the code here: https://gist.github.com/artlbv/f82845607b5fe18c9e35c58502874cc5

And these are the results:

index 140 backport_old 140 backport_new 141 PR_old 141 PR_new
DST_PFScouting_AXONominal_v1 2 0 100 100
DST_PFScouting_AXOTight_v1 1 0 100 100
HLT_L1AXOVTight_v1 0 0 100 100
Dataset_ScoutingPFMonitor 1 1 2 2
Dataset_ScoutingPFRun3 75 75 100 100

So as I expected above, in 14_1 we did not see any differences in the AXO results since ALL events were firing even in the baseline.

@artlbv
Copy link
Contributor

artlbv commented Mar 16, 2024

To summarise for @perrotta @mmusich @Martin-Grunewald @missirol @eyigitba @slaurila @caruta @aloeliger @epalencia @quinnanm et al:

IMO in 14_0 everything is fine with AXO, including this PR.
The only reason this PR changes the trigger results is due to the update of the L1 Menu where only the AXO seed thresholds were modified (corrected according to the JIRA. The rest of the L1 menu is unaffected as the results show.

These corrected AXO thresholds also correspond to those used in the HLT timing and rate studies in https://its.cern.ch/jira/browse/CMSHLT-3049 by @jpearkes -> which were signed off by TSG.

Would this be enough to justify keeping AXO in for the first HLT menu?

As for 14_1 we ought to debug this within L1, but given this release will not be built until August, I imagine this has another timeline.

@mmusich
Copy link
Contributor

mmusich commented Mar 16, 2024

@artlbv

IMO in 14_0 everything is fine with AXO, including this PR.

thanks for the detailed analysis. I tend to agree, as already reported in #44397 (comment).

These corrected AXO thresholds also correspond to those used in the HLT timing and rate studies in https://its.cern.ch/jira/browse/CMSHLT-3049 by @jpearkes -> which were signed off by TSG.

hmm, no.
TSG signed-off based on the hlt integration results (see attachment at https://its.cern.ch/jira/secure/attachment/554837/test.log) which as the log clearly reports were based on L1Menu_Collisions2024_v0_0_0_xml not on a different menu using different thresholds. Actually I myself learned that L1Menu_Collisions2024_v1_0_1_xml contained updated thresholds only thanks to this message from @caruta #44397 (comment). I think there has been a serious lack of communication between LT1 DPG and TSG in this respect.

Would this be enough to justify keeping AXO in for the first HLT menu?

I am removing the hold as given the explanations provided I don't have reason to think that the AXO triggers produce irreproducibiities in the trigger results, but I am not going to sign off until the deployment strategy is clarified within TSG: the new 2024 GRun menu v1.0 was already delivered and handed off to TSG/FOG for deployment (based on CMSSW_14_0_2 without this PR and L1Menu_Collisions2024_v1_0_0_xml).

As for 14_1 we ought to debug this within L1, but given this release will not be built until August, I imagine this has another timeline.

I opened #44435 to track progress on that.

@mmusich
Copy link
Contributor

mmusich commented Mar 16, 2024

unhold

@cmsbuild cmsbuild removed the hold label Mar 16, 2024
@aloeliger
Copy link
Contributor

@artlbv, @quinnanm, @mmusich I'm catching up after other work yesterday afternoon.

Melissa has seen that removing the CICADA emulator changes AXO's answers.

The CICADA team (myself actually) caught this one earlier this week looking at CICADA firmware/emulator matching. We could run multiple CICADA emulators at the same time and get different answers from a model than we would running only one emulator. I couldn't figure it out and reverted to only doing one at a time because that's how the emulator works anyways and I suspected it was some strange internal CICADA version that could be sorted out at some point later.

What I suspect now now is that this issue is fundamentally symbol collision. Strictly, there isn't anything wrong with CICADA or AXO, the issue is that they share c++ code/symbols that are not specific to themselves in the dynamic linking case used by the emulator technique.

My theory on what is happening is this:

n 14_1 we updated the CICADA model to 1.1.1. One of the layers of CICADA 1.1.1 is weights w2, seen here: https://github.com/cms-hls4ml/CICADA/blob/2baca92cc3f6041e98d43c7391b9e7eba6ed249a/CICADA_v1p1p1/cicada.cpp#L36
https://github.com/cms-hls4ml/CICADA/blob/2baca92cc3f6041e98d43c7391b9e7eba6ed249a/CICADA_v1p1p1/weights/w2.h#L12. Note that before, in model 2.1, w2 was not in use: https://github.com/cms-hls4ml/CICADA/blob/2baca92cc3f6041e98d43c7391b9e7eba6ed249a/CICADA_v2p1/myproject.cpp#L57

AXO, by extension, uses w2 (and b2) as well: https://github.com/cms-hls4ml/AXOL1TL/blob/f21d72bc08ab6e7a86292db9ff0637532a311c2c/AXOL1TL_v3/NN/GTADModel_v3.cpp#L65 https://github.com/cms-hls4ml/AXOL1TL/blob/f21d72bc08ab6e7a86292db9ff0637532a311c2c/AXOL1TL_v3/NN/weights.h#L5 (notably, they are also sized such that AXO could comfortably use CICADA's weights).

These weights are defined globally for both models. As the CMSSW job runs, CICADA gets loaded and created, and these weights loaded into the symbol table here by the dlopen here: https://github.com/cms-hls4ml/hls4mlEmulatorExtras/blob/17790de8f2f2892dfd8ff20fead8eedd3cf59b49/src/hls4ml/emulator.cc#L21. Later, the AXO model gets created, and the load also attempted. I suspect that behind the scenes, dlopen sees that it would be loading things into the symbol table that are already defined, and quietly drops it in favor of things that are already defined.

In essence, when we upgraded CICADA to 1.1.1, by no real fault of our own, we ended up accidentally switching AXO to CICADA's weights.

I suspect a quick and dirty solution would be to do to CICADA what AXOL1TL has accidentally been doing (and we've all been asking AXO to stop ironically), which is create and destroy itself in one swoop, with no real persistence for model objects beyond creating a result.

In the longer run, the usage of this emulator technique needs to be fixed or changed.

@fwyzard
Copy link
Contributor

fwyzard commented Mar 16, 2024

In the longer run, the usage of this emulator technique needs to be fixed or changed.

By "in the longer run" you mean "before this can be run reliably within the L1T emulator running at HLT and in the online DQM", right ?

@Martin-Grunewald
Copy link
Contributor

@aloeliger
Could you please comment on the plans and timeline for this PR? IOW, imminent or delayed, etc.?

@Martin-Grunewald
Copy link
Contributor

code-checks

@cmsbuild
Copy link
Contributor

@mmusich
Copy link
Contributor

mmusich commented Mar 19, 2024

-hlt

@Martin-Grunewald
Copy link
Contributor

Hmm, I suppose that means we need to revert from v1_0_1 of the L1T menu back to v1_0_0 of the LT1 menu in the GTs in 14_0 and 14_1? @perrotta

@perrotta
Copy link
Contributor

Hmm, I suppose that means we need to revert from v1_0_1 of the L1T menu back to v1_0_0 of the LT1 menu in the GTs in 14_0 and 14_1? @perrotta

Yes, this was indeed the outcome of the JointOps meeting yesterday

@perrotta
Copy link
Contributor

-alca
@quinnanm this can probably get closed, given what summarized in #44397 (comment)

@artlbv
Copy link
Contributor

artlbv commented Mar 23, 2024

This can be closed as superseded by #44510

@mmusich
Copy link
Contributor

mmusich commented Mar 25, 2024

@cmsbuild, please close

@cmsbuild cmsbuild closed this Mar 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants