Problem in DQM Harvesting step with EgHLTOfflineClient #38970

Open
rvenditti opened this issue Aug 4, 2022 · 8 comments

@rvenditti

As a follow-up to the Express job killed at T0 for memory issues at the harvesting step in runs 356381 and 356615 (link), we found that the log file shows a problem in the HLT-Egamma client.
The message is:
%MSG-e HLTConfigProvider: EgHLTOfflineClient:egHLTOffDQMClient@beginRun 29-Jul-2022 10:57:14 CEST Run: 356381
Falling back to ProcessName-only init using ProcessName 'HLT' !
%MSG
%MSG-e HLTConfigProvider: EgHLTOfflineClient:egHLTOffDQMClient@beginRun 29-Jul-2022 10:57:14 CEST Run: 356381
Process name 'HLT' not found in registry!
%MSG

We believe that this could lead to the memory issue seen in the Express reconstruction. Can HLT DQM experts have a look?

@cmsbuild

cmsbuild commented Aug 4, 2022

A new Issue was created by @rvenditti .

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel

makortel commented Aug 4, 2022

assign dqm

@makortel

makortel commented Aug 4, 2022

FYI @cms-sw/hlt-l2 @cms-sw/egamma-pog-l2

@cmsbuild

cmsbuild commented Aug 4, 2022

New categories assigned: dqm

@jfernan2,@ahmad3213,@micsucmed,@rvenditti,@emanueleusai,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

@swagata87

I've seen the "Process name 'HLT' not found in registry!" message before [1]. As far as I am aware, it has been there for quite some time and probably has not been fixed yet because it seems to be a rather harmless error message (although it needs to be debugged and fixed at some point). I would be surprised if it caused a memory issue.

Btw, looking at the log files of the 2 runs, I see several other error messages. For example:

%MSG-e MergeFailure:  source 29-Jul-2022 10:16:10 CEST PostBeginProcessBlock
Found histograms with different axis limits or different labels 'ROCs hits multiplicity per event vs LS' not merged.
%MSG
%MSG
%MSG-e DQMGenericClient:  DQMGenericClient:HiJetClient@endRun  29-Jul-2022 10:58:33 CEST End Run: 356381
 DQMGenericClient::findAllSubdirectories ==> Missing folder HLT/HI !!!
%MSG
%MSG-e DQMCorrelationClient:   DQMCorrelationClient:pixelClusterVsLumiPXBarrel@endProcessBlock  29-Jul-2022 10:59:03 CEST post-events
MEs not found! HLT/Pixel/num_clusters_per_Lumisection_PXBarrel not found
%MSG
%MSG-e DQMGenericClient:   HLTMuonRefMethod:hltMuonRefEfficienciesMR@endProcessBlock  03-Aug-2022 08:44:14 CEST post-events
 DQMGenericClient::findAllSubdirectories ==> Missing folder HLT/Muon/MR !!!
%MSG

Could any of these trigger the memory issue? @rvenditti

[1] https://cms-talk.web.cern.ch/t/replay-for-testing-the-run-3-collisions-setup/10676
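
To compare the error messages of the two runs more systematically, the %MSG-e blocks could be tallied with a short script. This is only a minimal sketch: it assumes every block has the shape seen in the excerpts above (a "%MSG-e Category: context ..." header, one or more body lines, and a closing "%MSG" line). The function name tally_errors is made up for illustration.

```python
import re
from collections import Counter

# Header of an error-severity block, e.g.
# "%MSG-e HLTConfigProvider: EgHLTOfflineClient:egHLTOffDQMClient@beginRun 29-Jul-2022 ..."
# Assumption: the category never contains ':' or whitespace.
HEADER = re.compile(r"^%MSG-e\s+(?P<category>[^:\s]+):\s+(?P<context>\S+)")


def tally_errors(lines):
    """Count (category, context, first body line) triples across %MSG-e blocks."""
    counts = Counter()
    current = None  # (category, context) of the block being read, if any
    for raw in lines:
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            current = (match["category"], match["context"])
        elif line == "%MSG":
            current = None  # block terminator
        elif current is not None and line:
            counts[current + (line,)] += 1
            current = None  # only the first body line is used as the key
    return counts
```

Running it over the cmsRun stdout of each run (for instance tally_errors(open(path)) with the path to the log) and comparing the two counters would show which messages are common to both runs and which appear in only one.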

@rvenditti

rvenditti commented Aug 5, 2022

Hi @swagata87, thanks for the comment. Indeed, we have also asked the CTPPS experts (#38969) to have a look.
BTW, as pointed out in https://cms-talk.web.cern.ch/t/replay-for-testing-the-run-3-collisions-setup/10676, the cause of the memory issue could be something completely different from the warnings in the cmsRun-stdout.log file, since the warnings seem to have been there for a long time.

@germanfgv are there any other files to be checked in the job report folder from which we can access the stack trace for this job?

@germanfgv

@rvenditti we don't have access to the stack trace of the job at the moment of termination. You can find 3 sets of log files in the tarball:

Condor logs: _condor_std*
Agent logs: wmagentJob.log (here you can see the performance monitor scanning the use of memory periodically)
cmsRun logs: job/WMTaskSpace/cmsRun1/cmsRun1-stdo*

Other than that, no more information is available.
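
For anyone who wants to pull the three sets of logs out of the tarball in one go, here is a rough sketch. The patterns are the ones listed above, matched against the file base name only; the function name group_log_members and the group labels are invented for illustration, and the layout of a given tarball may differ.

```python
import fnmatch
import posixpath
import tarfile

# Patterns taken from the list above; matching is done on the base name,
# so the directory part (e.g. job/WMTaskSpace/cmsRun1/) is ignored.
LOG_GROUPS = {
    "condor": "_condor_std*",
    "agent": "wmagentJob.log",
    "cmsRun": "cmsRun1-stdo*",
}


def group_log_members(tar):
    """Map each log group to the sorted member paths found in an open tarball."""
    groups = {name: [] for name in LOG_GROUPS}
    for member in tar.getmembers():
        if not member.isfile():
            continue  # skip directories and links
        base = posixpath.basename(member.name)
        for name, pattern in LOG_GROUPS.items():
            if fnmatch.fnmatchcase(base, pattern):
                groups[name].append(member.name)
    return {name: sorted(paths) for name, paths in groups.items()}
```

It takes an already opened tarfile.TarFile object, so it can be called inside a "with tarfile.open(...) as tar:" block on whatever the job tarball is called locally.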

@missirol

missirol commented Aug 9, 2022

Maybe (re-)stating the obvious: the HLT-related warnings are unrelated to the main issue, i.e. #38976.

I had a look at the warnings, and I think their origin is clear: the Harvesting modules in question, i.e. instances of EgHLTOfflineClient and EgHLTOfflineSummaryClient, use HLTConfigProvider to find the names of relevant e/gamma HLT paths and filters, and those names are then used to look for input histograms (or, 'monitor elements'), and create outputs (e.g. efficiency graphs, etc). The problem is that HLTConfigProvider will fail and issue a warning when running on DQMIO files, as it won't find there the relevant inputs with process label "HLT". I believe (but haven't checked) these Harvesting modules will instead work as is when DQM+Harvesting steps run on EDM inputs (e.g. AOD files).

In this particular example (this may not hold in other cases), egHLTOffDQMClient is configured with (runClientEndJob = False, runClientEndLumiBlock = False, runClientEndRun = True), but runClientEndRun is never used inside the plugin, so the function runClient_ (which creates the harvesting outputs) would not run in any case, regardless of the issue with HLTConfigProvider:
https://github.com/cms-sw/cmssw/blob/CMSSW_12_4_X/DQMOffline/Trigger/plugins/EgHLTOfflineClient.cc#L43
https://github.com/cms-sw/cmssw/blob/CMSSW_12_4_X/DQMOffline/Trigger/plugins/EgHLTOfflineClient.cc#L91

I think one could try to improve these Harvesting modules by extracting the relevant filter/path names based on the available input histograms; this way, the module could work both (1) on DQMIO inputs and (2) when DQM+Harvesting run on EDM inputs. Before updating the plugins though, it should probably be clarified whether these plugins are actually important and worth updating. This can only be answered by EGM (@swagata87) and DQM experts.
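
To make the idea concrete, here is a toy sketch in Python (not CMSSW code) of recovering filter names from the names of the histograms that are actually present. Everything in it is hypothetical: it assumes input histograms are named "<filterName>_<suffix>" and that a filter is usable only if all expected suffixes are found. The real naming scheme of the e/gamma monitor elements would have to be checked in the plugins.

```python
def infer_filter_names(histogram_names, suffixes):
    """Return the sorted prefixes for which every expected suffix is present.

    Assumes (hypothetically) that input histograms are named
    '<filterName>_<suffix>'; the real naming scheme may differ.
    """
    found = {}
    for name in histogram_names:
        for suffix in suffixes:
            tail = "_" + suffix
            # Require a non-empty prefix in front of the suffix.
            if name.endswith(tail) and len(name) > len(tail):
                found.setdefault(name[: -len(tail)], set()).add(suffix)
    wanted = set(suffixes)
    return sorted(prefix for prefix, seen in found.items() if seen == wanted)
```

With such an approach the list of filters would come from the input file itself, so the lookup would not depend on HLTConfigProvider finding the "HLT" process, whether the inputs are DQMIO or EDM files.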

(Reminder: the workflows of the HLT offline-DQM are maintained by DQM, and mostly developed by POGs; they are not under the direct watch of HLT L2s.)
