
Move CaloL1 online DQM to Global #32004

Merged
merged 5 commits into cms-sw:master on Nov 17, 2020

Conversation

aloeliger
Contributor

PR description:

This PR migrates the CaloL1 online DQM module to the "Global" format to comply with the concurrent LS processing introduced for offline DQM as part of #25090. The module preserves all monitoring elements (and therefore summary tests) by using maps that hold the relevant luminosity block information inside a CaloL1 namespace structure stored in the run cache. This PR should make PR #31678 obsolete.

PR validation:

All code compiles, and I have checked that a min-bias data workflow reproduces the same histograms with the old and the new online module. I have also verified that the error plots of the old and new versions of the online module yield the same error information in the same way, by comparing on Run 315273 (a 2018 HI run with HCAL mismatches at CaloL1). Online monitoring should remain unchanged. The PR also passes the basic test procedure suggested in the CMSSW PR instructions.

If this PR is a backport, please specify the original PR and why you need to backport that PR:

This PR is not a backport

@cmsbuild
Contributor

cmsbuild commented Nov 2, 2020

The code-checks are being triggered in jenkins.

@cmsbuild
Contributor

cmsbuild commented Nov 2, 2020

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-32004/19521

  • This PR adds an extra 28KB to repository

  • There are other open Pull requests which might conflict with changes you have proposed:

@cmsbuild
Contributor

cmsbuild commented Nov 2, 2020

A new Pull Request was created by @aloeliger for master.

It involves the following packages:

DQM/L1TMonitor

@andrius-k, @kmaeshima, @ErnestaP, @cmsbuild, @jfernan2, @fioriNTU can you please review it and eventually sign? Thanks.
@silviodonato, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@jfernan2
Contributor

jfernan2 commented Nov 2, 2020

please test

@cmsbuild
Contributor

cmsbuild commented Nov 2, 2020

The tests are being triggered in jenkins.

@cmsbuild
Contributor

cmsbuild commented Nov 2, 2020

+1
Tested at: ca6de1d
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-06ec13/10446/summary.html
CMSSW: CMSSW_11_2_X_2020-11-02-1100
SCRAM_ARCH: slc7_amd64_gcc820

@cmsbuild
Contributor

cmsbuild commented Nov 2, 2020

Comparison job queued.

if (theMap->find(lumiBlock) != theMap->end()) {
//this is not the first event with errors per lumi-block, insert it only if the number of errors is larger than before
if (nErrors > theMap->find(lumiBlock)->second) {
(*theMap)[lumiBlock] = nErrors;
@makortel
Contributor

This approach is not thread safe (I'm saying it only here rather than repeating it everywhere). The compiler would also complain if the map were taken as const std::map<std::string, int>& (unfortunately a "const pointer" is not the same as a "(const) pointer to const", which is what would really be needed here).
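
For illustration, a minimal standalone example of that pointer distinction (the names here are made up, not from the PR):

// "const pointer" vs. "pointer to const"
#include <map>
#include <string>

void example(std::map<std::string, int>* const constPointer,        // const pointer: cannot be reseated, but the map is still mutable
             std::map<std::string, int> const* pointerToConst) {    // pointer to const: the map cannot be modified through it
  (*constPointer)["key"] = 1;      // compiles: mutating through a const pointer is allowed
  // (*pointerToConst)["key"] = 1; // would not compile: std::map::operator[] is non-const
}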

If I understood correctly, the aim is to find the maximum of some quantity over a LuminosityBlock, and eventually put that number into the bin of a histogram that corresponds to that lumi. I believe the most efficient way to do that (in general) would be the following (see the example https://twiki.cern.ch/twiki/bin/view/CMSPublic/FWMultithreadedFrameworkGlobalModuleInterface#example_usage_calculating_per_lu):

  • Use edm::StreamCache with an int for each quantity to be recorded over a lumi
    • In dqmAnalyze() just do something along the lines of streamCache(event.streamID())->var = std::max(streamCache(event.streamID())->var, localVar) (*)
  • Use edm::LuminosityBlockSummaryCache with an int for each quantity to do a reduction (with std::max() operation) over the edm streams
    • streamEndLuminosityBlockSummary() stores the maximum of the StreamCache and the LumiSummaryCache to the LumiSummaryCache and resets the StreamCache value(s) to zero
    • globalEndLuminosityBlockSummary() stores the reduced maximum into a container (see next bullet) along with the lumi id
  • Use tbb::concurrent_vector<std::pair<edm::LuminosityBlockNumber_t, int>> in the RunCache (i.e. in CaloL1Information::monitoringDataHolder) to hold the values per lumi for that run
    • dqmEndRun() copies the values from the container into the histogram

(alternatively an edm::LuminosityBlockCache could be combined with std::atomic<int>s)

(*) for @cms-sw/dqm-l2, the DQMGlobalEDAnalyzer API would be more consistent with edm::global interfaces if the edm::StreamID would be passed from accumulate() to dqmAnalyze()

Contributor Author

@aloeliger aloeliger Nov 2, 2020

@makortel Thanks for the comments. I was afraid this might be the case.

If I understood correctly, the aim is to find the maximum of some quantity over a LuminosityBlock, and eventually put that number into the bin of a histogram that corresponds to that lumi

This is correct. Unfortunately, I had already tried using LuminosityBlockCaches to gain access to the luminosity block transition functions and to cache information in this manner (the old implementation of this module was designed to fill things on luminosity block transitions, so this seemed like the quickest way to regain all functionality). My attempt was something similar to:

class L1TStage2CaloLayer1Offline : public DQMGlobalEDAnalyzer<edm::LuminosityBlockCache<CaloL1Information::monitoringDataHolder>>

as the main implementation of the module, but this doesn't work due to the way the DQM global analyzers are implemented, since run-level caching is assumed at the DQMGlobalEDAnalyzer level here:

class DQMGlobalEDAnalyzer
    : public edm::global::EDProducer<edm::RunCache<H>,
                                     // DQMGlobalEDAnalyzer are fundamentally unable to produce histograms for any
                                     // other scope than MonitorElement::Scope::RUN.
                                     edm::EndRunProducer,
                                     edm::Accumulator,
                                     Args...> {

So I think this rules out every solution you mentioned but the last. However, I am unfamiliar with tbb::concurrent_vector<>s. Is there documentation I should be aware of for using these? I assume these objects are just vectors with some sort of inherent locking implemented?

I had also considered an even less verbose solution: comparing and setting in the monitoring element directly for every event. I assume this would be no more thread safe than what I have implemented right now?

Thanks again for your help,
Andrew Loeliger

@makortel
Contributor

My attempt was something similar to:

class L1TStage2CaloLayer1Offline : public DQMGlobalEDAnalyzer<edm::LuminosityBlockCache<CaloL1Information::monitoringDataHolder>>

as the main implementation of the module, but this doesn't work due to the way the DQM global analyzers are implemented, since run-level caching is assumed at the DQMGlobalEDAnalyzer level here:

https://github.com/cms-sw/cmssw/blob/master/DQMServices/Core/interface/DQMGlobalEDAnalyzer.h#L13-L19

The DQMGlobalEDAnalyzer indeed requires/uses a RunCache for the MonitorElements, but (as far as I can tell) that does not preclude the use of other caches for other information (especially when the functions specific to those caches do not use any DQM APIs).

So I think this rules out every solution you mentioned but the last.

I meant that all three bullet points need to be used together.

The alternative I suggested in the parentheses under the bullet points would replace the use of edm::StreamCache and edm::LuminosityBlockSummaryCache with edm::LuminosityBlockCache and explicit use of std::atomic<int> (which would require a somewhat-low-level loop using std::atomic<int>::compare_exchange_weak() to achieve the max() operation).
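
For concreteness, a minimal standalone sketch of that compare_exchange_weak() max loop (the function name and arguments are invented for the example):

#include <atomic>

// Atomically raise 'currentMax' to 'observed' if 'observed' is larger.
void updateMax(std::atomic<int>& currentMax, int observed) {
  int prev = currentMax.load(std::memory_order_relaxed);
  // Retry until either the stored value is already >= observed, or the larger
  // value has been written; on failure compare_exchange_weak refreshes 'prev'.
  while (prev < observed &&
         !currentMax.compare_exchange_weak(prev, observed, std::memory_order_relaxed)) {
  }
}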

However, I am unfamiliar with tbb::concurrent_vector<>s. Is there documentation I should be aware of for using these? I assume these objects are just vectors with some sort of inherent locking implemented?

It provides fine-grained synchronization (and thus might be a bit overkill for just concurrent push_back()). The interface is very std::vector-like (including push_back() and iteration over elements). For more information please see
https://www.threadingbuildingblocks.org/docs/help/tbb_userguide/concurrent_vector_ug.html
https://www.threadingbuildingblocks.org/docs/doxygen/a00046.html
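
For a flavor of the interface, a minimal standalone sketch of concurrent push_back followed by plain iteration (assuming TBB is available; this is not CMSSW code):

#include <tbb/concurrent_vector.h>
#include <cstdio>
#include <thread>
#include <utility>
#include <vector>

int main() {
  tbb::concurrent_vector<std::pair<unsigned int, int>> perLumiMax;

  // Several threads append concurrently; push_back is thread safe.
  std::vector<std::thread> workers;
  for (unsigned int lumi = 1; lumi <= 4; ++lumi) {
    workers.emplace_back([&perLumiMax, lumi] { perLumiMax.push_back(std::make_pair(lumi, int(10 * lumi))); });
  }
  for (auto& w : workers)
    w.join();

  // Once the writers are done, iteration looks just like std::vector.
  for (auto const& entry : perLumiMax)
    std::printf("lumi %u -> max %d\n", entry.first, entry.second);
}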

I had also considered an even less verbose solution: comparing and setting in the monitoring element directly for every event. I assume this would be no more thread safe than what I have implemented right now?

Right, not without per-histogram locking (which the MonitorElement actually has, but which is not exposed outside, for good reasons).

@aloeliger
Contributor Author

The DQMGlobalEDAnalyzer indeed requires/uses a RunCache for the MonitorElements, but (as far as I can tell) that does not preclude the use of other caches for other information (especially when the functions specific to those caches do not use any DQM APIs).

Then, can I ask what is the appropriate syntax to use when trying to mix and use the different cache scopes at various levels? As I mentioned, I tried the above syntax:

class L1TStage2CaloLayer1Offline : public DQMGlobalEDAnalyzer<edm::LuminosityBlockCache<CaloL1Information::monitoringDataHolder>>

but later, when trying to retrieve this luminosity block cache doing something like:

if (nEcaLLinkErrors > luminosityBlockCache(event.getLuminosityBlock().index())->maxEvtLinkErrorsECALCurrentLumi_)

This would always fail to compile, as the compiler didn't recognize the luminosityBlockCache() function for retrieving the luminosity block cache (it is perhaps also worth noting that it would not recognize overrides of the global luminosity block transition functions, stating that the override was not overriding anything previously declared).

This error was unique to the DQMGlobalEDAnalyzer implementation; when changing from DQMGlobalEDAnalyzer to DQMOneEDAnalyzer, these lines compiled without complaint.

Is there a method I am unaware of to use these two different cache scopes together?

Thanks,
Andrew Loeliger

@makortel
Contributor

This one

class L1TStage2CaloLayer1Offline : public DQMGlobalEDAnalyzer<edm::LuminosityBlockCache<CaloL1Information::monitoringDataHolder>>

declares edm::LuminosityBlockCache<CaloL1Information::monitoringDataHolder> to be used as the RunCache (which I can fully believe leads to compilation errors). In order to use multiple caches, each of them needs to be declared separately, e.g.

// note CMS naming convention is start namespaces in lower case, and classes/structs in upper case
namespace caloL1Information {
  struct MonitoringDataHolder {
    // MonitorElements;
    tbb::concurrent_vector<std::pair<edm::LuminosityBlockID, int>> maxEvtLinkErrorsMapECAL;
    ...
  };

  struct PerStreamMax {
    int maxEvtLinkErrorsMapECAL = 0;
    ...
  };

  struct PerLumiMax {
    int maxEvtLinkErrorsMapECAL = 0;
    ...
  };
}

class L1TStage2CaloLayer1Offline : public DQMGlobalEDAnalyzer<
  caloL1Information::MonitoringDataHolder, // RunCache, not decorated with edm::RunCache<...> because the base class does that
  edm::StreamCache<caloL1Information::PerStreamMax>,
  edm::LuminosityBlockSummaryCache<caloL1Information::PerLumiMax>>

std::shared_ptr<caloL1Information::PerStreamMax> beginStream(edm::StreamID) const {
  return std::make_shared<caloL1Information::PerStreamMax>();
}

std::shared_ptr<caloL1Information::PerLumiMax> globalBeginLuminosityBlockSummary(edm::LuminosityBlock const&, edm::EventSetup const& ) const {
  return std::make_shared<caloL1Information::PerLumiMax>();
}

// reduction over events within one stream (and by definition within one lumi)
// there may be concurrent calls to this function, but framework guarantees that there is at most one call for any single stream
// (and therefore StreamCache can be used safely without explicit synchronization)
void dqmAnalyze(edm::Event const& iEvent, edm::EventSetup const&, caloL1Information::MonitoringDataHolder const&) const {
  ...
  auto cache = streamCache(iEvent.streamID());
  cache->maxEvtLinkErrorsMapECAL = std::max(cache->maxEvtLinkErrorsMapECAL, nEcalLinkErrors);
  ...
}

// reduction over streams
// framework guarantees that this function is called one at a time
// (and therefore StreamCache and LuminosityBlockSummaryCache can be used without explicit synchronization)
void streamEndLuminosityBlockSummary(edm::StreamID streamID, edm::LuminosityBlock const&, edm::EventSetup const&, caloL1Information::PerLumiMax* lumiSummaryCache) const {
  auto cache = streamCache(streamID);
  lumiSummaryCache->maxEvtLinkErrorsMapECAL = std::max(lumiSummaryCache->maxEvtLinkErrorsMapECAL, cache->maxEvtLinkErrorsMapECAL);
  ...
}

// store reduced maximum to tbb::concurrent_vector
void globalEndLuminosityBlockSummary(edm::LuminosityBlock const& iLumi, edm::EventSetup const&, caloL1Information::PerLumiMax* lumiSummaryCache) const {
  auto cache = runCache(iLumi.getRun().index());
  cache->maxEvtLinkErrorsMapECAL.emplace_back(iLumi.id(), lumiSummaryCache->maxEvtLinkErrorsMapECAL);
  ...
}

(please invent better names for the cache types :) )
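
For completeness, the last bullet from the list above (dqmEndRun() copying the values from the container into the histogram) might look roughly like the sketch below; the MonitorElement member maxEvtLinkErrorsByLumiECAL_ and the by-lumi binning are made-up assumptions for illustration, not the PR's actual code.

// Sketch only: copy the reduced per-lumi maxima from the RunCache's
// concurrent_vector into a by-lumi histogram at the end of the run.
void dqmEndRun(edm::Run const&,
               edm::EventSetup const&,
               caloL1Information::MonitoringDataHolder const& runCache) const {
  for (auto const& lumiAndMax : runCache.maxEvtLinkErrorsMapECAL) {
    // first = lumi id stored in globalEndLuminosityBlockSummary(), second = reduced maximum
    runCache.maxEvtLinkErrorsByLumiECAL_->setBinContent(lumiAndMax.first.luminosityBlock(), lumiAndMax.second);
  }
}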

//these maps store the maximum number of evt and link mismatches per lumi section.
//they are read back at the end of runs
//Made std::unique_ptr's for better memory management
std::unique_ptr<std::map<std::string, int>> maxEvtLinkErrorsMapECAL{std::make_unique<std::map<std::string, int>>()};
@makortel
Contributor

How is unique_ptr better here than

std::map<std::string, int> maxEvtLinkErrorsMapECAL;

?

Contributor Author

@aloeliger aloeliger Nov 2, 2020

I believe the reasoning behind this was, as you noted above, to dodge the compiler errors about a const function modifying class members, by making the member an unmodified pointer to a mutable object that I could work with (which, I grant you, seems like a textbook way to write terminally buggy code).

Continuing from the above, what this module needs in order to maintain its previous functionality is a thread-safe, dynamically sized vector/map-like object whose contents are mutable from the analysis functions, for storing luminosity block information at the run-cache level. Will tbb::concurrent_vector<> provide this? I assume there is also probably a considerable speed cost associated with an object like this?

Thanks,
Andrew Loeliger

@cmsbuild
Contributor

cmsbuild commented Nov 2, 2020

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-06ec13/10446/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 35
  • DQMHistoTests: Total histograms compared: 2544092
  • DQMHistoTests: Total failures: 7
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2544063
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 34 files compared)
  • Checked 149 log files, 22 edm output root files, 35 DQM output files

@jfernan2
Contributor

jfernan2 commented Nov 9, 2020

@aloeliger the last test shows this change in one of the MEs of L1T / L1TStage2CaloLayer1 / MismatchDetail:
https://tinyurl.com/y6nqrycp

@aloeliger
Contributor Author

Hi all,

I just wanted to give a quick check-in on where this stands right now.

I am changing the code following the strategy suggested by @makortel, using the mixed stream and summary caches to maintain thread safety. I have all of the "Max"-style histograms changed over to this scheme; however, I am running into technical difficulties trying to change last20Mismatches_ over.

last20Mismatches_ is a monitoring element that just catalogues the run, lumi and event of the last 20 mismatches in order, as well as the type of mismatch.

My initial thought was to keep a vector of tuples per stream that maintains a strict size of 20, and then, using a RunSummaryCache, to use streamEndRunSummary() to merge it into a similar vector at the run-cache level, with functions for maintaining the order and size of the vector (a rough sketch of such a merge helper is given at the end of this comment). That vector can then be read at global run end to create the last20Mismatches monitoring element. However, when trying to add a RunSummaryCache to the current module like so:

class L1TStage2CaloLayer1 : public DQMGlobalEDAnalyzer<CaloL1Information::monitoringDataHolder,
                                                       edm::StreamCache<CaloL1Information::perStreamMonitoringDataHolder>,
                                                       edm::LuminosityBlockSummaryCache<CaloL1Information::perLumiBlockMonitoringInformation>,
                                                       edm::RunSummaryCache<CaloL1Information::perRunSummaryMonitoringInformation>> {

I seem to end up getting unresolvable compilation errors. From what I understand from the error:

In file included from /afs/cern.ch/work/a/aloelige/private/DQMDevelopment/src/DQM/L1TMonitor/interface/L1TStage2CaloLayer1.h:10,
                 from /afs/cern.ch/work/a/aloelige/private/DQMDevelopment/src/DQM/L1TMonitor/plugins/SealModule.cc:27:
/cvmfs/cms.cern.ch/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_0_pre6/src/DQMServices/Core/interface/DQMGlobalEDAnalyzer.h: In instantiation of 'class DQMGlobalEDAnalyzer<CaloL1Information::monitoringDataHolder, edm::StreamCache<CaloL1Information::perStreamMonitoringDataHolder>, edm::LuminosityBlockSummaryCache<CaloL1Information::perLumiBlockMonitoringInformation>, edm::RunSummaryCache<CaloL1Information::perRunSummaryMonitoringInformation> >':
/afs/cern.ch/work/a/aloelige/private/DQMDevelopment/src/DQM/L1TMonitor/interface/L1TStage2CaloLayer1.h:192:36:   required from here
/cvmfs/cms.cern.ch/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_0_pre6/src/DQMServices/Core/interface/DQMGlobalEDAnalyzer.h:58:8: error: 'void DQMGlobalEDAnalyzer<H, Args>::globalEndRunProduce(edm::Run&, const edm::EventSetup&) const [with H = CaloL1Information::monitoringDataHolder; Args = {edm::StreamCache<CaloL1Information::perStreamMonitoringDataHolder>, edm::LuminosityBlockSummaryCache<CaloL1Information::perLumiBlockMonitoringInformation>, edm::RunSummaryCache<CaloL1Information::perRunSummaryMonitoringInformation>}]' marked 'final', but is not virtual
   void globalEndRunProduce(edm::Run& run, edm::EventSetup const& setup) const final {

and the linked DQMGlobalEDAnalyzer line:

void globalEndRunProduce(edm::Run& run, edm::EventSetup const& setup) const final {
  auto const& h = *this->runCache(run.index());
  dqmEndRun(run, setup, h);
  dqmstore_->leaveLumi(run.run(), /* lumi */ 0, meId(run));
  run.emplace(runToken_);
}

It seems that declaring a RunSummaryCache requires the class to define globalEndRunProduce(), but that function is already defined by the DQM core class above and marked final, which does not allow me to define it in my class.

My knowledge of C++ inheritance features is a little rusty. I don't actually need this function for any of the module's functionality, so is there a way to declare it explicitly in the L1TStage2CaloLayer1 module but defer to the DQM core class's definition, in a way that satisfies both the RunSummaryCache requirement and the final-marked function in DQMGlobalEDAnalyzer, so that the RunSummaryCache can be used to retain the functionality of last20Mismatches_?
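
For reference, a rough standalone sketch of the per-stream "last 20" merge described at the top of this comment; the tuple layout, ordering convention, and function name are assumptions for illustration, not the PR's actual code.

// Merge a per-stream list of mismatches into the run-level list, keeping only
// the 20 most recent entries ordered by (run, lumi, event).
#include <algorithm>
#include <string>
#include <tuple>
#include <vector>

using MismatchRecord = std::tuple<unsigned int, unsigned int, unsigned long long, std::string>;  // run, lumi, event, mismatch type

void mergeLast20(std::vector<MismatchRecord>& runLevel, std::vector<MismatchRecord> const& streamLevel) {
  runLevel.insert(runLevel.end(), streamLevel.begin(), streamLevel.end());
  // Sort newest first so the first 20 entries are the most recent mismatches.
  std::sort(runLevel.begin(), runLevel.end(),
            [](MismatchRecord const& a, MismatchRecord const& b) { return a > b; });
  if (runLevel.size() > 20)
    runLevel.resize(20);
}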

@aloeliger
Contributor Author

Hello all,

I just wanted to check in and see if there was any guidance on how the last20Mismatches_ monitoring element could be preserved in a thread-safe way. Is there a way that RunSummaryCache can be used together with DQMGlobalEDAnalyzer without this problem with globalEndRunProduce()?

@makortel
Contributor

(apologies, this fell a bit under the radar)

Is there a way that RunSummaryCache can be used together with DQMGlobalEDAnalyzer without this problem with globalEndRunProduce()?

At the moment, I don't think it can, but there could be a relatively simple way to extend DQMGlobalEDAnalyzer to allow that. Let me experiment a bit.

@cmsbuild
Contributor

+1
Tested at: 6db14ec
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-06ec13/10772/summary.html
CMSSW: CMSSW_11_2_X_2020-11-15-2300
SCRAM_ARCH: slc7_amd64_gcc820

@cmsbuild
Contributor

Comparison job queued.

@cmsbuild
Contributor

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-06ec13/10772/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 2 differences found in the comparisons
  • DQMHistoTests: Total files compared: 35
  • DQMHistoTests: Total histograms compared: 2529296
  • DQMHistoTests: Total failures: 11
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2529263
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 34 files compared)
  • Checked 148 log files, 22 edm output root files, 35 DQM output files

@jfernan2
Contributor

jfernan2 commented Nov 16, 2020

@aloeliger the difference in L1T / L1TStage2CaloLayer1 / MismatchDetail / last20Mismatches is still there in some WFs; I am not sure if your last changes were meant to fix it.

@aloeliger
Contributor Author

aloeliger commented Nov 16, 2020

Hi @jfernan2, these last few commits were meant to take care of the thread safety issues pointed out by @makortel; they were not intended to change anything significant in how last20Mismatches operates.

I have taken a look at the 136.731 workflow to understand what was going on.

These are the non-empty mismatch histograms reported back by the old version of the DQM code:

[screenshots, workflow 136.731, old code: hcalDiscrepancy, HBHEmismatchesPerBX, hcalMismatchByLumi, last20Mismatches]

and these are the histograms reported back by the code in this PR:

[screenshots, workflow 136.731, PR code: hcalDiscrepancy, HBHEmismatchesPerBx, hcalMismatchByLumi, last20Mismatches]

Note that these histograms are identical, except that the old code reports no mismatches in the last20Mismatches monitoring element, despite there being reported mismatches for this workflow.

I suspect one of two things is happening here:

  1. The more likely of the two is that I have introduced a change in how last20Mismatches is filled, filling it at the end of the run instead of at the end of a lumisection (I thought it would be easier to recreate what I believed was its behavior this way). It is possible that the old code only shows the last 20 mismatches of a particular lumisection, and the last lumisection of this workflow contains no mismatches (so none are shown there); the run as a whole does contain mismatches, so they are shown in the new version. It makes more sense to me to have this monitoring element filled on the run scale, as filling it on the lumisection scale may be misleading. That being said, I could try to write the monitoring element so that it fills per lumisection instead to synchronize the plots, provided I had an idea of how to clear the monitoring element between lumisections.

  2. I did not write the original last20Mismatches code, so it could be that it was bugged in a way I was unaware of due to my unfamiliarity.

I have not discussed this particular change with the CaloL1 team yet, so I may need to do that, provided you are okay with this explanation.

@jfernan2
Contributor

jfernan2 commented Nov 16, 2020

Thanks @aloeliger, I also vote for option 1, but I'll let L1T decide.

@aloeliger
Contributor Author

@jfernan2 I have discussed with @asavincms and I think there are some more upgrades planned, but this should be okay from our side for now if it is okay for the DQM migration.

@jfernan2
Contributor

OK @aloeliger, but let's note down that the plot may now be flagged, since it is likely being correctly filled (at the end of the run).
Thanks

@jfernan2
Contributor

+1

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@silviodonato
Contributor

OK @aloeliger, but let's note down that the plot may now be flagged, since it is likely being correctly filled (at the end of the run).
Thanks

@jfernan2 please open an issue to keep track of it

@silviodonato
Contributor

+1

@cmsbuild cmsbuild merged commit c77f047 into cms-sw:master Nov 17, 2020
@jfernan2
Contributor

@jfernan2 please open an issue to keep track of it

I am not sure if this is actually an issue @silviodonato

My understanding from @aloeliger's findings is that the plot is now correctly filled, since it collects mismatches from the whole run, while before it only collected them from the last LS.

I'll let the L1T experts confirm, since the plot may now be red in Offline DQM, whereas previously it was hiding mismatches not produced in the last LS of the run. I am referring to this comment:
#32004 (comment)

@aloeliger
Contributor Author

aloeliger commented Nov 17, 2020

@jfernan2 @silviodonato This is correct, and I believe it now reports more accurately what one would expect the behavior to be. However, I think it is important to note that this plot is not part of the summary tests for the overall CaloL1 system.

As seen here:

<TESTSCONFIGURATION>
  <QTEST name="Layer1MismatchThreshold">
    <TYPE>ContentsYRange</TYPE>
    <PARAM name="warning">0.</PARAM>
    <PARAM name="error">1.</PARAM>
    <PARAM name="ymin">0.0</PARAM>
    <PARAM name="ymax">10.0</PARAM>
    <PARAM name="useEmptyBins">0</PARAM>
  </QTEST>
  <QTEST name="Layer1LinkErrorThreshold">
    <TYPE>ContentsYRange</TYPE>
    <PARAM name="warning">0.</PARAM>
    <PARAM name="error">1.</PARAM>
    <PARAM name="ymin">0.0</PARAM>
    <PARAM name="ymax">10.0</PARAM>
    <PARAM name="useEmptyBins">0</PARAM>
  </QTEST>
  <LINK name="*L1TStage2CaloLayer1/MismatchDetail/maxEvtMismatchByLumi*">
    <TestName activate="true">Layer1MismatchThreshold</TestName>
  </LINK>
  <LINK name="*L1TStage2CaloLayer1/maxEvtMismatchByLumi">
    <TestName activate="true">Layer1MismatchThreshold</TestName>
  </LINK>
  <LINK name="*L1TStage2CaloLayer1/MismatchDetail/maxEvtLinkErrorsByLumi*">
    <TestName activate="true">Layer1LinkErrorThreshold</TestName>
  </LINK>
  <LINK name="*L1TStage2CaloLayer1/maxEvtLinkErrorsByLumi">
    <TestName activate="true">Layer1LinkErrorThreshold</TestName>
  </LINK>
</TESTSCONFIGURATION>

The only tests in the summary are on the maximum mismatch/link-error plots, which are exactly identical between the old and new code as the changes have been implemented. Unless one is doing incredibly careful inspection of the CaloL1 system while processing runs offline, this change will likely not be noticed unless it is already clear that the run has produced a lot of mismatches. The monitoring element is mostly intended as an online monitoring tool for understanding which issues are happening, and in what order, in the trigger, which may help with diagnosis.

I am more concerned, however, about the performance of this monitoring element in a multithreaded environment offline. I have tested it for stability online on actual data with multiple lumisections (seen here with 2018 commissioning run 310150):

[screenshot: last20Mismatches filled from 2018 commissioning run 310150]

and offline using the current runTheMatrix.py setup, but I am unsure whether the sorting and maintenance functions I have written for its use across multiple streams will hold up and remain stable. I know this PR has already been merged, but would it be possible for me to test the performance of this module with multiple streams, to make sure it is safe for offline use, before we might be surprised by bugs in its behavior?

And so you are aware: after discussion with Alexander Savin (@asavincms), we may attempt to further patch this monitoring element in the future by not allowing duplicate events to show up in multiple y-axis positions.

@silviodonato
Contributor

@aloeliger see #32172 and #32173
