Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

edm::Service callback for lumi/run cleanup #28521

Closed
schneiml opened this issue Dec 2, 2019 · 10 comments
Closed

edm::Service callback for lumi/run cleanup #28521

schneiml opened this issue Dec 2, 2019 · 10 comments

Comments

@schneiml
Copy link
Contributor

schneiml commented Dec 2, 2019

Dear framework experts,

while porting the "new DQMStore" work back to work with a edm::Service based DQMStore, I face a problem related to which callbacks are available to the service.

In the DQMStore, we now essentially emulate the product life cycle in edm: histograms are created as a module enters a lumisection/run, kept while that lumi/run has not ended, saved to one or more output files in output modules or EDAnalyzers, and then removed and recycled once the run/lumi is completely processed. One important detail here is that we whenever possible reuse the MonitorElement/TH1 objects: this is to make sure that when there are no concurrent runs/lumis, there is exactly one object per histogram and the legacy modules can keep pointers to that, and continue working correctly.

Now, the problem is that the DQMStore has to delete (or rather "mark for recycling") the histograms for a lumisection after all output modules have written the their output. As far as I can see, there is no callback for that.

My current interim implementation cleans up at the beginning of the next lumisection, but that can't work with concurrent lumis. Also it does not work with the DQMRootSource, since it wants to read all histograms before triggering the start of the run/lumi. In the past, the cleanup was triggered by the output modules themselves, which is deeply flawed since there might be more than one of them (even in practice, usually MEtoEDM and DQMRootOutputModule were loaded at the same time).

Find the current WIP here:

mkdir -p /tmp/$USER
cd /tmp/$USER
cmsrel CMSSW_11_0_X_2019-11-26-2300
cd CMSSW_11_0_X_2019-11-26-2300/src
cmsenv
git cms-init
git cms-merge-topic schneiml:dqm-prepare-for-stream-v2
git checkout f603f9fd9dda8203fe5387ef7ba75b75c6a594bd
cd DQMServices/
scram b -j12
cd FwkIO/test
LOCAL_TEST_DIR=. LOCAL_TMP_DIR=. ./run_tests.sh

read_file1_file2_cfg.py fails because the second lumi is read before the first one is cleaned up, therefore new objects need to be allocated; after the first lumi is cleaned up, the old pointers are invalid and DummyReadDQMStore crashes.. This could easily be avoided by fixing DummyReadDQMStore, but I believe we should support this (full legacy behavior as long as there are no concurrent runs/lumis). The flawed DQMStore logic is DQMServices/Core/src/DQMStore.cc:506-520.

Is there an edm::Service callback that could be used (after a lumi is fully processed, but before the next one starts reading)? Could the DQMRootSource be modified to only start reading histograms after the global begin lumi/run was triggered? Would it be acceptable to count the output modules and count watchPostModuleWrite* events?

@cmsbuild
Copy link
Contributor

cmsbuild commented Dec 2, 2019

A new Issue was created by @schneiml Marcel Schneider.

@davidlange6, @Dr15Jones, @smuzaffar, @fabiocos, @kpedro88 can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@Dr15Jones
Copy link
Contributor

@schneiml I would think that the natural way to handle such re-use would be to use a shared reference count. If the count is 0 by the time you need to 'rebook' you can just re-use it.

@Dr15Jones
Copy link
Contributor

@schneiml I probably missed something. You are trying to make the new interface work with the old system involving the DQMStore service? Does that mean you would not be putting data products into the Run and LuminosityBlock for this?

@Dr15Jones
Copy link
Contributor

assign core

@cmsbuild
Copy link
Contributor

cmsbuild commented Dec 4, 2019

New categories assigned: core

@Dr15Jones,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

@schneiml
Copy link
Contributor Author

schneiml commented Dec 5, 2019

@Dr15Jones The current plan for 11_1 is to have DQMTokens everywhere (except legacy modules), but keep all the data inside the DQMStore, in an organisation that is a bit closer to products than "one big pile of histograms". This decision is because there are many unsolved problems with products only: online DQM, DQM@HLT, legacy modules, multi-run harvesting and end-job histograms, etc.

This way, we should be able to still support all the legacy modules and workflows like online DQM while running without concurrent-anything (potentially even multithreaded, though that is not really required), while also allowing concurrent-anything multi-threaded for non-legacy modules (and also be a bit more memory-safe, even if incorrect, for legacy modules that end up in multi-thread concurrent situations).

That seems like a good deal to me: we get a lot of the cleanup, but don't break anybodies workflow (yet), get safety from undefined behavior and make the implementation more simple. Moving to products should become much easier in the future, but needs more redesign around the non-standard workflows.

Re the ref-counting solution, the problem is that I can't rely on getting (non-edm) callbacks from the modules. For the non-legacy modules, these callbacks drive everything (like in the old DQMStore, btw), but I'd also like to get things correct for modules that just talk to the service at random times (which was a pile of hacks that only worked in "legacy mode" in the old DQMStore). It might be possible to do it using only edm callbacks, that is what I meant by "counting modules" above: I can just listen to all the events and keep track which modules are in which lumisections. I'd preferred sth. simpler, but it sure could be made to work.

@schneiml
Copy link
Contributor Author

schneiml commented Dec 5, 2019

BTW, another idea I had was to trigger the ref-count/cleanup using edm::Service calls from the DQMToken constructor/destructor. But I don't see how this could be done without a dependency from DataFormats to DQMServices, and also it feels like a very, very bad idea...

@Dr15Jones
Copy link
Contributor

My thought was to share a std::atomic<bool> between the class holding the histograms and the DQMToken. When the token gets made, it sets the bool to true. When the token gets destroyed, it sets the bool to false. To determine if a given histogram can be re-used, the DQMStore would just look at the value of the bool. In that way, no callbacks are needed.

@schneiml
Copy link
Contributor Author

schneiml commented Dec 6, 2019

@Dr15Jones hmm, that way the dependency could be avoided. I still think this sounds rather fragile, esp. since the correctness for legacy modules depends on the cleanup happening at exactly the right time (if a module holds a ME/TH1 pointer and expects it to stay valid across run/lumi boundaries, we need to be really sure to never allocate a copy).

Since you provide the signal in #28562, I guess I'll just go with that. Thanks a lot!

@schneiml
Copy link
Contributor Author

schneiml commented Dec 9, 2019

My latest development branch master...schneiml:dqm-prepare-for-stream-v3 now uses the new callbacks, and that seems to work well. Closing this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants