Closure of lumisections when concurrent lumisections are configured #42931
A new Issue was created by @smorovic Srecko Morovic. @makortel, @sextonkennedy, @rappoccio, @smuzaffar, @Dr15Jones, @antoniovilela can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here
assign core, daq
New categories assigned: core, daq. @Dr15Jones, @emeschi, @makortel, @smorovic, @smuzaffar you have been requested to review this Pull request/Issue and eventually sign. Thanks
One thing is worth noting before we continue. In the use case where one lumisection ends (let's call it N) and another starts (call it N+1) and there is only a little overlap between lumi N ending and lumi N+1 starting, the current behavior does not cause much of a problem. It only becomes a problem when the overlap is large enough that N ends after N+1 ends. This leads to the question of whether this will actually happen in realistic operations, or whether it is something noticed in testing that is not significant in realistic operational conditions. I am asking; I don't know the answer.
I think the focus of discussion in Mattermost was on the lumiQueue_ that enforces the maximum number of luminosity blocks that can be processed concurrently. I don't think that is the problem. I believe the constraint lies in the streamQueues_, if I am remembering correctly: the tasks that need to run to allow the luminosity block to end are waiting there. I am basing this on memory; it would take some time to look more closely and verify I am remembering correctly.
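A minimal sketch of that suspected constraint, using made-up types rather than CMSSW's actual queue classes: each stream drains its queue strictly in FIFO order, so a stream-end-lumi task enqueued behind a long event cannot run early, even when the concurrent-lumi limit itself is far from saturated.

```cpp
#include <chrono>
#include <deque>
#include <functional>
#include <iostream>
#include <thread>

// Hypothetical stand-in for a per-stream serial queue; not a CMSSW class.
struct StreamQueue {
  std::deque<std::function<void()>> fifo;
  void drain() {  // run tasks strictly one at a time, in push order
    while (!fifo.empty()) {
      fifo.front()();
      fifo.pop_front();
    }
  }
};

int main() {
  int lumisInFlight = 0;
  const int maxConcurrentLumis = 2;  // the lumiQueue_-style limit

  StreamQueue stream0;
  ++lumisInFlight;  // lumi N begins: well under the limit, so not the bottleneck
  stream0.fifo.push_back([] {
    std::cout << "stream 0: long event from lumi N\n";
    std::this_thread::sleep_for(std::chrono::seconds(2));  // long-running event
  });
  stream0.fifo.push_back([&] {
    std::cout << "stream 0: streamEndLumi(N) can only run now\n";
    --lumisInFlight;
  });

  // Even though lumisInFlight < maxConcurrentLumis, the end-lumi task cannot
  // jump ahead of the long event in the per-stream FIFO.
  std::thread worker([&] { stream0.drain(); });
  worker.join();
  std::cout << lumisInFlight << " of " << maxConcurrentLumis << " lumi slots in use\n";
}
```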
I copy here one comment from @wddgit from the Mattermost thread, and my corollary.
@smorovic With the present behavior, in case 2 (a long-running event in lumi N), are concurrent lumis still more useful than processing lumis serially (i.e. lumi N+1 starts processing only after lumi N has finished)?
On Sunday this happened during heavy-ion stable beams while testing a new HLT menu. There were two occurrences, within around 20 lumisections of each other, of events taking quite a long time to finish (over 15 min). It blocked only two of around 1500 processes, so the impact is small, since it took very little CPU away from the full HLT (not sure if that could change with the heavy-ion ramp-up).
Most of my comments above have been discussing item 2 listed at the top of this Issue. I do not understand item 1. I am not aware of any hard scheduling requirement that would hold lumi N open until lumi N+2 was ready to start; I'm surprised lumi N does not close as soon as its events are complete. The only possibility I can think of is that the tasks are spawned and ready to run, but in the competition over which task runs next they just always lose, and there is never a moment when a thread is available without a competing task. Maybe, seems unlikely, but maybe... Someone should spend some time debugging whether that is really happening.
Could you give more details about this case? What actions exactly do you refer to with "lumisection is opened" and "lumisection is closed"? Is the behavior correlated with long-running events, or does it occur independently? Could you remind me what Source is being used in DAQ?
I mean the globalBeginLumi and globalEndLumi transitions. This is driven by the input source when it returns new lumisections. I tried in the last few days to reproduce it in a simple example, but in the end wasn't successful (though I didn't yet try with many streams/threads, just with two). This could also happen, e.g., if in our system the lumisection boundary somehow isn't always detected by the input source when there are no events, but I don't think that is the case, based on the analysis I did (below). We also have monitoring which samples stream states every ~2 seconds. From this it can be seen whether the last activity in the monitoring service was {pre,post}Event, {pre,post}Begin{global,stream}Lumi, etc. The sampling also rotates through all the streams, so each snapshot covers a different one. This shows that the source did return a new lumi N+1, but not all framework streams switched to it (some stayed in postEvent). I'll see if I can reproduce it more simply (I think I did have it at some point in a simple setup), but I'm not sure it will be an example that can work outside of the DAQ environment.
I've read through the last post several times, and I do not understand what is going on. The only theory we have so far as to why lumi N is delayed in closing is that competing tasks keep the threads busy, and the tasks that would close lumi N don't run because of the competition. But you are saying there are no events in the next lumi and the system is just sitting there waiting; in that case it seems like TBB would run all the tasks and everything would be done, including the tasks to close lumi N. One question that occurs to me: when the source is waiting, is the Framework in the middle of a function call, like getNextItemType? Is the Framework just waiting for that or some other function to return while the source waits for more events?
Yes. On the transition to a new lumisection (N+1), if it is without events, checkNext will return "Next::kEvent" only once; on the next call it should stay in that loop for a while, until N+2.
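For illustration, a sketch of that return pattern; Next and checkNext are modeled loosely on RawInputSource, but the bodies here are assumptions, not the actual DAQ source code.

```cpp
#include <atomic>

enum class Next { kEvent, kStop };

std::atomic<bool> boundaryToLumiNplus2{false};  // set by the input side

Next checkNext(bool firstCallAfterTransitionToNplus1) {
  if (firstCallAfterTransitionToNplus1) {
    return Next::kEvent;  // returned once so the framework opens lumi N+1
  }
  // An empty lumi has no events to hand out, so subsequent calls stay in
  // this loop until the boundary to lumi N+2 (set elsewhere) is detected.
  while (!boundaryToLumiNplus2.load()) {
  }
  return Next::kEvent;
}
```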
Hi @makortel. There is a sleep in the input thread, separate from the framework and the TBB pool. It sends the new lumi to the 'main' waiting loop, but the framework does not open that lumisection for the next 10 seconds (~ the sleep time). Concerning the lumisection closure, sometimes the previous lumisection manages to close immediately when checkNext returns; at other times this happens only after 10 seconds. It seems to also happen with only 1 concurrent lumisection allowed.
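A self-contained sketch of the timing described here, with invented names standing in for the DAQ source pieces: an input thread sleeps ~10 s before announcing lumi N+1, while a framework-side call sits blocked until the announcement arrives.

```cpp
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex mtx;
std::condition_variable cv;
int announcedLumi = 1;  // last lumi delivered by the input side

// Stand-in for the input thread: separate from the framework and TBB pool.
void inputThread() {
  std::this_thread::sleep_for(std::chrono::seconds(10));  // the sleep
  {
    std::lock_guard<std::mutex> lock(mtx);
    announcedLumi = 2;  // deliver lumi N+1 to the waiting loop
  }
  cv.notify_all();
}

// Stand-in for the framework-side wait inside a checkNext-style call.
int waitForNextLumi(int knownLumi) {
  std::unique_lock<std::mutex> lock(mtx);
  cv.wait(lock, [&] { return announcedLumi > knownLumi; });
  return announcedLumi;
}

int main() {
  std::thread input(inputThread);
  auto t0 = std::chrono::steady_clock::now();
  int next = waitForNextLumi(1);  // blocks for ~10 s, like the observed delay
  auto dt = std::chrono::duration_cast<std::chrono::seconds>(
      std::chrono::steady_clock::now() - t0);
  std::cout << "lumi " << next << " became visible after " << dt.count() << " s\n";
  input.join();
}
```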
Thanks @smorovic. I think we have now understood why the behavior is what you observe. We need to think more about whether and how to address the causes, as there would be some non-straightforward consequences.
There is a new unit test that reproduces the first item described in the comment at the top of this Issue; see pull request #43018. It is not a fix, but we understand the behavior now. While checkNext is waiting for the next event, it holds open a task running in the source serial queue, and this prevents the lumi from closing. I don't think there are any negative consequences to this other than delaying the closing of the lumi (it only happens online, when checkNext does not return quickly enough). We're still discussing what to do about it. Any change would be non-trivial in a fairly complex part of the code.
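In other words, a toy model of the mechanism (made-up names, not the framework code): the task performing the checkNext wait occupies the source serial queue, and the task that would close lumi N sits behind it in the same queue.

```cpp
#include <chrono>
#include <deque>
#include <functional>
#include <iostream>
#include <thread>

int main() {
  // Hypothetical name; models strictly serial, in-order execution.
  std::deque<std::function<void()>> sourceSerialQueue;

  sourceSerialQueue.push_back([] {
    // Task 1: the checkNext-style wait for the next event or lumi boundary.
    std::this_thread::sleep_for(std::chrono::seconds(3));
    std::cout << "checkNext returned\n";
  });
  sourceSerialQueue.push_back([] {
    // Task 2: would close lumi N, but is stuck behind task 1 in the queue.
    std::cout << "globalEndLumi(N) runs only now\n";
  });

  // Serial-queue semantics: one task at a time, in the order they were queued.
  for (auto& task : sourceSerialQueue)
    task();
}
```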
PR #43240 should fix item 1 in the opening comment of this issue. It does not help with the other one. Please report if you're still seeing the problem after this gets into the release you are using. If you are set up to run a test with this easily, that might be worth doing at some point.
After some discussion, we decided to close #43240 and replace it with a new pull request, #43263. (Kind of repeating myself...) This should fix item 1 in the opening comment of this issue. It does not help with the other one. Please report if you're still seeing the problem after this gets into the release you are using. If you are set up to run a test with this easily, that might be worth doing at some point.
The fix was merged. I expect it to show up in 14_0_0. I was not planning to backport (unless there is some urgent need and someone requests it).
Sounds good.
PR #43522 was merged recently and should be included in release 14_1_0_pre2. This should fix item 2 in the opening comment of this issue. It allows Streams to skip a lumi if all of its Events have been processed by other Streams and at least one Stream has already processed the lumi. This lets the Framework close a lumi whose Events are complete even if a Stream is busy processing an extremely long Event from an earlier lumi.
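Schematically, the new behavior amounts to a check like the following (illustrative pseudologic with made-up names, not the actual code from PR #43522):

```cpp
// Hypothetical bookkeeping, for illustration only.
struct LumiStatus {
  int pendingEvents = 0;             // Events of this lumi not yet processed
  bool processedByAnyStream = false;
};

// A Stream about to begin this lumi may skip it when nothing is left to do
// and some other Stream has already processed the lumi.
bool streamMaySkipLumi(const LumiStatus& lumi) {
  return lumi.pendingEvents == 0 && lumi.processedByAnyStream;
}
```

Once every Stream has either finished or skipped the lumi, nothing holds it open, so the end-lumi transition can run even while one Stream is still inside a long Event from an earlier lumi.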
At least for actions from our side, I think this issue can now be closed.
+core |
I am opening this issue based on a discussion started in Mattermost.
In DAQ/HLT, lumisections come sequentially from the input source with increasing lumisection index. Once events from a new lumisection are queued by the source, events from an older lumisection will no longer arrive. It is also possible for the source to open a lumisection without delivering any events (cycling through empty lumisections), but the same incremental ordering is respected.
When enabling two concurrent lumisections in the framework, we see two independent issues:

1. A lumisection is sometimes not closed as soon as all of its events are complete; lumi N can stay open until lumi N+2 is ready to start.
2. A single long-running event in lumi N keeps lumi N open even after lumi N+1 has finished.