
Closure of lumisections when concurrent lumisections are configured #42931

Open
smorovic opened this issue Oct 2, 2023 · 27 comments


@smorovic
Contributor

smorovic commented Oct 2, 2023

I am opening this issue based on a discussion started in Mattermost.

In DAQ/HLT, lumisections arrive sequentially from the input source with increasing lumisection index. Once events from a new lumisection are queued by the source, events from an older lumisection will no longer arrive. It is also possible that the source opens a lumisection without delivering any events (cycling through empty lumisections), but the same incremental ordering is respected.

When enabling two concurrent lumisections in the framework, we see two independent issues:

  1. Lumisections that no longer receive events are closed late: if lumisection N is opened and its events fully processed, then lumisection N+1, and then lumisection N+2, lumisection N will be closed only when N+2 is opened. This increases latency, because we would like N to be closed around the time N+1 is opened (as soon as it has finished processing events from N). Based on the discussion so far, this appears to be related to the lumisection task in the internal queue being resumed and allowed to complete only when a new lumisection is opened and queued; since there are two slots, N is dropped and closed only when N+2 is added.
  2. When an event in lumisection N takes a long time to finish, lumisection N+1 will be opened and processed (as expected), but N+2 will not be opened even after N+1 has finished all its processing. This seems to be related to using a serial queue (tbb concurrent_queue) within the LimitedTaskQueue, where the only way to free a slot is to drop the oldest lumisection, which in this case is still busy. Here it would be better to be able to leave N behind and continue processing other lumisections.
@cmsbuild
Contributor

cmsbuild commented Oct 2, 2023

A new Issue was created by @smorovic Srecko Morovic.

@makortel, @sextonkennedy, @rappoccio, @smuzaffar, @Dr15Jones, @antoniovilela can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@smorovic
Contributor Author

smorovic commented Oct 2, 2023

assign core, daq

@cmsbuild
Contributor

cmsbuild commented Oct 2, 2023

New categories assigned: core,daq

@Dr15Jones,@emeschi,@makortel,@smorovic,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

@wddgit
Contributor

wddgit commented Oct 2, 2023

One thing is worth noting before we continue. In the use case where one lumisection ends (let's call it N) and another starts (call it N+1) and there is only a little overlap between lumi N ending and lumi N+1 starting, the current behavior does not cause much of a problem. It only becomes a problem when the overlap is large enough that N ends after N+1 ends. This leads to the question of whether this will actually happen in realistic operations, or whether it is something noticed in testing that is not significant under realistic operational conditions. I am asking; I don't know the answer.

@wddgit
Contributor

wddgit commented Oct 2, 2023

I think the focus of discussion in Mattermost was on the lumiQueue_ that enforces the maximum number of luminosity blocks that can be processed concurrently. I don't think that is the problem.

I believe the constraint lies in the streamQueues_, if I am remembering correctly. I think the tasks that need to run to allow the luminosity block to end are waiting there. I am basing this on my memory; it would take some time to look more closely and verify that I am remembering correctly.

@makortel
Contributor

makortel commented Oct 2, 2023

I copy here one comment from @wddgit from the Mattermost thread:

Here is the issue. All streams must complete all lumis. The oldest lumi (lumi N in our discussion) is keeping at least one stream busy because it is still busy processing its last event or its end lumi work, so that stream cannot process lumi N+1 even if it only does begin stream lumi and end stream lumi transitions and there are no events for that stream. Actually this does prevent lumi N+1 from ending until lumi N completes (not as long as possible, but until earlier lumis complete). If I am remembering this correctly.

and my corollary

The design is that every stream module sees every lumi (so as to mimic the behavior of legacy modules). Breaking that mental model would have consequences that need to be thought through carefully.

@makortel
Contributor

makortel commented Oct 2, 2023

@smorovic With the present behavior, in case 2 (a long-running event in lumi N), are concurrent lumis still more useful than processing lumis serially (i.e. lumi N+1 starts processing only after lumi N has finished)?

@smorovic
Contributor Author

smorovic commented Oct 2, 2023

On Sunday this happened during heavy-ion stable beams when testing a new HLT menu. There were two occurrences, within around 20 lumisections of each other, of events taking quite a long time to finish (over 15 min). It blocked only two processes out of around 1500, so the impact was small, since it was taking very little CPU out of the full HLT (not sure if that could change with the heavy-ion ramp-up).
I think concurrency is still useful, especially if, apart from events taking minutes, we also have distribution tails of events taking seconds. Those would mostly be absorbed within the ~1 LS concurrency window without stalling the next lumi.
The concurrency could also be raised a bit above 2 if this gets worse. But on that side we are limited by how many lumisections we can buffer, because they take space on local ramdisks and are freed only when they are closed.

@wddgit
Contributor

wddgit commented Oct 2, 2023

Most of my comments above have been discussing item 2 listed at the top of this Issue.

I do not understand item 1. I am not aware of any hard scheduling requirement that would hold lumi N open until lumi N+2 was ready to start. I'm surprised lumi N does not close as soon as its events are complete. The only possibility I can think of is that the tasks are spawned and ready to run, but in the competition over which task runs next they just always lose, and there is never a time when a thread is available without a competing task. Maybe; it seems unlikely, but maybe. Someone should spend some time debugging whether that is really happening.

@makortel
Contributor

makortel commented Oct 5, 2023

@smorovic

  1. Lumisections that no longer receive events are closed late: If lumisection N is opened and events fully processed, then lumisection N + 1, and then lumisection N+2: the lumisection N will be closed only when N+2 is opened. This increases latency because we would like to have N closed around the time N+1 is opened (as soon after it finished processing events from N).

Could you give more details about this case? What actions exactly do you refer to with "lumisection is opened" and "lumisection is closed"? Is the behavior correlated with long-running events, or does it occur independently?

Could you remind me what Source is being used in DAQ?

@smorovic
Contributor Author

smorovic commented Oct 5, 2023

I mean the globalBeginLumi and globalEndLumi transitions. These are driven by the input source when it returns new lumisections.
We use FedRawDataInputSource.
Opening of a lumisection with or without a new event (I think the case without an event is when setEventCached is not set) happens here:
https://github.com/cms-sw/cmssw/blob/master/EventFilter/Utilities/src/FedRawDataInputSource.cc#L287

I tried in the last few days to reproduce it in a simple example, but in the end wasn't successful (though I haven't yet tried with many streams/threads, just with two).
In the production system we can clearly see it in HLT latency monitoring. HLT cycles through lumisections at regular ~23 s intervals, but sometimes two lumisections are closed together, around ~45 seconds after the previous one, and this is quite common when the input event rate to the cluster is low (so processes see few or zero events in a lumisection).

This could also happen e.g. if in our system lumisection boundary somehow isn't always detected by the input source even when there are no events, but I think it isn't the case based on analysis I did (below).

We also have monitoring which samples stream states every ~2 seconds. From this it can be seen whether the last activity in the monitoring service was {pre,post}Event, {pre,post}Begin{global,stream}Lumi, etc. The sampling also rotates through all the streams, so each snapshot covers a different one.
Since we don't log CMSSW transitions, this helped me analyze one process where the gap in endLumi happened. In lumisection N it had about 60 events and processed them, after which the state could be seen to be "postEvent" for at least a couple of streams. A few seconds later, around the time a new lumisection N+1 was opened (which for its whole duration had no events for that process), some streams changed state to postStreamEndLumi (which must be for lumi N), but some still remained in postEvent (which also must be N, because N+1 had no events).
Only 23 seconds later, when N+2 came, both N and N+1 got ended in CMSSW.

This shows that the source did return a new lumi N+1, but not all framework streams switched to it (some stayed in postEvent).
If I understood correctly, a lumisection remains open until all streams have opened and closed it, so that would be enough to postpone closure.

I'll see if I can reproduce it more simply (I think I did have it at some point in a simple setup), but not sure it will be an example that can work outside of the DAQ environment.

@wddgit
Contributor

wddgit commented Oct 6, 2023

I've read through the last post several times, and I do not understand what is going on. The only theory we have so far for why lumi N is delayed in closing is that competing tasks keep the threads busy, and the tasks that would close lumi N don't run because of the competition. But you are saying there are no events in the next lumi and the system is just sitting there waiting; it seems like TBB would run all the tasks and everything would be done, including the tasks to close lumi N. One question occurs to me: when the source is waiting, is the Framework in the middle of a function call, like getNextItemType? Is the Framework just waiting for that or some other function to return while the source waits for more events?

@smorovic
Contributor Author

smorovic commented Oct 6, 2023

Yes, edm::RawInputSource::Next FedRawDataInputSource::checkNext() has a loop inside, which will not return while nothing new appears on the input side.
It breaks out of the loop and returns when it has a new lumisection, an event, end of run, or a termination flag.

On transition to a new lumisection (N+1), if it has no events, checkNext will return "Next::kEvent" only once; on the next call it should sit in that loop for a while, until N+2.
If this call needs to return to each stream (or return more frequently), that could be a problem.

@smorovic
Contributor Author

Hi @makortel
I managed to reproduce problem 1 in a simple way. A commit in this branch (13_2_X) contains a RunBUFU2.sh script to do it, with relevant printouts:
https://github.com/smorovic/cmssw/tree/132x-testing-ls-transitions
smorovic@451d980

There is a sleep in the input thread, which is separate from the framework and the TBB pool. It sends a new lumi to the 'main' waiting loop (within checkNext), sleeps 10 s, and then also sends events from that lumisection.
On the new lumi, checkNext does return immediately to the external caller.

But the framework does not open that lumisection for the next 10 seconds (~ sleep time).

Concerning the lumisection closure, sometimes the previous lumisection manages to close immediately when checkNext returns; at other times this happens only after 10 seconds.

It seems it is also happening with only 1 concurrent lumisection allowed.

@makortel
Contributor

Thanks @smorovic. I think we have now understood why the behavior is what you observe. We need to think more if and how to address the causes, as there would be some non-straightforward consequences.

@wddgit
Contributor

wddgit commented Oct 13, 2023

There is a new unit test that reproduces the first item described in the comment at the top of this Issue. See pull request #43018. It is not a fix, but we understand the behavior now. While checkNext is waiting for the next event, it holds open a task running in the source serial queue, and this prevents the lumi from closing. I don't think there are any negative consequences other than delaying the closing of the lumi (this only happens online, when checkNext does not return quickly enough). We're still discussing what to do about this. Any change would be non-trivial in a fairly complex part of the code.

@wddgit
Contributor

wddgit commented Nov 9, 2023

PR #43240 should fix item 1 in the opening comment of this issue. It does not help with the other one. Please report it if you're still seeing the problem after this gets into the release you are using. If you are set up to run a test with this easily, that might be worth running at some point.

@smorovic
Contributor Author

smorovic commented Nov 9, 2023

Thank you.
When this ends up in a release, we can try it out in a global run online.
The problem can be seen in the form of fluctuations of +/- 1 lumi in closure time, as seen in the picture (cosmic run), and those should diminish with timely closure:
[image: lumisection closure-time fluctuations, cosmic run]

@Dr15Jones
Contributor

@smorovic the new #43174 is also intended to help understand the framework's scheduling behavior.

@wddgit
Contributor

wddgit commented Nov 13, 2023

After some discussion, we decided to close #43240 and replace it with a new pull request, #43263.

(kind of repeating myself...) This should fix item 1 in the opening comment of this issue. It does not help with the other one. Please report it if you're still seeing the problem after this gets into the release you are using. If you are set up to run a test with this easily, that might be worth running at some point.

@wddgit
Contributor

wddgit commented Nov 20, 2023

The fix was merged. I expect it to show up in 14_0_0. I was not planning to backport it (unless there is some urgent need and someone requests it).

@smorovic
Contributor Author

I ran a test today, and in typical conditions (low rate, and after the 1st lumi) the end-of-lumi merging latency stays low, taking a few seconds (no more >=1 LS spikes as before):

[image: ls_latency_14_X_patch]

@wddgit
Contributor

wddgit commented Nov 30, 2023

Sounds good.

@wddgit
Contributor

wddgit commented Mar 15, 2024

PR #43522 was merged recently and should be included in release 14_1_0_pre2. This should fix item 2 in the opening comment of this issue. It allows Streams to skip lumis if all the Events are processed by other Streams and at least one Stream already processed the lumi. This allows the Framework to close a lumi whose events are completed even if there is a stream busy processing an extremely long event from an earlier lumi.

@cmsbuild
Contributor

cmsbuild commented Mar 15, 2024

cms-bot internal usage

@wddgit
Contributor

wddgit commented Mar 15, 2024

At least for actions from our side, I think this issue can now be closed.

@makortel
Contributor

+core
