-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Intentionally drop job event websocket messages in excess of 30 per second (configurable) #10053
Intentionally drop job event websocket messages in excess of 30 per second (configurable) #10053
Conversation
This comment has been minimized.
This comment has been minimized.
I put together a python expression to give a visualization for real jobs. ' '.join([{True: 'x', False: 'o'}.get('skip_websocket_message' in e.event_data) for e in JobEvent.objects.filter(job_id=37).order_by('created')[:1000]]) For the job I'm testing locally, the average event rate is 426 per second, which is much higher than the ceiling, which is 30 per second. I tested 2 new algorithms, which I will try to explain here.
Display for method (2): o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x o x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x o o o o o o o o o o o o o o x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x o o o o o o o o o o x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x o x x o x x x x x x o o o o o o o o o o o o x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x o o o x o o o o o o o x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x o o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o o o x x x x x x x x o o o x x x x o x x x o x x o x x o o o o o o x x x x o x x x x x x x x x x x x x x x x x x o o x x x o x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x o o o o o o o o o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o o o x o o o o o o x x x x x x x x o o o o o o x x x x x x x x x x o x x x o o o x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x o o o o o o o o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x This sends messages for 6.4% of events Display for method (3): o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x o x x o x x o x x o x x o x x o x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x This sends messages for 2.1% of events Even though it is the most complex, I have a preference for algorithm (2). My reasoning is that if the UI tried to do following of standard out, this would give a mostly correct behavior because it could elegantly handle a burst of events. Yet, it still wouldn't fail to update the scroll bar within a human perceptible timeframe. Bursts of events are expected to be quite common, so it would be nice to keep the door open to this sort of "following". I could even see the argument for adding clearing of the queue to algorithm (2) because patterns like " x o o o x o o o o o o o x x x" exist. If you have a stretch of standard out, you would prefer to not have holes in it, and a literal implementation of maximum-average-rate will not give you that. Thus, artificially zero-ing out the queue could patch this up at the cost of sending a higher fraction of events, and I think that's fine. One tradeoff we could do is to lower the max rate to something like 10 events per second, but divorce that from the queue length. This would have 2 separate configurable items - the max rate and the dead-band event number to stream before enforcing the rate. However, I do think we should rule out algorithm (1), because it gives periods of time with no output for times of up to 1 second, and this is human-perceivable. |
Scratch what I said about "adding clearing of the queue to algorithm (2)". If we wanted to do some kind of standard out "following" in the future, that becomes a conversation with different metrics involved. For that, you need a notion of a page size (again, we can guestimate 30) and will need to accept longer delays between events... probably on the order of 1/2 second instead of 1/30th second which this PR does right now. I have updated this PR to reflect the option (2) which I laid out before. I have also used log messages to measure the event rate, and it tends to clock in at 30 to 31 messages/second. Because of the less restrictive criteria it overshoots the target by just a little bit. I'm happy with it, just looking for feedback. |
awx/main/dispatch/worker/callback.py
Outdated
@@ -142,7 +142,8 @@ def flush(self, force=False): | |||
logger.exception('Database Error Saving Job Event') | |||
duration_to_save = time.perf_counter() - duration_to_save | |||
for e in events: | |||
emit_event_detail(e) | |||
if not e.event_data.get('skip_websocket_message', False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not completely unreasonable that we might change the .get
to a .pop
here and that'll remove the key from the API. It is useful for debugging purposes, but I don't know that we want it for production.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pro change to pop.
Two issues that I can fairly easily address in a new commit.
MINIMAL_EVENTS = set(['playbook_on_play_start', 'playbook_on_task_start', 'playbook_on_stats', 'EOF']) This is a performance-sensitive decision, but I'd point toward handling these in the |
Updates: The UI has added back in a "Follow" button. If the button is clicked:
|
I finished some touchups to the log, so I wanted to share what it illustrates https://gist.github.com/AlanCoding/758ec1b7c4185af344880c0a62b2b8f6 The measured rates are only considering the messages that were sent. So what you see is that the instant rate is too high, so events are not sent until enough time has passed that the instant rate comes under the threshold. |
Update specifically for the UI engineers @mabashian @jakemcdermott: To make "following" look okay while drawing events from websocket messages, I return to my prior point that the algorithm (1) is required. I had to do some updates to apply it on top of the current branch here. Algorithm (1) explanation, bursting I had to adjust my visualization expression to account for blank events: ' '.join([{True: 'x', False: 'o'}.get('skip_websocket_message' in e.event_data) for e in JobEvent.objects.filter(job_id=302).order_by('created')[:1000] if e.stdout and e.start_line != e.end_line]) For a demo chatty_tasks job with algorithm (1) that looks like this: o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x The code for this is: diff --git a/awx/main/tasks.py b/awx/main/tasks.py
index 664da13949..8eda5b1c45 100644
--- a/awx/main/tasks.py
+++ b/awx/main/tasks.py
@@ -1230,32 +1230,32 @@ class BaseTask(object):
if self.recent_event_timings:
cpu_time = time.time()
first_window_time = self.recent_event_timings[0]
- last_window_time = self.recent_event_timings[-1]
+ queue_size = len(self.recent_event_timings)
if event_data.get('event') in MINIMAL_EVENTS:
should_emit = True # always send some types like playbook_on_stats
elif event_data.get('stdout') == '' and event_data['start_line'] == event_data['end_line']:
should_emit = False # exclude events with no output
+ elif event_data.get('counter', 0) < 30:
+ should_emit = True # grace period for first page of output
else:
- should_emit = any(
- [
- # if 30the most recent websocket message was sent over 1 second ago
- cpu_time - first_window_time > 1.0,
- # if the very last websocket message came in over 1/30 seconds ago
- self.recent_event_timings.maxlen * (cpu_time - last_window_time) > 1.0,
- # if the queue is not yet full
- len(self.recent_event_timings) != self.recent_event_timings.maxlen,
- ]
- )
+ too_fast = bool(cpu_time - first_window_time < 1.0) # if 30the most recent was sent < 1 second ago
+ is_full = bool(queue_size == self.recent_event_timings.maxlen) # if the queue is full
+ inside_window = bool(queue_size > 2) # if we have comitted to sending a window, we have to finish it
+
+ should_emit = (not too_fast) or inside_window
+
+ if is_full and too_fast:
+ self.recent_event_timings.clear()
+ self.recent_event_timings.append(first_window_time)
logger.debug(
- 'Job {} event {} websocket send {}, queued: {}, rate - avg: {:.3f}, last: {:.3f}'.format(
+ 'Job {} event {} websocket send {}, queued: {}, rate: {:.3f}'.format(
self.instance.id,
event_data.get('counter', 0),
should_emit,
- len(self.recent_event_timings),
- 30.0 / (cpu_time - first_window_time),
- 1.0 / (cpu_time - last_window_time),
+ queue_size,
+ 30.0 / (cpu_time - first_window_time)
)
) This works well for me so far. |
awx/main/tasks.py
Outdated
event_data.get('counter', 0), | ||
should_emit, | ||
len(self.recent_event_timings), | ||
30.0 / (cpu_time - first_window_time), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I not entirely sure, but I think you can end up dividing by 0 here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this point, I think it makes sense to just delete the log message.
Build succeeded.
|
Build succeeded.
|
Build succeeded.
|
When follow mode is enabled, fix the scroll position to the highest row so that the output panel is always displaying the latest events.
Always send websocket messages for high priority events like playbook_on_stats Never send websocket messages for events with no output unless they are a high priority event type
Build succeeded.
|
Build succeeded (gate pipeline).
|
SUMMARY
The UI no longer follows the latest job events from websocket messages. Because of that, there's no reason to send messages for all events if the job event rate is high.
I used 30 because this is the number of events that I guesstimate will show in one page in the UI.
Needs the setting added in the UI.
This adds
skip_websocket_message
to eventevent_data
. We could promote it to a top-level key for job events, if that is preferable aesthetically. Doing this allows us to test this feature without having to connect a websocket client. Ping @mabashian @chrismeyersfsuISSUE TYPE
COMPONENT NAME
ADDITIONAL INFORMATION
Scenario walkthrough:
a job is producing 1,000 events per second. User launches it, the screen fills up in, say 1/4 of a second. The scrollbar indicates content beyond the bottom of the screen. Now, for 3/4ths of a second, the scrollbar stays still. After that, it updates the scrollbar to the current line number that the job is on. The scrollbar continues to update the length of the output effectively once per second.