
Intentionally drop job event websocket messages in excess of 30 per second (configurable) #10053

Merged
20 commits
Jun 8, 2021

Conversation

AlanCoding
Member

SUMMARY

The UI no longer follows the latest job events from websocket messages. Because of that, there's no reason to send messages for all events if the job event rate is high.

I used 30 because this is the number of events that I guesstimate will show in one page in the UI.

Needs the setting added in the UI.

This adds skip_websocket_message to the event's event_data. We could promote it to a top-level key for job events if that is preferable aesthetically. Doing this allows us to test the feature without having to connect a websocket client. Ping @mabashian @chrismeyersfsu

ISSUE TYPE
  • Feature Pull Request
COMPONENT NAME
  • API
  • UI
ADDITIONAL INFORMATION

Scenario walkthrough:

A job is producing 1,000 events per second. The user launches it, and the screen fills up in, say, 1/4 of a second. The scrollbar indicates content beyond the bottom of the screen. For the next 3/4 of a second the scrollbar stays still. After that, it updates to the current line number that the job is on, and it continues to update the length of the output effectively once per second.

@AlanCoding AlanCoding changed the title [WIP] Intentionally drop job event websocket messages up to 30 per second (configurable) [WIP] Intentionally drop job event websocket messages in excess of 30 per second (configurable) Apr 29, 2021

@AlanCoding
Member Author

AlanCoding commented Apr 30, 2021

I put together a Python expression to give a visualization for real jobs:

' '.join([{True: 'x', False: 'o'}.get('skip_websocket_message' in e.event_data) for e in JobEvent.objects.filter(job_id=37).order_by('created')[:1000]])

For the job I'm testing locally, the average event rate is 426 per second, much higher than the 30-per-second ceiling.

I tested 2 new algorithms on top of my first-pass approach, which I will try to explain here (a minimal sketch of algorithm (2) follows the list).

  1. First-pass algorithm, bursting: adds the message timestamp to a deque when a message is emitted, skips messages when the oldest timestamp is >1 second old, and clears the queue (except for the oldest timestamp) whenever a message is skipped
  2. Combined maximum-instant and maximum-average rate: emits a message in either of 2 scenarios: when the oldest timestamp in the deque is >1 second old, or when the newest timestamp is >1/30 seconds old (assuming the setting is set to 30)
  3. Maximum instant rate only: aside from unconditionally allowing the first 100 events, allows every event where the last emitted message was >1/30 seconds ago
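
For illustration, here is a minimal, self-contained sketch of algorithm (2). The class and method names are hypothetical, but the deque logic follows the description above (and the any([...]) block that appears in the tasks.py diff later in this thread):

import time
from collections import deque

class CombinedRateLimiter:
    # Sketch of algorithm (2): emit when either the average rate or the
    # instant rate is under the configured limit.
    def __init__(self, max_rate=30):
        self.max_rate = max_rate
        self.timings = deque(maxlen=max_rate)  # timestamps of recent emits

    def should_emit(self):
        now = time.time()
        emit = (
            len(self.timings) < self.timings.maxlen  # queue not yet full
            or now - self.timings[0] > 1.0  # oldest of the last 30 emits was >1s ago
            or self.max_rate * (now - self.timings[-1]) > 1.0  # last emit was >1/30s ago
        )
        if emit:
            self.timings.append(now)
        return emit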

Display for method (2):

o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x o x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x o o o o o o o o o o o o o o x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x o o o o o o o o o o x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x o x x o x x x x x x o o o o o o o o o o o o x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x o o o x o o o o o o o x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x o o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o o o x x x x x x x x o o o x x x x o x x x o x x o x x o o o o o o x x x x o x x x x x x x x x x x x x x x x x x o o x x x o x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x o o o o o o o o o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o o o x o o o o o o x x x x x x x x o o o o o o x x x x x x x x x x o x x x o o o x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x o o o o o o o o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x

This sends messages for 6.4% of events

Display for method (3):

o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x o x x o x x o x x o x x o x x o x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x

This sends messages for 2.1% of events


Even though it is the most complex, I have a preference for algorithm (2). My reasoning is that if the UI tried to "follow" standard out, this would give mostly correct behavior, because it can elegantly handle a burst of events, yet it still wouldn't fail to update the scrollbar within a human-perceptible timeframe. Bursts of events are expected to be quite common, so it would be nice to keep the door open to this sort of "following".

I could even see the argument for adding clearing of the queue to algorithm (2), because patterns like " x o o o x o o o o o o o x x x" exist. If you have a stretch of standard out, you would prefer to not have holes in it, and a literal implementation of maximum-average-rate will not give you that. Artificially zeroing out the queue could patch this up at the cost of sending a higher fraction of events, and I think that's fine. One tradeoff we could make is to lower the max rate to something like 10 events per second, but divorce that from the queue length. This would give 2 separately configurable items: the max rate, and the dead-band number of events to stream before enforcing the rate.

However, I do think we should rule out algorithm (1), because it gives periods of up to 1 second with no output, and that is human-perceivable.

@AlanCoding
Member Author

Scratch what I said about "adding clearing of the queue to algorithm (2)".

If we wanted to do some kind of standard-out "following" in the future, that becomes a conversation with different metrics involved. For that, you need a notion of a page size (again, we can guesstimate 30) and will need to accept longer delays between events... probably on the order of 1/2 second instead of the 1/30th second this PR uses right now.

I have updated this PR to reflect option (2) as laid out before. I have also used log messages to measure the event rate, and it tends to clock in at 30 to 31 messages/second; because of the less restrictive criteria, it overshoots the target by just a little. I'm happy with it, just looking for feedback.

@AlanCoding AlanCoding changed the title [WIP] Intentionally drop job event websocket messages in excess of 30 per second (configurable) Intentionally drop job event websocket messages in excess of 30 per second (configurable) May 4, 2021
@AlanCoding AlanCoding marked this pull request as ready for review May 4, 2021 17:07
@@ -142,7 +142,8 @@ def flush(self, force=False):
                 logger.exception('Database Error Saving Job Event')
         duration_to_save = time.perf_counter() - duration_to_save
         for e in events:
-            emit_event_detail(e)
+            if not e.event_data.get('skip_websocket_message', False):
+                emit_event_detail(e)
Member Author

It's not completely unreasonable that we might change the .get to a .pop here, which would remove the key from the API. It is useful for debugging purposes, but I don't know that we want it in production.

Member

I'm pro change to pop.
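
A minimal sketch of that suggestion, reusing the flush() loop from the hunk above (placement is hypothetical):

for e in events:
    # .pop removes the key entirely, so it never shows up in the API,
    # while still gating the websocket emit.
    if not e.event_data.pop('skip_websocket_message', False):
        emit_event_detail(e)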

@AlanCoding
Member Author

Two issues that I can fairly easily address in a new commit.

  • We should unconditionally exclude any events where stdout is blank and start_line==end_line, which are runner_on_start events and don't impact the UI
  • We should unconditionally include the minimal events
MINIMAL_EVENTS = set(['playbook_on_play_start', 'playbook_on_task_start', 'playbook_on_stats', 'EOF'])

This is a performance-sensitive decision, but I'd point toward handling these in the event_handler, so that these unconditional events can count toward the current event rate calculations.
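
A hedged sketch of those two rules as they might sit in front of the rate-limit check (the function name is hypothetical; the constant is the one quoted above):

MINIMAL_EVENTS = set(['playbook_on_play_start', 'playbook_on_task_start', 'playbook_on_stats', 'EOF'])

def websocket_override(event_data):
    # Unconditionally include the minimal events.
    if event_data.get('event') in MINIMAL_EVENTS:
        return True
    # Unconditionally exclude blank events (e.g. runner_on_start): no stdout
    # and no line movement, so they don't impact the UI.
    if event_data.get('stdout') == '' and event_data['start_line'] == event_data['end_line']:
        return False
    return None  # no override; fall through to the rate-limit algorithm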

@AlanCoding
Member Author

Updates:

The UI has added back a "Follow" button. If the button is clicked:

  • output window tails the most recent job output
  • it will not show a scroll bar
  • every ~1 second it will update the output, populated by an API request

@AlanCoding
Member Author

I finished some touchups to the logging, so I wanted to share what it illustrates:

https://gist.github.com/AlanCoding/758ec1b7c4185af344880c0a62b2b8f6

The measured rates only consider the messages that were actually sent. What you see is that when the instant rate is too high, events are not sent until enough time has passed for the instant rate to come back under the threshold.

@AlanCoding
Member Author

Update specifically for the UI engineers @mabashian @jakemcdermott:

To make "following" look okay while drawing events from websocket messages, I return to my prior point that algorithm (1) is required. I had to do some updates to apply it on top of the current branch.

Algorithm (1) explanation, bursting:
We send events in continuous "bursts". The minimum length of a burst is 30 events. We don't start a burst if the event would push the average rate over 30/second. Thus, fast-producing jobs will quickly "spend" their 30 events, and further events will be skipped until the 1-second mark. This repeats every second for as long as the job event rate is over the 30/second limit.
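
As a rough, self-contained sketch of this bursting behavior (class and attribute names are illustrative; the actual change is the tasks.py diff below):

import time
from collections import deque

class BurstLimiter:
    def __init__(self, burst_size=30):
        self.burst_size = burst_size
        self.timings = deque(maxlen=burst_size)  # timestamps of emitted events
        self.counter = 0

    def should_emit(self):
        self.counter += 1
        now = time.time()
        if self.counter <= self.burst_size:
            self.timings.append(now)
            return True  # grace period for the first page of output
        too_fast = now - self.timings[0] < 1.0              # oldest emit was <1 second ago
        is_full = len(self.timings) == self.timings.maxlen  # this burst is fully "spent"
        inside_window = len(self.timings) > 2               # already committed to a burst
        if is_full and too_fast:
            # Reset the window but keep the oldest timestamp, so further
            # events are skipped until the 1-second mark.
            oldest = self.timings[0]
            self.timings.clear()
            self.timings.append(oldest)
        emit = (not too_fast) or inside_window
        if emit:
            self.timings.append(now)
        return emit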

I had to adjust my visualization expression to account for blank events:

' '.join([{True: 'x', False: 'o'}.get('skip_websocket_message' in e.event_data) for e in JobEvent.objects.filter(job_id=302).order_by('created')[:1000] if e.stdout and e.start_line != e.end_line])

For a demo chatty_tasks job with algorithm (1) that looks like this:

o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x

The code for this is:

diff --git a/awx/main/tasks.py b/awx/main/tasks.py
index 664da13949..8eda5b1c45 100644
--- a/awx/main/tasks.py
+++ b/awx/main/tasks.py
@@ -1230,32 +1230,32 @@ class BaseTask(object):
         if self.recent_event_timings:
             cpu_time = time.time()
             first_window_time = self.recent_event_timings[0]
-            last_window_time = self.recent_event_timings[-1]
+            queue_size = len(self.recent_event_timings)
 
             if event_data.get('event') in MINIMAL_EVENTS:
                 should_emit = True  # always send some types like playbook_on_stats
             elif event_data.get('stdout') == '' and event_data['start_line'] == event_data['end_line']:
                 should_emit = False  # exclude events with no output
+            elif event_data.get('counter', 0) < 30:
+                should_emit = True  # grace period for first page of output
             else:
-                should_emit = any(
-                    [
-                        # if the 30th most recent websocket message was sent over 1 second ago
-                        cpu_time - first_window_time > 1.0,
-                        # if the very last websocket message came in over 1/30 seconds ago
-                        self.recent_event_timings.maxlen * (cpu_time - last_window_time) > 1.0,
-                        # if the queue is not yet full
-                        len(self.recent_event_timings) != self.recent_event_timings.maxlen,
-                    ]
-                )
+                too_fast = bool(cpu_time - first_window_time < 1.0)  # if the 30th most recent was sent < 1 second ago
+                is_full = bool(queue_size == self.recent_event_timings.maxlen)  # if the queue is full
+                inside_window = bool(queue_size > 2)  # if we have committed to sending a window, we have to finish it
+
+                should_emit = (not too_fast) or inside_window
+
+                if is_full and too_fast:
+                    self.recent_event_timings.clear()
+                    self.recent_event_timings.append(first_window_time)
 
             logger.debug(
-                'Job {} event {} websocket send {}, queued: {}, rate - avg: {:.3f}, last: {:.3f}'.format(
+                'Job {} event {} websocket send {}, queued: {}, rate: {:.3f}'.format(
                     self.instance.id,
                     event_data.get('counter', 0),
                     should_emit,
-                    len(self.recent_event_timings),
-                    30.0 / (cpu_time - first_window_time),
-                    1.0 / (cpu_time - last_window_time),
+                    queue_size,
+                    30.0 / (cpu_time - first_window_time)
                 )
             )

This works well for me so far.

event_data.get('counter', 0),
should_emit,
len(self.recent_event_timings),
30.0 / (cpu_time - first_window_time),
Member

I'm not entirely sure, but I think you can end up dividing by 0 here.

Member Author

At this point, I think it makes sense to just delete the log message.

@softwarefactory-project-zuul
Contributor

Build succeeded.

@softwarefactory-project-zuul
Contributor

Build succeeded (gate pipeline).
