
Intentionally drop job event websocket messages in excess of 30 per second (configurable) #10053

Merged
20 commits
Jun 8, 2021

Conversation

AlanCoding
Member

SUMMARY

The UI no longer follows the latest job events from websocket messages. Because of that, there's no reason to send messages for all events if the job event rate is high.

I used 30 because this is the number of events that I guesstimate will show in one page in the UI.

Needs the setting added in the UI.

This adds skip_websocket_message to the event's event_data. We could promote it to a top-level key for job events if that is preferable aesthetically. Doing this allows us to test the feature without having to connect a websocket client. Ping @mabashian @chrismeyersfsu

ISSUE TYPE
  • Feature Pull Request
COMPONENT NAME
  • API
  • UI
ADDITIONAL INFORMATION

Scenario walkthrough:

A job is producing 1,000 events per second. The user launches it, and the screen fills up in, say, 1/4 of a second. The scrollbar indicates content beyond the bottom of the screen. For the next 3/4 of a second the scrollbar stays still. After that, it updates to the current line number that the job is on, and it continues to update the length of the output effectively once per second.

@AlanCoding AlanCoding changed the title [WIP] Intentionally drop job event websocket messages up to 30 per second (configurable) [WIP] Intentionally drop job event websocket messages in excess of 30 per second (configurable) Apr 29, 2021

@AlanCoding
Member Author

AlanCoding commented Apr 30, 2021

I put together a Python expression to give a visualization for real jobs:

' '.join([{True: 'x', False: 'o'}.get('skip_websocket_message' in e.event_data) for e in JobEvent.objects.filter(job_id=37).order_by('created')[:1000]])

For the job I'm testing locally, the average event rate is 426 per second, much higher than the 30-per-second ceiling.

I tested 2 new algorithms on top of my first-pass approach, which I will try to explain here (a minimal sketch of algorithm (2) follows the list).

  1. First-pass algorithm, bursting: adds the message timestamp to a deque when a message is emitted, skips messages when the oldest timestamp is >1 second old, and clears the queue (except for the oldest timestamp) whenever a message is skipped
  2. Combined maximum-instant and maximum-average rate: emits a message in either of 2 scenarios: when the oldest timestamp in the deque is >1 second old, or when the newest timestamp is >1/30 seconds old (assuming the setting is set to 30)
  3. Maximum instant rate only: aside from unconditionally allowing the first 100 events, allows every event where the last emitted message was >1/30 seconds ago
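
For illustration, here is a minimal, self-contained sketch of algorithm (2). The class and method names are hypothetical, but the deque logic follows the description above (and the any([...]) block that appears in the tasks.py diff later in this thread):

import time
from collections import deque

class CombinedRateLimiter:
    # Sketch of algorithm (2): emit when either the average rate or the
    # instant rate is under the configured limit.
    def __init__(self, max_rate=30):
        self.max_rate = max_rate
        self.timings = deque(maxlen=max_rate)  # timestamps of recent emits

    def should_emit(self):
        now = time.time()
        emit = (
            len(self.timings) < self.timings.maxlen  # queue not yet full
            or now - self.timings[0] > 1.0  # oldest of the last 30 emits was >1s ago
            or self.max_rate * (now - self.timings[-1]) > 1.0  # last emit was >1/30s ago
        )
        if emit:
            self.timings.append(now)
        return emit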

Display for method (2):

o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x o x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x o o o o o o o o o o o o o o x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x o o o o o o o o o o x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x o x x o x x x x x x o o o o o o o o o o o o x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x o o o x o o o o o o o x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x o o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o o o x x x x x x x x o o o x x x x o x x x o x x o x x o o o o o o x x x x o x x x x x x x x x x x x x x x x x x o o x x x o x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x o o o o o o o o o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o o o x o o o o o o x x x x x x x x o o o o o o x x x x x x x x x x o x x x o o o x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x o o o o o o o o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x

This sends messages for 6.4% of events

Display for method (3):

o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x o x x o x x o x x o x x o x x o x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o x x x x x x x x x

This sends messages for 2.1% of events


Even though it is the most complex, I have a preference for algorithm (2). My reasoning is that if the UI tried to "follow" standard out, this would give mostly correct behavior, because it can elegantly handle a burst of events, yet it still wouldn't fail to update the scrollbar within a human-perceptible timeframe. Bursts of events are expected to be quite common, so it would be nice to keep the door open to this sort of "following".

I could even see the argument for adding clearing of the queue to algorithm (2), because patterns like " x o o o x o o o o o o o x x x" exist. If you have a stretch of standard out, you would prefer to not have holes in it, and a literal implementation of maximum-average-rate will not give you that. Artificially zeroing out the queue could patch this up at the cost of sending a higher fraction of events, and I think that's fine. One tradeoff we could make is to lower the max rate to something like 10 events per second, but divorce that from the queue length. This would give 2 separately configurable items: the max rate, and the dead-band number of events to stream before enforcing the rate.

However, I do think we should rule out algorithm (1), because it gives periods of up to 1 second with no output, and that is human-perceivable.

@AlanCoding
Member Author

Scratch what I said about "adding clearing of the queue to algorithm (2)".

If we wanted to do some kind of standard-out "following" in the future, that becomes a conversation with different metrics involved. For that, you need a notion of a page size (again, we can guesstimate 30) and will need to accept longer delays between events... probably on the order of 1/2 second instead of the 1/30th second this PR uses right now.

I have updated this PR to reflect option (2) as laid out before. I have also used log messages to measure the event rate, and it tends to clock in at 30 to 31 messages/second; because of the less restrictive criteria, it overshoots the target by just a little. I'm happy with it, just looking for feedback.

@AlanCoding AlanCoding changed the title [WIP] Intentionally drop job event websocket messages in excess of 30 per second (configurable) Intentionally drop job event websocket messages in excess of 30 per second (configurable) May 4, 2021
@AlanCoding AlanCoding marked this pull request as ready for review May 4, 2021 17:07
@@ -142,7 +142,8 @@ def flush(self, force=False):
                 logger.exception('Database Error Saving Job Event')
         duration_to_save = time.perf_counter() - duration_to_save
         for e in events:
-            emit_event_detail(e)
+            if not e.event_data.get('skip_websocket_message', False):
+                emit_event_detail(e)
Member Author

It's not completely unreasonable that we might change the .get to a .pop here, which would remove the key from the API. It is useful for debugging purposes, but I don't know that we want it in production.

Member

I'm pro change to pop.
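
A minimal sketch of that suggestion, reusing the flush() loop from the hunk above (placement is hypothetical):

for e in events:
    # .pop removes the key entirely, so it never shows up in the API,
    # while still gating the websocket emit.
    if not e.event_data.pop('skip_websocket_message', False):
        emit_event_detail(e)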

@AlanCoding
Member Author

Two issues that I can fairly easily address in a new commit.

  • We should unconditionally exclude any events where stdout is blank and start_line==end_line, which are runner_on_start events and don't impact the UI
  • We should unconditionally include the minimal events
MINIMAL_EVENTS = set(['playbook_on_play_start', 'playbook_on_task_start', 'playbook_on_stats', 'EOF'])

This is a performance-sensitive decision, but I'd point toward handling these in the event_handler, so that these unconditional events can count toward the current event rate calculations.
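
A hedged sketch of those two rules as they might sit in front of the rate-limit check (the function name is hypothetical; the constant is the one quoted above):

MINIMAL_EVENTS = set(['playbook_on_play_start', 'playbook_on_task_start', 'playbook_on_stats', 'EOF'])

def websocket_override(event_data):
    # Unconditionally include the minimal events.
    if event_data.get('event') in MINIMAL_EVENTS:
        return True
    # Unconditionally exclude blank events (e.g. runner_on_start): no stdout
    # and no line movement, so they don't impact the UI.
    if event_data.get('stdout') == '' and event_data['start_line'] == event_data['end_line']:
        return False
    return None  # no override; fall through to the rate-limit algorithm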

@AlanCoding
Member Author

Updates:

The UI has added back a "Follow" button. If the button is clicked:

  • output window tails the most recent job output
  • it will not show a scroll bar
  • every ~1 second it will update the output, populated by an API request

@AlanCoding
Member Author

I finished some touchups to the logging, so I wanted to share what it illustrates:

https://gist.github.com/AlanCoding/758ec1b7c4185af344880c0a62b2b8f6

The measured rates only consider the messages that were actually sent. What you see is that when the instant rate is too high, events are not sent until enough time has passed for the instant rate to come back under the threshold.

@AlanCoding
Member Author

Update specifically for the UI engineers @mabashian @jakemcdermott:

To make "following" look okay while drawing events from websocket messages, I return to my prior point that algorithm (1) is required. I had to do some updates to apply it on top of the current branch.

Algorithm (1) explanation, bursting:
We send events in continuous "bursts". The minimum length of a burst is 30 events. We don't start a burst if the event would push the average rate over 30/second. Thus, fast-producing jobs will quickly "spend" their 30 events, and further events will be skipped until the 1-second mark. This repeats every second for as long as the job event rate is over the 30/second limit.
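
As a rough, self-contained sketch of this bursting behavior (class and attribute names are illustrative; the actual change is the tasks.py diff below):

import time
from collections import deque

class BurstLimiter:
    def __init__(self, burst_size=30):
        self.burst_size = burst_size
        self.timings = deque(maxlen=burst_size)  # timestamps of emitted events
        self.counter = 0

    def should_emit(self):
        self.counter += 1
        now = time.time()
        if self.counter <= self.burst_size:
            self.timings.append(now)
            return True  # grace period for the first page of output
        too_fast = now - self.timings[0] < 1.0              # oldest emit was <1 second ago
        is_full = len(self.timings) == self.timings.maxlen  # this burst is fully "spent"
        inside_window = len(self.timings) > 2               # already committed to a burst
        if is_full and too_fast:
            # Reset the window but keep the oldest timestamp, so further
            # events are skipped until the 1-second mark.
            oldest = self.timings[0]
            self.timings.clear()
            self.timings.append(oldest)
        emit = (not too_fast) or inside_window
        if emit:
            self.timings.append(now)
        return emit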

I had to adjust my visualization expression to account for blank events:

' '.join([{True: 'x', False: 'o'}.get('skip_websocket_message' in e.event_data) for e in JobEvent.objects.filter(job_id=302).order_by('created')[:1000] if e.stdout and e.start_line != e.end_line])

For a demo chatty_tasks job with algorithm (1) that looks like this:

o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x

The code for this is:

diff --git a/awx/main/tasks.py b/awx/main/tasks.py
index 664da13949..8eda5b1c45 100644
--- a/awx/main/tasks.py
+++ b/awx/main/tasks.py
@@ -1230,32 +1230,32 @@ class BaseTask(object):
         if self.recent_event_timings:
             cpu_time = time.time()
             first_window_time = self.recent_event_timings[0]
-            last_window_time = self.recent_event_timings[-1]
+            queue_size = len(self.recent_event_timings)
 
             if event_data.get('event') in MINIMAL_EVENTS:
                 should_emit = True  # always send some types like playbook_on_stats
             elif event_data.get('stdout') == '' and event_data['start_line'] == event_data['end_line']:
                 should_emit = False  # exclude events with no output
+            elif event_data.get('counter', 0) < 30:
+                should_emit = True  # grace period for first page of output
             else:
-                should_emit = any(
-                    [
-                        # if the 30th most recent websocket message was sent over 1 second ago
-                        cpu_time - first_window_time > 1.0,
-                        # if the very last websocket message came in over 1/30 seconds ago
-                        self.recent_event_timings.maxlen * (cpu_time - last_window_time) > 1.0,
-                        # if the queue is not yet full
-                        len(self.recent_event_timings) != self.recent_event_timings.maxlen,
-                    ]
-                )
+                too_fast = bool(cpu_time - first_window_time < 1.0)  # if the 30th most recent was sent < 1 second ago
+                is_full = bool(queue_size == self.recent_event_timings.maxlen)  # if the queue is full
+                inside_window = bool(queue_size > 2)  # if we have committed to sending a window, we have to finish it
+
+                should_emit = (not too_fast) or inside_window
+
+                if is_full and too_fast:
+                    self.recent_event_timings.clear()
+                    self.recent_event_timings.append(first_window_time)
 
             logger.debug(
-                'Job {} event {} websocket send {}, queued: {}, rate - avg: {:.3f}, last: {:.3f}'.format(
+                'Job {} event {} websocket send {}, queued: {}, rate: {:.3f}'.format(
                     self.instance.id,
                     event_data.get('counter', 0),
                     should_emit,
-                    len(self.recent_event_timings),
-                    30.0 / (cpu_time - first_window_time),
-                    1.0 / (cpu_time - last_window_time),
+                    queue_size,
+                    30.0 / (cpu_time - first_window_time)
                 )
             )

This works well for me so far.

event_data.get('counter', 0),
should_emit,
len(self.recent_event_timings),
30.0 / (cpu_time - first_window_time),
Member

I'm not entirely sure, but I think you can end up dividing by 0 here.

Member Author

At this point, I think it makes sense to just delete the log message.

@softwarefactory-project-zuul
Contributor

Build succeeded.

@softwarefactory-project-zuul
Contributor

Build succeeded (gate pipeline).
