Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add a few additional optimizations to the callback receiver #8193

Merged

Conversation

ryanpetrello
Copy link
Contributor

@ryanpetrello ryanpetrello commented Sep 21, 2020

cc: @gamuniz

Many events, few tasks:
Kapture 2020-09-21 at 19 10 48

Many events, many tasks:
test

@softwarefactory-project-zuul
Copy link
Contributor

Build succeeded.

@@ -58,6 +59,11 @@ def create_host_status_counts(event_data):


def emit_event_detail(event):
if (
settings.UI_LIVE_UPDATES_ENABLED is False and
event.event not in ('playbook_on_stats', 'EOF')
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to maybe make this be not startswith("runner_"), but I'm having some trouble getting the UI to behave in a reasonable way cc @jakemcdermott is taking a look, I think.

@wenottingham
Copy link
Contributor

How hard is a placeholder in the text window in this case?

@jakemcdermott jakemcdermott force-pushed the ws-broadcast-limited branch 2 times, most recently from 56e3470 to 9d23f64 Compare September 21, 2020 22:57
@softwarefactory-project-zuul
Copy link
Contributor

Build succeeded.

@softwarefactory-project-zuul
Copy link
Contributor

Build succeeded.

@ryanpetrello
Copy link
Contributor Author

ryanpetrello commented Sep 21, 2020

@wenottingham see updated behavior in GIFs above.

@softwarefactory-project-zuul
Copy link
Contributor

Build succeeded.

@ryanpetrello ryanpetrello changed the title don't broadcast ws:// events when UI_LIVE_UPDATES_ENABLED is False add a few additional optimizations to the callback receiver Sep 22, 2020
@softwarefactory-project-zuul
Copy link
Contributor

Build succeeded.

@ryanpetrello
Copy link
Contributor Author

Here are some before/after numbers of this PR processing a large backlog of events on a single node m5.4xlarge install. Both are using the default settings.JOB_EVENT_WORKERS = 4, and the external logger is disabled.

before (devel)

[root@ip-10-0-4-131 ec2-user]# awx-manage callback_stats
main_jobevent
↳  last minute 123999

main_inventoryupdateevent
↳  last minute 0

main_projectupdateevent
↳  last minute 0

main_adhoccommandevent
↳  last minute 0

after (UI_LIVE_UPDATES_ENABLED = False)

[root@ip-10-0-5-5 ec2-user]# awx-manage callback_stats
main_jobevent
↳  last minute 532999

main_inventoryupdateevent
↳  last minute 0

main_projectupdateevent
↳  last minute 0

main_adhoccommandevent
↳  last minute 0

@ryanpetrello
Copy link
Contributor Author

Here is the same comparison again, but with logging configured and pointed at a local UDP port via rsyslogd:

image

before (devel)

[root@ip-10-0-4-131 ec2-user]# awx-manage callback_stats
main_jobevent
↳  last minute 91999

main_inventoryupdateevent
↳  last minute 0

main_projectupdateevent
↳  last minute 0

main_adhoccommandevent
↳  last minute 0

after (UI_LIVE_UPDATES_ENABLED = False)

[root@ip-10-0-5-5 ec2-user]# awx-manage callback_stats
main_jobevent
↳  last minute 334999

main_inventoryupdateevent
↳  last minute 0

main_projectupdateevent
↳  last minute 0

main_adhoccommandevent
↳  last minute 0

@ryanpetrello
Copy link
Contributor Author

ryanpetrello commented Sep 22, 2020

With UI_LIVE_UPDATES_ENABLED = False and JOB_EVENT_WORKERS=16 on an m5.4xlarge (logging disabled), I'm able to break 1M inserts per minute:

[root@ip-10-0-5-5 ec2-user]# awx-manage callback_stats
main_jobevent
↳  last minute 1096363

main_inventoryupdateevent
↳  last minute 0

main_projectupdateevent
↳  last minute 0

main_adhoccommandevent
↳  last minute 0

@ryanpetrello
Copy link
Contributor Author

@moonrail @psuriset either of you have any time to give this patch a whirl in comparison with stable builds? Specifically, I'm curious to see how event processing time improves.

'Event data saved.',
extra=dict(python_objects=dict(job_event=self))
)
if settings.LOG_AGGREGATOR_ENABLED:
Copy link
Contributor Author

@ryanpetrello ryanpetrello Sep 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our custom filter technically checks this flag, too, but it does so after making the logging record and looping through all the handlers (which is actually pretty expensive). If logging is disabled, it's cheaper to just not call this .info() function at all.

@@ -406,6 +410,7 @@ def _get_default(self, name):
def SETTINGS_MODULE(self):
return self._get_default('SETTINGS_MODULE')

@cachetools.cached(cache=cachetools.TTLCache(maxsize=2048, ttl=SETTING_MEMORY_TTL))
Copy link
Contributor Author

@ryanpetrello ryanpetrello Sep 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my profiling for high workloads, the cache hits for redis actually add up and end up being quite expensive in bulk; even though it's just a round trip to a domain socket, it's still 10+ round trips per emitted event (because we access a handful of distinct settings to determine how to log) and that has a real cost when you multiply it by millions of events

For the purposes of the callback receiver, it's probably not imperative that we have real-time correctness for these setting values (caching for a few seconds locally in memory makes a notable performance improvement).

Copy link
Member

@mabashian mabashian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UI changes lgtm

@@ -57,7 +58,18 @@ def create_host_status_counts(event_data):
return dict(host_status_counts)


MINIMAL_EVENTS = set([
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would think that with a set this small that the hash calculation would overwhelm the big-O lookup advantage you gain by using a set instead of a tuple.

@softwarefactory-project-zuul
Copy link
Contributor

Build succeeded (gate pipeline).

@softwarefactory-project-zuul softwarefactory-project-zuul bot merged commit d7ca49c into ansible:devel Sep 23, 2020
@mcharanrm
Copy link
Contributor

mcharanrm commented Sep 24, 2020

Hello @ryanpetrello, @kdelee CC: @psuriset, We deployed patch on 3 node cluster and all the 3 nodes are virtual machines (created using libvirt) with RHEL 7.8 installed. Each vm's configured with 4vcpus, 23G RAM, and 75G of shared disk space. We can see 33.8% increase in events processing rate after changing UI_LIVE_UPDATES_ENABLED: false and having higher event processing rate can explains less event lag between job run time and jobs events processing time.
Please take a look at our CPT graphs.
Screenshot from 2020-09-24 16-36-44

Please let me know if there is anything else that i need to test it.

@ryanpetrello
Copy link
Contributor Author

This looks great - thank you @mcharanrm!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants