fix: embed the synchronizer process within the plugin runner #333

Merged: 18 commits into main from fix/embed-synchronizer, Feb 2, 2025

Conversation

@djantzen (Contributor) commented Jan 16, 2025

https://canvasmedical.atlassian.net/browse/KOALA-2440

At present, the plugin synchronizer runs as a separate process managed by Circus. Problems with doing so:

  • it requires about 60 MB of memory to run
  • its actions have to be coordinated with the plugin runner via a SIGHUP

This PR moves the responsibilities of the synchronizer into an async task performed by the plugin runner.

Related home-app PR: https://github.com/canvas-medical/canvas/pull/17170
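
Sketched out, the embedded synchronizer is essentially a long-lived asyncio task that subscribes to a Redis pubsub channel and reloads plugins when a message arrives. The snippet below is only an illustration of that idea, assuming redis-py's asyncio client; the channel name, URL, and reload_plugins stub are hypothetical, not the PR's actual code.

```python
import asyncio

import redis.asyncio as aredis

CHANNEL = "plugin-synchronizer"  # hypothetical channel name


async def reload_plugins() -> None:
    # Stand-in for the plugin runner's real reload logic.
    print("reloading plugins...")


async def synchronize_plugins(url: str = "redis://localhost:6379") -> None:
    pubsub = aredis.Redis.from_url(url).pubsub()
    await pubsub.subscribe(CHANNEL)
    while True:
        # Awaiting here suspends only this coroutine; other tasks on the
        # plugin runner's event loop keep running while we wait.
        message = await pubsub.get_message(ignore_subscribe_messages=True, timeout=None)
        if message is not None:
            await reload_plugins()


async def main() -> None:
    # The synchronizer becomes one more task on the plugin runner's event
    # loop rather than a separate Circus-managed process; in the real runner
    # it would run alongside the gRPC server on the same loop.
    synchronizer = asyncio.create_task(synchronize_plugins())
    await synchronizer


if __name__ == "__main__":
    asyncio.run(main())
```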

@djantzen changed the title from "wip to see if async pubsub listener works" to "fix: wip to see if async pubsub listener works" on Jan 16, 2025
@djantzen force-pushed the fix/embed-synchronizer branch 4 times, most recently from e223e50 to 5f29cba, on January 23, 2025 02:56
@djantzen marked this pull request as ready for review on January 23, 2025 02:57
@djantzen requested a review from a team as a code owner on January 23, 2025 02:57
@djantzen force-pushed the fix/embed-synchronizer branch 2 times, most recently from e73a389 to 5f29cba, on January 23, 2025 05:25
@djantzen changed the title from "fix: wip to see if async pubsub listener works" to "fix: embed the synchronizer process within the plugin runner" on Jan 23, 2025
@djantzen force-pushed the fix/embed-synchronizer branch from 3c87abd to 9659288 on January 24, 2025 05:44
@jamagalhaes (Collaborator) left a comment

Overall LGTM.
Since we are using a single thread with concurrency, I would add an await asyncio.sleep(0.1) at the end of the loop to prevent a busy loop when getting messages from the pubsub channel.
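
For reference, the suggested pattern would look something like this sketch (assuming redis-py's asyncio client; the channel name is hypothetical). Note that the discussion below ends up preferring get_message's own timeout over an explicit sleep.

```python
import asyncio

import redis.asyncio as aredis


async def listen_with_sleep(url: str = "redis://localhost:6379") -> None:
    pubsub = aredis.Redis.from_url(url).pubsub()
    await pubsub.subscribe("plugin-synchronizer")  # hypothetical channel name
    while True:
        # With the default timeout this returns immediately, usually None...
        message = await pubsub.get_message(ignore_subscribe_messages=True)
        if message is not None:
            print("received:", message["data"])
        # ...hence the suggested short sleep, so the loop yields to other
        # tasks instead of spinning while the channel is quiet.
        await asyncio.sleep(0.1)
```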

Review thread on plugin_runner/plugin_runner.py (outdated, resolved)
Review thread on plugin_runner/plugin_installer.py (resolved)
@djantzen closed this on Jan 29, 2025
@djantzen reopened this on Jan 29, 2025
@djantzen force-pushed the fix/embed-synchronizer branch from 1be3823 to 0856151 on January 29, 2025 20:35
@jamagalhaes (Collaborator)

@djantzen it won't prevent other operations from occurring, but we don't need to continuously use CPU and network to check for new messages. I would say that a 1-second delay is totally fine.

@jamagalhaes force-pushed the fix/embed-synchronizer branch from 0856151 to 3e9d520 on January 29, 2025 22:25
Review thread on settings.py (outdated, resolved)
@beaugunderson (Member)

Since we are using a single thread with concurrency, I would add a await asyncio.sleep(0.1) at the end of the loop to prevent a busy loop when getting messages from the pubsub channel

get_message is a blocking call, but it accepts an optional timeout parameter; we should specify that rather than adding our own asyncio.sleep:
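
A sketch of that approach, assuming redis-py's asyncio client (channel name hypothetical): let get_message do the waiting instead of sleeping.

```python
import asyncio

import redis.asyncio as aredis


async def listen(url: str = "redis://localhost:6379") -> None:
    pubsub = aredis.Redis.from_url(url).pubsub()
    await pubsub.subscribe("plugin-synchronizer")  # hypothetical channel name
    while True:
        # get_message waits up to `timeout` seconds for a message and returns
        # None on expiry, so no separate asyncio.sleep is needed.
        message = await pubsub.get_message(ignore_subscribe_messages=True, timeout=1.0)
        if message is not None:
            print("received:", message["data"])


if __name__ == "__main__":
    asyncio.run(listen())
```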

@jamagalhaes (Collaborator)

Since we are using a single thread with concurrency, I would add a await asyncio.sleep(0.1) at the end of the loop to prevent a busy loop when getting messages from the pubsub channel

get_message is a blocking call, it accepts an optional timeout parameter, we should specify that and not add our own asyncio.sleep:


Totally forgot that get_message already allows specifying a timeout, so yes, no need to use asyncio.sleep 👍

@jamagalhaes (Collaborator) commented Jan 29, 2025

@djantzen @beaugunderson since we have the capability of specifying a timeout with get_message, we could set it to None to wait indefinitely. This will maximize efficiency

@beaugunderson (Member)

@djantzen @beaugunderson since we have the capability of specifying a timeout with get_message, we could set it to None to wait indefinitely. This will maximize efficiency

Then blocking=True would be the case, right, and it turns into a blocking read?

@djantzen (Contributor, Author)

@djantzen it won't prevent other operations from occurring but we don't need to continuously using cpu and network to check for new messages. I would say that 1 second of delay it's totally fine.

1 second, or a tenth of a second? I'm not entirely clear on the tradeoffs here. Seems like a longer timeout would make the listener more responsive to pubsub messages, at the cost of less responsiveness to events firing within the system. I'd think we want to prioritize the events rather than the reload commands.

@jamagalhaes (Collaborator)

@djantzen @beaugunderson since we have the capability of specifying a timeout with get_message, we could set it to None to wait indefinitely. This will maximize efficiency

then blocking=True would be the case right and it turns into a blocking read?

Initially, I thought the same—that calling await redis.get_message(timeout=None) would block the entire event loop indefinitely. However, after looking into the implementation, that’s not what actually happens under the hood.

The key reason is that, although the function signature and docstrings remain unchanged for asyncio, the async version of get_message is built on top of asyncio streams. These streams are designed to be non-blocking, meaning they release control of the event loop while waiting for a new message.

Here’s what happens:

  • When get_message(timeout=None) is called, it internally does not perform a traditional blocking wait.
  • Instead, it relies on asyncio.StreamReader.read which pauses execution and yields control back to the event loop.
  • This allows other async tasks to run in parallel while waiting for a new Redis message to arrive.

I agree that this isn't explicitly clear in the documentation, but after debugging the implementation, I can confirm that this is how it works. So, using timeout=None is safe—it won’t block the entire event loop, just the coroutine that awaits the message.
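
A quick way to convince yourself of this, as a sketch assuming redis-py's asyncio client and a local Redis (channel name hypothetical): run the listener next to another coroutine and watch the second one keep ticking while the listener waits.

```python
import asyncio

import redis.asyncio as aredis


async def heartbeat() -> None:
    while True:
        print("event loop is still responsive")
        await asyncio.sleep(1)


async def listen(url: str = "redis://localhost:6379") -> None:
    pubsub = aredis.Redis.from_url(url).pubsub()
    await pubsub.subscribe("plugin-synchronizer")  # hypothetical channel name
    while True:
        # Suspends only this coroutine until a message arrives; the event
        # loop is free to keep running heartbeat() in the meantime.
        message = await pubsub.get_message(ignore_subscribe_messages=True, timeout=None)
        print("got:", message)


async def main() -> None:
    await asyncio.gather(heartbeat(), listen())


if __name__ == "__main__":
    asyncio.run(main())
```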

@djantzen (Contributor, Author)

So, using timeout=None is safe—it won’t block the entire event loop, just the coroutine that awaits the message.

Thanks for digging into the code, @jamagalhaes. This makes sense to me. I was beginning to wonder about the whole point of asyncio if we still had to perform tricks like sleeping and timing out to avoid blocking other async tasks.

@jamagalhaes (Collaborator)

Thanks for digging into the code @jamagalhaes . This makes sense to me. I was beginning to wonder about the whole point of asyncio if we still had to perform tricks like sleeping and timing out to avoid blocking other async tasks.

@djantzen you're welcome. You still need to explicitly specify timeout=None, because by default the timeout is 0.

@djantzen (Contributor, Author) commented Jan 30, 2025 via email

@beaugunderson (Member) commented Jan 30, 2025

Actually, it may be unnecessary since we’re getting host as part of our other metrics already. I’ll remove it.

It's not used here, so it can be removed without changing anything. But since these are statsd metrics, they include the tags from the environment as defined in telegraf.conf, namely:

[global_tags]
  source = "home-app"
  customer = "$CUSTOMER_IDENTIFIER"
  release_channel = "$RELEASE_CHANNEL"
  process_type = "$APTIBLE_PROCESS_TYPE"
  organization = "$ORGANIZATION"

none of which is the hostname, so if you need it you'll still need to add it.

@beaugunderson (Member)

n/m, we discussed this further in Slack (https://canvas-medical.slack.com/archives/C1UGTHCKS/p1738271118532679); telegraf will set it correctly.

@djantzen (Contributor, Author) commented Jan 30, 2025 via email

@csande (Contributor) commented Jan 30, 2025

Would the problem manifest if Redis pubsub became overburdened and was slow to receive and acknowledge requests, thereby blocking other tasks? TBH it didn’t even register with me that logs of these administrative actions would get routed through Redis. Do developers really need to see this activity?

(The review comment being replied to, from @csande in plugin_runner/plugin_runner.py, on the added log.info("Reloading plugins...") in ReloadPlugins: "Is this statement a blocking call, in addition to the other calls to the logger? If you dig into our logging module, it looks like the PubSubBase class uses the non-asyncio version of Redis. FWIW, we're also calling it a few times in HandleEvent, another asyncio context.")

I suppose this is theoretical until we see it happen, but if Redis were overburdened, it could block the main event loop. If the main event loop is blocked, then that would affect the ability of the plugin runner to continue handling incoming events. On the home-app side, send_to_plugin_runner is a blocking call, and we're funneling more and more activity through that function as we add more event hooks. So you'd see home-app affected when it's unable to receive responses from the plugin runner.

It's worth mentioning that if we were using async Redis, and Redis were overburdened, we would still see problems. They would probably just manifest differently. Tasks would still have trouble running, but the reason would be Redis directly, rather than being blocked by other tasks (i.e., blocked indirectly by Redis).

When you introduce blocking calls in an asyncio project, it can be a chokepoint for everything. Even if Redis isn't overburdened, the Plugin Runner main event loop can't do anything else for the few milliseconds that it is communicating with the sync Redis client.

There are a few things we could do.

  • As has been discussed, we could get rid of asyncio (at least for now).
  • Developers need a sync logger that publishes to Redis, so we can't touch that, but we could make an async logger for the Plugin Runner to use (see the sketch after this list).
  • Redis timeout values and error handling would be another thing to look at. In Fumage, errors with the Redis token introspection cache do not impede the ability of the API to continue servicing requests.
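
As a rough sketch of the async-logger option above, assuming redis-py's asyncio client (the class name, channel, and payload shape are hypothetical, not the existing logging module):

```python
import json

import redis.asyncio as aredis


class AsyncPubSubLogger:
    """Publishes log lines to a Redis channel without blocking the event loop."""

    def __init__(self, channel: str, url: str = "redis://localhost:6379") -> None:
        self._channel = channel
        self._client = aredis.Redis.from_url(url)

    async def info(self, message: str, **extra: str) -> None:
        payload = json.dumps({"level": "INFO", "message": message, **extra})
        # Awaiting the publish yields the event loop while the call is in
        # flight, instead of holding it the way the sync client does.
        await self._client.publish(self._channel, payload)


# Usage inside an async handler would look like:
#   log = AsyncPubSubLogger("logging")          # hypothetical channel name
#   await log.info("Reloading plugins...")
```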

@jamagalhaes (Collaborator)

Would the problem manifest if Redis pubsub became overburdened and was slow to receive and acknowledge requests, thereby blocking other tasks?

When you introduce blocking calls in an asyncio project, it can be a chokepoint for everything. Even if Redis isn't overburdened, the Plugin Runner main event loop can't do anything else for the few milliseconds that it is communicating with the sync Redis client.

@csande While this scenario is theoretically possible, I believe it would be one of the lesser concerns in practice. If our Redis instance were overburdened, many other components in home-app that rely on it would likely be impacted as well, not just the Plugin Runner.

That said, you’re absolutely right in pointing out that redis.publish is a blocking call, which could momentarily hold up the main event loop. However, this shouldn't significantly degrade plugin performance. To mitigate this, I implemented a hybrid approach: if an event loop is already running, we use the asynchronous publish; otherwise, we fall back to the synchronous version. Take a look at the implementation and let me know your thoughts. #381

@csande (Contributor) commented Jan 31, 2025

@csande While this scenario is theoretically possible, I believe it would be one of the lesser concerns in practice. If our Redis instance were overburdened, many other components in home-app that rely on it would likely be impacted as well, not just the Plugin Runner.

That said, you’re absolutely right in pointing out that redis.publish is a blocking call, which could momentarily hold up the main event loop. However, this shouldn't significantly degrade plugin performance. To mitigate this, I implemented a hybrid approach: if an event loop is already running, we use the asynchronous publish; otherwise, we fall back to the synchronous version. Take a look at the implementation and let me know your thoughts. #381

Thank you for taking a look at this @jamagalhaes. I'm in agreement that this is theoretical and not a huge concern right now, but it's certainly a core tenet (i.e. this isn't premature optimization) to not make blocking calls in an asyncio framework. I don't know about our plans for scaling in the longer-term, but if we want the plugin runner to handle hundreds or thousands of events at once, those blocking calls to even a non-overburdened Redis will eventually add up, and the performance impact may be noticeable at some point.

@jamagalhaes mentioned this pull request on Jan 31, 2025
@djantzen merged commit ab576f2 into main on Feb 2, 2025
5 checks passed
@djantzen deleted the fix/embed-synchronizer branch on February 2, 2025 18:21