
azure-servicebus.AutoLockRenewer change ThreadPool usage #19362

Closed
m-credera opened this issue Jun 21, 2021 · 15 comments

@m-credera

Is your feature request related to a problem? Please describe.
The azure.servicebus.AutoLockRenewer is implicitly limited to renewing locks for at most max_workers number of messages. For an application that takes many messages off of the queue for parallel processing, this can present problems.

The azure.servicebus.AutoLockRenewer uses a single future inside of a ThreadPoolExecutor to monitor messages and renew locks. This means it's limited to monitoring at most max_workers messages at a time (since a thread pool only runs one future per-thread until the future completes).
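
As a rough illustration of the usage pattern being described, here is a minimal sketch (the connection string, queue name, and worker count are placeholders, and the comments reflect the pre-7.3.2 behavior discussed in this issue):

    from azure.servicebus import ServiceBusClient, AutoLockRenewer

    # Placeholders for illustration only.
    CONN_STR = "<service-bus-connection-string>"
    QUEUE_NAME = "<queue-name>"

    # One monitoring future is submitted per registered message, so a pool of
    # 4 workers can only keep ~4 message locks renewed at a time.
    renewer = AutoLockRenewer(max_workers=4)

    with ServiceBusClient.from_connection_string(CONN_STR) as client:
        with client.get_queue_receiver(queue_name=QUEUE_NAME) as receiver:
            # Pull a larger batch than max_workers for parallel processing.
            messages = receiver.receive_messages(max_message_count=20, max_wait_time=5)
            for msg in messages:
                # Registrations beyond the pool size queue up behind the busy
                # workers, so those locks may expire before they are renewed.
                renewer.register(receiver, msg, max_lock_renewal_duration=300)
            # ... process the messages in parallel, then settle them ...
            for msg in messages:
                receiver.complete_message(msg)

    renewer.close()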

Describe the solution you'd like
An alternative implementation could instead use max_workers futures monitoring a set of messages per future rather than a single message per future.
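
A very rough sketch of that idea (not the library's actual implementation; the class name and all of its internals are invented purely to illustrate "a set of messages per future", and error handling/unregistration is omitted):

    import threading
    import time
    from concurrent.futures import ThreadPoolExecutor


    class BatchedRenewerSketch:
        """Illustration only: each worker future renews a whole set of messages."""

        def __init__(self, max_workers=4, interval=1.0):
            self._executor = ThreadPoolExecutor(max_workers=max_workers)
            self._partitions = [[] for _ in range(max_workers)]
            self._lock = threading.Lock()
            self._shutdown = threading.Event()
            self._interval = interval
            # Submit exactly max_workers long-running futures up front.
            for partition in self._partitions:
                self._executor.submit(self._renew_loop, partition)

        def register(self, receiver, message):
            # Assign each message to the currently smallest partition.
            with self._lock:
                min(self._partitions, key=len).append((receiver, message))

        def _renew_loop(self, partition):
            # One future per worker, each covering many messages.
            while not self._shutdown.is_set():
                with self._lock:
                    items = list(partition)
                for receiver, message in items:
                    receiver.renew_message_lock(message)
                time.sleep(self._interval)

        def close(self):
            self._shutdown.set()
            self._executor.shutdown(wait=True)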

Describe alternatives you've considered
An alternative solution would be to implement parallelism with each worker taking a message off of the queue for itself to process. This is certainly possible, but there are still cases where taking more than max_workers messages off of the queue could be desirable (e.g. prefetching messages).

@ghost ghost added needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. customer-reported Issues that are reported by GitHub users external to the Azure organization. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Jun 21, 2021
@yunhaoling yunhaoling added Client This issue points to a problem in the data-plane of the library. Messaging Messaging crew Service Bus labels Jun 22, 2021
@ghost ghost removed the needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. label Jun 22, 2021
@yunhaoling yunhaoling self-assigned this Jun 22, 2021
@yunhaoling
Contributor

yunhaoling commented Jun 22, 2021

hey @michaelcredera, thanks for your feedback, I'll look into it.

--

Your understanding is correct. The current implementation indeed has this implicit limitation -- one monitored message per worker thread. This should be addressed.

An alternative implementation could instead use max_workers futures monitoring a set of messages per future rather than a single message per future.

thanks for the suggestions, they both sound like good plans. I'll allocate some time to think about them and will get back to you once I've figured it out.

@yunhaoling
Contributor

hey @michaelcredera, sorry, we have limited resources at the moment. Would you like to try making a PR for the change? I would be glad to review it 😊 (It's completely fine if you don't have the time; just FYI, we're an open-source library and we welcome user contributions.)

@yunhaoling yunhaoling added the help wanted This issue is tracking work for which community contributions would be welcomed and appreciated label Jun 29, 2021
@m-credera
Author

Hey @yunhaoling - I can try to get a PR up this week

@m-credera
Author

@yunhaoling Are there branching permissions on the repo? I didn't see anything in the contribution guide, but I'm getting a 403 error pushing my branch:

git push --set-upstream origin feature/19362-threadpool-improvements-for-servicebus
...
fatal: unable to access 'https://github.com/Azure/azure-sdk-for-python.git/': The requested URL returned error: 403

@yunhaoling
Contributor

hey @michaelcredera, creating a branch directly in the azure-sdk-for-python repo is not allowed for external contributors.
Instead, you can fork the azure-sdk-for-python repo and create a PR from a branch in your fork.
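
For reference, the fork-based flow looks roughly like this (the username below is a placeholder):

# fork Azure/azure-sdk-for-python on GitHub first, then:
git remote add fork https://github.com/<your-username>/azure-sdk-for-python.git
git push --set-upstream fork feature/19362-threadpool-improvements-for-servicebus
# finally, open a PR from that branch of your fork against Azure/azure-sdk-for-python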

@yunhaoling yunhaoling added this to the [2021] August milestone Jul 14, 2021
@yunhaoling
Contributor

hey @michaelcredera, just a friendly check-in: do you need any help with the PR?

@yunhaoling
Contributor

hey @michaelcredera, we have improved the implementation in release v7.3.2 so that the number of messages the auto lock renewer can handle is no longer limited by the max_workers value. Sorry for keeping you waiting, and please try it out.

however, it is still recommended to pass a large max_workers (or an executor with a large max_workers) when there are many objects to be renewed.
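
For example (a minimal sketch; the worker counts are arbitrary):

    from concurrent.futures import ThreadPoolExecutor
    from azure.servicebus import AutoLockRenewer

    # Option 1: let the renewer create its own, larger thread pool.
    renewer = AutoLockRenewer(max_workers=16)

    # Option 2: hand the renewer an existing executor to share.
    executor = ThreadPoolExecutor(max_workers=16)
    renewer_shared = AutoLockRenewer(executor=executor)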

@yunhaoling yunhaoling added the issue-addressed Workflow: The Azure SDK team believes it to be addressed and ready to close. label Aug 10, 2021
@ghost

ghost commented Aug 10, 2021

Hi @michaelcredera. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text “/unresolve” to remove the “issue-addressed” label and continue the conversation.

@Gerrit-K

Hey @yunhaoling, I'd like to note here that this change seemed to be the cause of a significant performance degradation we've observed in a couple of our apps where processing times increased from ~1.7s to ~9s per message.

We're running microservices which process one message at a time in a single process. However, since these services run as containers on an AKS cluster with multi-core nodes, Python still "thinks" it has multiple cores available, and thus the AutoLockRenewer creates a ThreadPoolExecutor with at least 5 threads (because we set neither executor nor max_workers). This, in turn, causes the dispatcher to submit the renewals as tasks instead of executing them directly. We can't say for sure, but we assume that this (in combination with the sleep time being reduced to 0.5s) causes a message ping-pong which wastes precious resources.

Now, I don't want to call this out as a bug, mainly because we can't validate our claims, but also because we're able to mitigate this issue by simply setting max_workers=1 (which doesn't really matter in our single-message-at-a-time case). However, I want to raise awareness that this has (or can have) a massive impact because it changed the default behavior of the AutoLockRenewer and the change was also released in a "patch" release (7.3.1 to 7.3.2), which was quite unexpected for us.
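
For anyone hitting the same symptom, the mitigation described above is a one-line change (a sketch; the "at least 5 threads" figure presumably comes from ThreadPoolExecutor's default of min(32, os.cpu_count() + 4) when neither executor nor max_workers is set):

    from azure.servicebus import AutoLockRenewer

    # Force a single renewal thread; in a one-message-at-a-time service the
    # extra dispatcher/worker threads are pure overhead.
    renewer = AutoLockRenewer(max_workers=1)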

@yunhaoling
Contributor

hey @Gerrit-K, I sincerely apologize for the trouble the change has caused you, and I totally agree that computing resources are precious.

I suspect that the 0.5s sleep is the key factor here, as the lock renewal loop now executes twice as fast as before.
Could you help me verify whether hacking the sleep time back to 1s mitigates your issue?

we set the interval to a shorter value because we want the auto lock renewer to be more "proactive" in renewing locks, to keep messages/sessions alive when the lock duration is short.

I would be glad to roll it back to 1s if that helps mitigate the performance issue you're experiencing. I'll also look for ways to save CPU cycles.
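
One way to test that locally might be to patch the renewer's sleep interval after construction (this assumes the interval is stored in the private _sleep_time attribute; it is internal and may change between releases):

    from azure.servicebus import AutoLockRenewer

    renewer = AutoLockRenewer()
    # Testing only: restore the pre-7.3.2 renewal interval by patching the
    # (assumed) private _sleep_time attribute back to 1 second.
    renewer._sleep_time = 1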

@Gerrit-K

Gerrit-K commented Aug 18, 2021

Hi @yunhaoling. Thanks for the hint.

I suspect that the 0.5s sleep is the key factor here, as the lock renewal loop now executes twice as fast as before.
Could you help me verify whether hacking the sleep time back to 1s mitigates your issue?

I was honestly a bit sceptical about that being the culprit, as the performance decrease seemed too significant for that, and also because the issue doesn't show up for max_workers=1 even when the sleep time is 0.5s. I monkey-patched the sleep time back to 1s and indeed there was no notable change. Only with more radical values like 5s or more did I see slight improvements (~5-6s instead of 9s, but nowhere near the original performance). So I suspect this is not the issue here (at least not the only one).

My next guess was that the ThreadPoolExecutor was somehow conflicting with another one we're currently using for heartbeats. So I hooked up our existing executor with the AutoLockRenewer to test this hypothesis, but without luck unfortunately ... processing times stayed at around 9s.

I'll continue to dig through the changes and try to spot anything that might cause this behavior.

@yunhaoling
Contributor

hey @Gerrit-K, I took a closer look and I think it might be the dispatcher worker that's adding the performance overhead.

So basically, when max_workers > 1, there is (1) one dispatcher worker thread that keeps checking the task queue, while (2) the renew worker threads do the actual lock renewal work.

The dispatcher thread doesn't sleep while pumping the queue, which seems to be the bottleneck here.

what I am thinking is to add a sleep at the end of the while loop within _dispatch_worker, as follows:

    def _dispatch_worker(self):
        # ... existing code ...
        while ...:  # the dispatcher's main loop (condition elided)
            # ... existing code ...
            if time.time() - self._last_activity_timestamp >= self._dispatcher_timeout:
                self._running_dispatcher.clear()
                self._last_activity_timestamp = None
                return

            # [ADD A SLEEP HERE]
            time.sleep(self._sleep_time)  # save cpu cycles if there's currently no task in self._renew_tasks

would you kindly help me to verify it?

@Gerrit-K

Hi @yunhaoling, indeed that seems to be the culprit 👍 Now that I look at it again, I'm wondering how I didn't spot that ... I guess my attention was drawn solely to the inner loop instead of the outer one. Anyhow, a simple sleep as you've proposed does the trick.

@yunhaoling
Contributor

hey @Gerrit-K, thanks for the confirmation.

I have a PR out for that; let me do some more testing with it to make sure lock renewal works properly with the extra sleep.
#20352

If it's working fine, I will do a release with the improvement.
Apologies again for the trouble this has caused you.

@Gerrit-K

@yunhaoling no worries! Our automated tests caught this one pretty quickly, so we didn't have a big impact - just some confusion 😉 Glad we could be of help detecting this!

@github-actions github-actions bot locked and limited conversation to collaborators Apr 11, 2023