
azure-servicebus.AutoLockRenewer change ThreadPool usage #19362

Closed
m-credera opened this issue Jun 21, 2021 · 15 comments

@m-credera

Is your feature request related to a problem? Please describe.
The azure.servicebus.AutoLockRenewer is implicitly limited to renewing locks for at most max_workers number of messages. For an application that takes many messages off of the queue for parallel processing, this can present problems.

The azure.servicebus.AutoLockRenewer uses a single future inside of a ThreadPoolExecutor to monitor messages and renew locks. This means it's limited to monitoring at most max_workers messages at a time (since a thread pool only runs one future per-thread until the future completes).
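
As a rough illustration of the usage pattern being described, here is a minimal sketch (the connection string, queue name, and worker count are placeholders, and the comments reflect the pre-7.3.2 behavior discussed in this issue):

    from azure.servicebus import ServiceBusClient, AutoLockRenewer

    # Placeholders for illustration only.
    CONN_STR = "<service-bus-connection-string>"
    QUEUE_NAME = "<queue-name>"

    # One monitoring future is submitted per registered message, so a pool of
    # 4 workers can only keep ~4 message locks renewed at a time.
    renewer = AutoLockRenewer(max_workers=4)

    with ServiceBusClient.from_connection_string(CONN_STR) as client:
        with client.get_queue_receiver(queue_name=QUEUE_NAME) as receiver:
            # Pull a larger batch than max_workers for parallel processing.
            messages = receiver.receive_messages(max_message_count=20, max_wait_time=5)
            for msg in messages:
                # Registrations beyond the pool size queue up behind the busy
                # workers, so those locks may expire before they are renewed.
                renewer.register(receiver, msg, max_lock_renewal_duration=300)
            # ... process the messages in parallel, then settle them ...
            for msg in messages:
                receiver.complete_message(msg)

    renewer.close()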

Describe the solution you'd like
An alternative implementation could instead use max_workers futures monitoring a set of messages per future rather than a single message per future.
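
A very rough sketch of that idea (not the library's actual implementation; the class name and all of its internals are invented purely to illustrate "a set of messages per future", and error handling/unregistration is omitted):

    import threading
    import time
    from concurrent.futures import ThreadPoolExecutor


    class BatchedRenewerSketch:
        """Illustration only: each worker future renews a whole set of messages."""

        def __init__(self, max_workers=4, interval=1.0):
            self._executor = ThreadPoolExecutor(max_workers=max_workers)
            self._partitions = [[] for _ in range(max_workers)]
            self._lock = threading.Lock()
            self._shutdown = threading.Event()
            self._interval = interval
            # Submit exactly max_workers long-running futures up front.
            for partition in self._partitions:
                self._executor.submit(self._renew_loop, partition)

        def register(self, receiver, message):
            # Assign each message to the currently smallest partition.
            with self._lock:
                min(self._partitions, key=len).append((receiver, message))

        def _renew_loop(self, partition):
            # One future per worker, each covering many messages.
            while not self._shutdown.is_set():
                with self._lock:
                    items = list(partition)
                for receiver, message in items:
                    receiver.renew_message_lock(message)
                time.sleep(self._interval)

        def close(self):
            self._shutdown.set()
            self._executor.shutdown(wait=True)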

Describe alternatives you've considered
An alternative solution would be to implement parallelism with each worker taking a message off of the queue for itself to process. This is certainly possible, but there are still cases where taking more than max_workers messages off of the queue could be desirable (e.g. prefetching messages).

@ghost ghost added needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. customer-reported Issues that are reported by GitHub users external to the Azure organization. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Jun 21, 2021
@yunhaoling yunhaoling added Client This issue points to a problem in the data-plane of the library. Messaging Messaging crew Service Bus labels Jun 22, 2021
@ghost ghost removed the needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. label Jun 22, 2021
@yunhaoling yunhaoling self-assigned this Jun 22, 2021
@yunhaoling
Contributor

yunhaoling commented Jun 22, 2021

hey @michaelcredera, thanks for your feedback, I'll look into it.

--

Your understanding is correct. The current implementation indeed has this implicit limitation -- one monitored message per worker thread. This should be addressed.

An alternative implementation could instead use max_workers futures monitoring a set of messages per future rather than a single message per future.

thanks for the suggestions, they both sound like good plans. I'll allocate some time to think about them and will get back to you once I've figured it out.

@yunhaoling
Contributor

hey @michaelcredera, sorry, we have limited resources at the moment. Would you like to try making a PR for the change? I would be glad to review it 😊 (It's completely fine if you don't have the time; just FYI, we're an open-source library and we welcome user contributions.)

@yunhaoling yunhaoling added the help wanted This issue is tracking work for which community contributions would be welcomed and appreciated label Jun 29, 2021
@m-credera
Author

Hey @yunhaoling - I can try to get a PR up this week

@m-credera
Author

@yunhaoling Are there branching permissions on the repo? I didn't see anything in the contribution guide, but I'm getting a 403 error pushing my branch:

git push --set-upstream origin feature/19362-threadpool-improvements-for-servicebus
...
fatal: unable to access 'https://github.com/Azure/azure-sdk-for-python.git/': The requested URL returned error: 403

@yunhaoling
Contributor

hey @michaelcredera, creating a branch directly in the azure-sdk-for-python repo is not allowed for external contributors.
Instead, you can fork the azure-sdk-for-python repo and create a PR from a branch in your fork.
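
For reference, the fork-based flow looks roughly like this (the username below is a placeholder):

# fork Azure/azure-sdk-for-python on GitHub first, then:
git remote add fork https://github.com/<your-username>/azure-sdk-for-python.git
git push --set-upstream fork feature/19362-threadpool-improvements-for-servicebus
# finally, open a PR from that branch of your fork against Azure/azure-sdk-for-python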

@yunhaoling yunhaoling added this to the [2021] August milestone Jul 14, 2021
@yunhaoling
Contributor

hey @michaelcredera, just a friendly check-in: do you need any help with the PR?

@yunhaoling
Contributor

hey @michaelcredera, we have improved the implementation in release v7.3.2 so that the number of messages the auto lock renewer can handle is no longer limited by the max_workers value. Sorry for keeping you waiting, and please try it out.

however, it is still recommended to pass a large max_workers (or an executor with a large max_workers) when there are many objects to be renewed.
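
For example (a minimal sketch; the worker counts are arbitrary):

    from concurrent.futures import ThreadPoolExecutor
    from azure.servicebus import AutoLockRenewer

    # Option 1: let the renewer create its own, larger thread pool.
    renewer = AutoLockRenewer(max_workers=16)

    # Option 2: hand the renewer an existing executor to share.
    executor = ThreadPoolExecutor(max_workers=16)
    renewer_shared = AutoLockRenewer(executor=executor)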

@yunhaoling yunhaoling added the issue-addressed Workflow: The Azure SDK team believes it to be addressed and ready to close. label Aug 10, 2021
@ghost

ghost commented Aug 10, 2021

Hi @michaelcredera. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text “/unresolve” to remove the “issue-addressed” label and continue the conversation.

@Gerrit-K

Hey @yunhaoling, I'd like to note here that this change seemed to be the cause of a significant performance degradation we've observed in a couple of our apps where processing times increased from ~1.7s to ~9s per message.

We're running microservices which process one message at a time in a single process. However, since these services run as containers on an AKS cluster with multi-core nodes, Python still "thinks" it has multiple cores available, and thus the AutoLockRenewer creates a ThreadPoolExecutor with at least 5 threads (because we set neither executor nor max_workers). This, in turn, causes the dispatcher to submit the renewals as tasks instead of executing them directly. We can't say for sure, but we assume that this (in combination with the sleep time being reduced to 0.5s) causes a message ping-pong which wastes precious resources.

Now, I don't want to call this out as a bug, mainly because we can't validate our claims, but also because we're able to mitigate this issue by simply setting max_workers=1 (which doesn't really matter in our single-message-at-a-time case). However, I want to raise awareness that this has (or can have) a massive impact because it changed the default behavior of the AutoLockRenewer and the change was also released in a "patch" release (7.3.1 to 7.3.2), which was quite unexpected for us.
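
For anyone hitting the same symptom, the mitigation described above is a one-line change (a sketch; the "at least 5 threads" figure presumably comes from ThreadPoolExecutor's default of min(32, os.cpu_count() + 4) when neither executor nor max_workers is set):

    from azure.servicebus import AutoLockRenewer

    # Force a single renewal thread; in a one-message-at-a-time service the
    # extra dispatcher/worker threads are pure overhead.
    renewer = AutoLockRenewer(max_workers=1)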

@yunhaoling
Contributor

hey @Gerrit-K, I sincerely apologize for the trouble the change has caused you, and I totally agree that computing resources are precious.

I suspect that the 0.5s sleep is the key factor here, as the lock renewal loop now executes twice as fast as before.
Could you help me verify whether hacking the sleep time back to 1s mitigates your issue?

we set the interval to a shorter value because we want the auto lock renewer to be more "proactive" in renewing locks, to keep messages/sessions alive when the lock duration is short.

I would be glad to roll it back to 1s if that helps mitigate the performance issue you're experiencing. I'll also look for ways to save CPU cycles.
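
One way to test that locally might be to patch the renewer's sleep interval after construction (this assumes the interval is stored in the private _sleep_time attribute; it is internal and may change between releases):

    from azure.servicebus import AutoLockRenewer

    renewer = AutoLockRenewer()
    # Testing only: restore the pre-7.3.2 renewal interval by patching the
    # (assumed) private _sleep_time attribute back to 1 second.
    renewer._sleep_time = 1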

@Gerrit-K

Gerrit-K commented Aug 18, 2021

Hi @yunhaoling. Thanks for the hint.

I suspect that the 0.5s sleep is the key factor here, as the lock renewal loop now executes twice as fast as before.
Could you help me verify whether hacking the sleep time back to 1s mitigates your issue?

I was honestly a bit sceptical about that being the culprit, as the performance decrease seemed too significant for that, and also because the issue doesn't show up for max_workers=1 even when the sleep time is 0.5s. I monkey-patched the sleep time back to 1s and indeed there was no notable change. Only with more radical values like 5s or more did I see slight improvements (~5-6s instead of 9s, but nowhere near the original performance). So I suspect this is not the issue here (at least not the only one).

My next guess was that the ThreadPoolExecutor was somehow conflicting with another one we're currently using for heartbeats. So I hooked up our existing executor with the AutoLockRenewer to test this hypothesis, but without luck unfortunately ... processing times stayed at around 9s.

I'll continue to dig through the changes and try to spot anything that might cause this behavior.

@yunhaoling
Contributor

hey @Gerrit-K, I took a closer look and I think it might be the dispatcher worker that's adding the performance overhead.

So basically, when max_workers > 1, there is (1) one dispatcher worker thread that keeps checking the task queue, while (2) the renew worker threads do the actual lock renewal work.

The dispatcher thread doesn't sleep while pumping the queue, which seems to be the bottleneck here.

what I am thinking is to add a sleep at the end of the while loop within _dispatch_worker, as follows:

    def _dispatch_worker(self):
        # ... existing code ...
        while ...:  # the dispatcher's main loop (condition elided)
            # ... existing code ...
            if time.time() - self._last_activity_timestamp >= self._dispatcher_timeout:
                self._running_dispatcher.clear()
                self._last_activity_timestamp = None
                return

            # [ADD A SLEEP HERE]
            time.sleep(self._sleep_time)  # save cpu cycles if there's currently no task in self._renew_tasks

would you kindly help me to verify it?

@Gerrit-K

Hi @yunhaoling, indeed that seems to be the culprit 👍 Now that I look at it again, I'm wondering how I didn't spot that ... I guess my attention was drawn solely to the inner loop instead of the outer one. Anyhow, a simple sleep as you've proposed does the trick.

@yunhaoling
Contributor

hey @Gerrit-K, thanks for the confirmation.

I have a PR out for that; let me do some more testing with it to make sure lock renewal works properly with the extra sleep.
#20352

If it's working fine, I will do a release with the improvement.
Apologies again for the trouble this has caused you.

@Gerrit-K

@yunhaoling no worries! Our automated tests caught this one pretty quickly, so we didn't have a big impact - just some confusion 😉 Glad we could be of help detecting this!

@github-actions github-actions bot locked and limited conversation to collaborators Apr 11, 2023