Cosmos DB Python SDK is not able to retry request in the new write region during failover #37721
Comments
Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @AbhinavTrips @bambriz @Pilchie @pjohari-ms @simorenoh.
Hi @yanfang-ma, thanks for reaching out. You are absolutely correct - this is a gap we recently identified in our SDK and have already been working on. The PR is here for your reference: #36514. We are just finalizing the testing work on our end to make sure we don't miss any edge cases, but these resiliency and reliability improvements to the SDK should be released later this month. Thank you for using our SDK, and I hope this has addressed your concerns - I can also ping on this issue once the release is out if you'd like!
Hi @yanfang-ma. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.
Hi @simorenoh Thanks for the information provided. For accounts with multiple regions but only one write region, once the feature is in place in the new SDK version, is any policy setting required in the code to let the SDK automatically retry the request in the new write region during a regional failover? I'm asking because, for multi-region write accounts, the documentation at https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/how-to-multi-master?tabs=api-async#python says a connection policy is required. I think the difference between single-region write and multi-region write is that, during a primary region outage, a failover happens on the server side for single-region write accounts (if service-managed failover is enabled), whereas for multi-region write accounts no server-side failover happens.
Hi @yanfang-ma, no, there is no additional setting necessary for a single write region account. While this functionality is currently missing in the Python SDK, it would be the default behavior, much like the initial document you linked says. For an account with write region A and read region B, for instance, upon a service-managed failover region B will be made into a write region and requests will be automatically routed to it by the SDK client with no additional configuration needed.
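As a rough illustration of the routing described above, the toy model below shows a client that caches the write region, treats a 403 on a write as a sign that the region is no longer writable, refreshes its view of the account, and retries once. `WriteForbidden`, `FakeAccount`, and `FakeClient` are hypothetical names invented for this sketch; it makes no network calls and is not the SDK's implementation.

```python
class WriteForbidden(Exception):
    """Stand-in for a 403 response to a write sent to a non-write region."""
    status_code = 403


class FakeAccount:
    """Toy account: one write region plus a list of read regions."""

    def __init__(self, write_region, read_regions):
        self.write_region = write_region
        self.read_regions = list(read_regions)
        self.items = []

    def fail_over(self, new_write_region):
        # Model a failover: a read region is promoted to write region,
        # and the old write region becomes a read region.
        self.read_regions.remove(new_write_region)
        self.read_regions.append(self.write_region)
        self.write_region = new_write_region

    def write(self, region, item):
        if region != self.write_region:
            raise WriteForbidden(f"{region} is not the write region")
        self.items.append(item)
        return region


class FakeClient:
    """Caches the write region and rediscovers it after a 403."""

    def __init__(self, account):
        self.account = account
        self.cached_write_region = account.write_region

    def create_item(self, item):
        try:
            return self.account.write(self.cached_write_region, item)
        except WriteForbidden:
            # Refresh the cached write region, then retry once against it.
            self.cached_write_region = self.account.write_region
            return self.account.write(self.cached_write_region, item)


account = FakeAccount(write_region="A", read_regions=["B"])
client = FakeClient(account)
first = client.create_item({"id": "1"})   # served by region A
account.fail_over("B")                    # B becomes the write region
second = client.create_item({"id": "2"})  # 403 from A, retried in B
```

In this model the application code only calls `create_item`; the rediscovery happens inside the client, which is the behavior the comment above describes as the default.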
Hi @simorenoh Thank you so much for the clarification. As you mentioned, the new version with the improvements should be released later this month. May I know whether it will be a beta version or an official version? And do we have an ETA for the official version that includes the improvements?
https://pypi.org/project/azure-cosmos/4.9.0/ has been released.
Hi @yanfang-ma. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text "/unresolve" to remove the "issue-addressed" label and continue the conversation.
version 4.8.0 has the relevant changes made publicly available - forgot to come back to this to close it but these changes are live.
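To check whether an environment already has a version at or above the 4.8.0 mentioned here, something like the following could be used. This is a minimal sketch: `parse_version`, `has_failover_changes`, and `installed_cosmos_version` are hypothetical helper names, and the comparison deliberately ignores pre-release suffixes.

```python
import re
from importlib import metadata


def parse_version(text):
    """Return the leading numeric components, e.g. '4.8.0' -> (4, 8, 0)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def has_failover_changes(version_text, minimum=(4, 8, 0)):
    # Pre-release suffixes such as 'b1' are ignored, so '4.8.0b1' counts as 4.8.0.
    return parse_version(version_text) >= minimum


def installed_cosmos_version():
    """Return the installed azure-cosmos version, or None if it is not installed."""
    try:
        return metadata.version("azure-cosmos")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    installed = installed_cosmos_version()
    if installed is None:
        print("azure-cosmos is not installed")
    else:
        print(installed, "includes the changes:", has_failover_changes(installed))
```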
Describe the bug
As mentioned in the documentation at https://learn.microsoft.com/en-us/azure/reliability/reliability-cosmos-db-nosql#service-managed-failover:
Regional failovers are detected and handled in the Azure Cosmos DB client. They don't require any changes from the application.
So I did some testing locally. My Cosmos DB account is a single region write account, and I added another region as the read region.
I have a Python script that keeps writing documents to the Cosmos DB container.
To Reproduce
Steps to reproduce the behavior:
Write a Python script that inserts a document into the container at a regular interval (in my code I did a create_item operation every 20 seconds).
While the script was running, I triggered the "Change write region" operation in the Azure Portal.
Then my Python script crashed with the trace below:
Traceback (most recent call last):
File "C:\XXX\azure\cosmos\_global_endpoint_manager.py", line 99, in refresh_endpoint_list
raise e
File "C:\XXX\azure\cosmos\_global_endpoint_manager.py", line 97, in refresh_endpoint_list
self._refresh_endpoint_list_private(database_account, **kwargs)
File "C:\XXX\azure\cosmos\_global_endpoint_manager.py", line 111, in _refresh_endpoint_list_private
database_account = self._GetDatabaseAccount(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
RecursionError: maximum recursion depth exceeded
I then checked my diagnostic logs and saw that the create document request got status code 403, and the SDK did not retry it in the newly promoted write region.
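The reproduction loop described in the steps above could look roughly like the sketch below. It is not the original script: `write_periodically` and `main` are hypothetical names, the endpoint, key, database, and container values are placeholders, and the item shape assumes a container partitioned on `/id`.

```python
import time
import uuid


def write_periodically(container, interval_seconds=20, max_writes=None, sleep=time.sleep):
    """Call container.create_item every interval_seconds; return the ids written."""
    written = []
    while max_writes is None or len(written) < max_writes:
        # Assumes the container's partition key path is /id.
        item = {"id": str(uuid.uuid4()), "written_at": time.time()}
        container.create_item(body=item)
        written.append(item["id"])
        print("wrote", item["id"])
        sleep(interval_seconds)
    return written


def main():
    # Requires the azure-cosmos package; every value below is a placeholder.
    from azure.cosmos import CosmosClient

    endpoint = "https://<account>.documents.azure.com:443/"
    key = "<account key>"
    client = CosmosClient(endpoint, key)
    container = client.get_database_client("<database>").get_container_client("<container>")
    # Trigger "Change write region" in the portal while this loop is running.
    write_periodically(container)
```

Calling `main()` starts the loop against a real account; passing `max_writes` bounds the run, and the `sleep` parameter exists only so the loop can be exercised without waiting.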
I even added the connectionPolicy below to my code, but it still didn't work.
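from azure.cosmos import CosmosClient, documents

# endpoint and key are the account URI and key, defined elsewhere in the script
connectionPolicy = documents.ConnectionPolicy()
connectionPolicy.EnableEndpointDiscovery = True
connectionPolicy.PreferredLocations = ['original write region', 'original read region']
client = CosmosClient(endpoint, key, connection_policy=connectionPolicy)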
I ran the same test with the Cosmos DB .NET SDK, which handled the regional failover correctly. The only setting in my CosmosClientOptions was ConnectionMode, and the .NET SDK detected and handled the failover in both Gateway mode and Direct mode.
Expected behavior
During a regional failover, the Python SDK should detect the event and retry the request in the newly promoted write region.