Cosmos DB Python SDK is not able to retry request in the new write region during failover #37721

yanfang-ma · 2024-10-04T08:16:29Z

Package Name: azure-cosmos
Package Version: 4.7.0
Operating System: Windows 11
Python Version: 3.12.7

Describe the bug

As mentioned in the documentation at https://learn.microsoft.com/en-us/azure/reliability/reliability-cosmos-db-nosql#service-managed-failover,
regional failovers are detected and handled in the Azure Cosmos DB client. They don't require any changes from the application.

So I did some testing locally. My Cosmos DB account is a single-region write account, and I added another region as the read region.
I have a Python script that keeps writing documents into the Cosmos DB container.

To Reproduce
Steps to reproduce the behavior:
1. Write a Python script that inserts documents into the container continuously (in my code I did a create_item operation every 20 seconds).
2. While the script is running, trigger the "Change write region" operation in the Azure Portal.
3. The Python script then crashes with the trace below:
Traceback (most recent call last):
  File "C:\XXX\azure\cosmos\_global_endpoint_manager.py", line 99, in refresh_endpoint_list
    raise e
  File "C:\XXX\azure\cosmos\_global_endpoint_manager.py", line 97, in refresh_endpoint_list
    self._refresh_endpoint_list_private(database_account, **kwargs)
  File "C:\XXX\azure\cosmos\_global_endpoint_manager.py", line 111, in _refresh_endpoint_list_private
    database_account = self._GetDatabaseAccount(**kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
RecursionError: maximum recursion depth exceeded

I also checked my diagnostic logs and saw that the create document request got status code 403, and the SDK didn't retry the request in the newly promoted write region.
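
The reproduction steps above can be sketched roughly as follows. This is a minimal sketch, not the original script: `run_write_loop`, its parameters, and the document shape are hypothetical names chosen for illustration. The write call is passed in as a function so the loop itself has no dependency on the SDK, and each failure is recorded instead of crashing so that the failover window stays visible.

```python
import time
import uuid


def run_write_loop(write_item, iterations, interval_seconds=20.0, sleep=time.sleep):
    """Call write_item once per interval and record the outcome of each attempt.

    write_item is any callable that takes a document dict (hypothetical
    parameter; with the real SDK it could be a container's create_item).
    """
    outcomes = []
    for seq in range(iterations):
        doc = {"id": str(uuid.uuid4()), "seq": seq}
        try:
            write_item(doc)
            outcomes.append((seq, "ok"))
        except Exception as exc:
            # RecursionError is an Exception subclass, so it is recorded here too.
            outcomes.append((seq, f"{type(exc).__name__}: {exc}"))
        if seq < iterations - 1:
            sleep(interval_seconds)
    return outcomes
```

With the real SDK, `write_item` would presumably be `container.create_item`, assuming a container whose partition key matches a field present in the document (for example `/id`).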

I even added the connection policy below to my code, but it still didn't work.
from azure.cosmos import CosmosClient, documents

connectionPolicy = documents.ConnectionPolicy()
connectionPolicy.EnableEndpointDiscovery = True
connectionPolicy.PreferredLocations = ['original write region', 'original read region']
client = CosmosClient(endpoint, key, connection_policy=connectionPolicy)

I also ran the same test with the Cosmos DB .NET SDK, and it works fine during region failover. The only setting in my CosmosClientOptions was ConnectionMode, and in both Gateway Mode and Direct Mode the .NET SDK correctly detected and handled the regional failover.

Expected behavior
During regional failover, the Python SDK should be able to detect the event and retry the request in the newly promoted write region.

github-actions · 2024-10-04T08:17:28Z

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @AbhinavTrips @bambriz @Pilchie @pjohari-ms @simorenoh.

simorenoh · 2024-10-04T14:13:14Z

Hi @yanfang-ma, thanks for reaching out. You are absolutely correct - this is a gap we have recently identified in our SDK, and have already been working on it. The PR is here for your reference: #36514

We are just finalizing the testing work on our end to make sure we don't miss any edge cases, but these resiliency and reliability improvements to the SDK should be getting released later this month. Thank you for using our SDK, and hope this has addressed your concerns - I can also ping on this issue once the release is out if you'd like!

github-actions · 2024-10-04T14:25:18Z

Hi @yanfang-ma. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.

yanfang-ma · 2024-10-07T02:42:50Z

Hi @simorenoh

Thanks for the information provided.

For accounts with multiple regions but only one write region, once the feature is available in the new SDK version, is any policy setting required in the code for the SDK to automatically retry the request in the new write region during regional failover?

I'm asking because for multi-region write accounts, according to the documentation at https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/how-to-multi-master?tabs=api-async#python, a connection policy is required.
So I would like to double-check with you whether any policy setting is required for single-region write accounts.

I think the difference between single-region write and multi-region write is that, during a primary region outage, there will be a server-side failover for single-region write accounts (if service-managed failover is enabled), whereas for multi-region write accounts no server-side failover will happen.

simorenoh · 2024-10-07T04:01:37Z

Hi @yanfang-ma, no, there is no additional setting necessary for a single write region account. While this functionality is currently missing in the Python SDK, it would be the default behavior, much like the initial document you linked says.

For an account with write region A and read region B, for instance, upon a service-managed failover region B will be made into a write region, and requests will be automatically routed to it by the SDK client with no additional configuration needed.
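
To make the region A / region B scenario concrete, here is a toy model of that routing behavior. It is an illustration under stated assumptions and not the SDK's actual implementation: `WriteForbidden`, `write_with_failover`, `send`, and `discover_write_region` are all hypothetical names, and the 403 is modeled simply as an exception raised by the former write region.

```python
class WriteForbidden(Exception):
    """Hypothetical stand-in for the 403 returned by a region that no longer accepts writes."""


def write_with_failover(send, discover_write_region, cached_region, max_attempts=2):
    """Try the cached write region first; on a 403, rediscover the write region and retry.

    send(region) performs the write against one region.
    discover_write_region() returns the account's current write region.
    Returns a (region_used, result) tuple.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    region = cached_region
    last_error = None
    for _ in range(max_attempts):
        try:
            return region, send(region)
        except WriteForbidden as exc:
            # The cached region was demoted; refresh and retry against the new one.
            last_error = exc
            region = discover_write_region()
    raise last_error
```

In this model the caller never names region B explicitly; the retry target comes entirely from the rediscovery step, which mirrors the "no additional configuration" behavior described above.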

github-actions · 2024-10-07T04:02:02Z

Hi @yanfang-ma. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.

yanfang-ma · 2024-10-09T08:57:30Z

Hi @simorenoh

Thank you so much for the clarification. As you mentioned that the new version with the improvements should be released later this month, may I know whether it will be a beta version or an official version?
I checked the change log at https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/cosmos/azure-cosmos/CHANGELOG.md and I see there are beta versions.

And do we have an ETA for the official version that includes the improvements?

xiangyan99 · 2025-01-08T17:43:08Z

https://pypi.org/project/azure-cosmos/4.9.0/ has been released.

github-actions · 2025-01-08T17:43:30Z

Hi @yanfang-ma. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text "/unresolve" to remove the "issue-addressed" label and continue the conversation.

simorenoh · 2025-01-08T17:53:50Z

Version 4.8.0 has the relevant changes made publicly available. I forgot to come back to this to close it, but these changes are live.
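
For anyone who wants to confirm that their environment is on a release with these changes, a small check along these lines may help. It is a sketch only: the 4.8.0 threshold is taken from the comment above, the helper names are hypothetical, and a suffix on 4.8.0 itself (such as a beta tag) is conservatively treated as not confirmed.

```python
import re
from importlib import metadata

# Threshold taken from this thread, not from the changelog.
MIN_VERSION = (4, 8, 0)


def includes_failover_fix(version):
    """Return True if a version string is at or above the 4.8.0 stable release."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)(.*)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    release = tuple(int(part) for part in match.groups()[:3])
    suffix = match.group(4)
    if release == MIN_VERSION and suffix:
        # e.g. "4.8.0b1" sorts before "4.8.0", so do not count it.
        return False
    return release >= MIN_VERSION


def installed_azure_cosmos_version():
    """Return the installed azure-cosmos version string, or None if it is not installed."""
    try:
        return metadata.version("azure-cosmos")
    except metadata.PackageNotFoundError:
        return None
```

A typical use would be to call `installed_azure_cosmos_version()` at startup and log a warning when `includes_failover_fix` returns False for the result.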

github-project-automation bot added this to CosmosDB Python Eco-System Oct 4, 2024

simorenoh added needs-author-feedback Workflow: More information is needed from author to address the issue. and removed needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team labels Oct 4, 2024

github-actions bot added needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team and removed needs-author-feedback Workflow: More information is needed from author to address the issue. labels Oct 7, 2024

simorenoh added needs-author-feedback Workflow: More information is needed from author to address the issue. and removed needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team labels Oct 7, 2024

simorenoh self-assigned this Oct 7, 2024

simorenoh removed the needs-author-feedback Workflow: More information is needed from author to address the issue. label Oct 7, 2024

github-actions bot added the needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team label Oct 7, 2024

xiangyan99 added the issue-addressed Workflow: The Azure SDK team believes it to be addressed and ready to close. label Jan 8, 2025

github-actions bot removed the needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team label Jan 8, 2025

simorenoh closed this as completed Jan 8, 2025

github-project-automation bot moved this to Done in CosmosDB Python Eco-System Jan 8, 2025
