Cosmos DB Python SDK is not able to retry request in the new write region during failover #37721

Closed
yanfang-ma opened this issue Oct 4, 2024 · 10 comments
Assignees: simorenoh
Labels: Client, Cosmos, customer-reported, issue-addressed, question, Service Attention

Comments

@yanfang-ma

  • Package Name: azure-cosmos
  • Package Version: 4.7.0
  • Operating System: Windows 11
  • Python Version: 3.12.7

Describe the bug

As mentioned in the documentation at https://learn.microsoft.com/en-us/azure/reliability/reliability-cosmos-db-nosql#service-managed-failover:
"Regional failovers are detected and handled in the Azure Cosmos DB client. They don't require any changes from the application."

So I did some testing locally. My Cosmos DB account is a single-write-region account, and I added a second region as a read region.
I have a Python script that continuously writes documents into a Cosmos DB container.

To Reproduce
Steps to reproduce the behavior:
1. Write a Python script that inserts documents into a container at a fixed interval (in my code, one create_item operation every 20 seconds).
2. While the script is running, trigger the "Change write region" operation in the Azure portal.
3. The Python script crashes with the trace below:
Traceback (most recent call last):
  File "C:\XXX\azure\cosmos\_global_endpoint_manager.py", line 99, in refresh_endpoint_list
    raise e
  File "C:\XXX\azure\cosmos\_global_endpoint_manager.py", line 97, in refresh_endpoint_list
    self._refresh_endpoint_list_private(database_account, **kwargs)
  File "C:\XXX\azure\cosmos\_global_endpoint_manager.py", line 111, in _refresh_endpoint_list_private
    database_account = self._GetDatabaseAccount(**kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
RecursionError: maximum recursion depth exceeded
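
For anyone who wants to reproduce this, the steps above might look roughly like the sketch below. It is not the original script: the document shape and the helper names `make_container` and `write_periodically` are placeholders chosen for illustration, and the container is assumed to be partitioned on `/id`. Only `CosmosClient`, `get_database_client`, `get_container_client`, and `create_item` are actual azure-cosmos calls.

```python
import time
import uuid


def make_container(endpoint, key, database_name, container_name):
    """Build a container client. Requires the azure-cosmos package."""
    # Imported lazily so the write loop below can be exercised without the SDK.
    from azure.cosmos import CosmosClient

    client = CosmosClient(endpoint, credential=key)
    database = client.get_database_client(database_name)
    return database.get_container_client(container_name)


def write_periodically(container, interval_seconds=20, max_writes=None, sleep=time.sleep):
    """Insert one small document every `interval_seconds` seconds.

    With max_writes=None the loop runs until an exception escapes.
    Returns the number of documents written.
    """
    written = 0
    while max_writes is None or written < max_writes:
        # Hypothetical document shape; assumes the partition key path is /id.
        doc = {"id": str(uuid.uuid4()), "sequence": written}
        container.create_item(body=doc)
        written += 1
        sleep(interval_seconds)
    return written
```

Running `write_periodically(make_container(endpoint, key, database_name, container_name))` and then changing the write region in the portal should exercise the failover path described here.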

When I checked my diagnostic logs, I saw that the create document request got status code 403, and the SDK did not retry in the newly promoted write region.

I also added the connection policy below to my code, but it still did not work:
from azure.cosmos import CosmosClient, documents

connection_policy = documents.ConnectionPolicy()
connection_policy.EnableEndpointDiscovery = True
# Placeholders standing in for the account's actual region names
connection_policy.PreferredLocations = ['original write region', 'original read region']
client = CosmosClient(endpoint, key, connection_policy=connection_policy)

I also ran the same test with the Cosmos DB .NET SDK, which works fine during region failover. The only setting in my CosmosClientOptions was ConnectionMode, and in both Gateway mode and Direct mode the .NET SDK correctly detected and handled the regional failover.

Expected behavior
During a regional failover, the Python SDK should detect the event and retry the request in the newly promoted write region.
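
To make the expected behavior concrete, here is a minimal application-level sketch of "retry the write after a 403". It is not how the SDK implements failover handling. It assumes the raised exception exposes a `status_code` attribute (as azure-cosmos's `CosmosHttpResponseError` does) and that a freshly built client rediscovers the current write region; `create_with_failover_retry` and the `make_container` factory are hypothetical names. In the failure reported above the process ended with a RecursionError, so this illustrates the intent and is not a verified workaround.

```python
import time


def is_write_forbidden(exc):
    """True if `exc` looks like a 403 from a region that no longer accepts writes."""
    # Assumption: the exception carries `status_code`; anything else is not matched.
    return getattr(exc, "status_code", None) == 403


def create_with_failover_retry(make_container, doc, max_attempts=3,
                               backoff_seconds=5, sleep=time.sleep):
    """Try to create `doc`, rebuilding the container client after each 403.

    `make_container` is a zero-argument factory (hypothetical) that returns a
    fresh container client on every call.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        container = make_container()
        try:
            return container.create_item(body=doc)
        except Exception as exc:
            if not is_write_forbidden(exc):
                raise  # unrelated failure: do not hide it
            last_error = exc
            if attempt < max_attempts:
                sleep(backoff_seconds)
    # Every attempt was rejected with 403.
    raise last_error
```

The factory is passed in, instead of a single container object, so that each retry starts from a new client and does not reuse one that may still point at the old write region.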


@github-actions github-actions bot added Client This issue points to a problem in the data-plane of the library. Cosmos customer-reported Issues that are reported by GitHub users external to the Azure organization. needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team question The issue doesn't require a change to the product in order to be resolved. Most issues start as that Service Attention Workflow: This issue is responsible by Azure service team. labels Oct 4, 2024

github-actions bot commented Oct 4, 2024

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @AbhinavTrips @bambriz @Pilchie @pjohari-ms @simorenoh.

@simorenoh
Member

simorenoh commented Oct 4, 2024

Hi @yanfang-ma, thanks for reaching out. You are absolutely correct - this is a gap we have recently identified in our SDK, and have already been working on it. The PR is here for your reference: #36514

We are just finalizing the testing work on our end to make sure we don't miss any edge cases, but these resiliency and reliability improvements to the SDK should be getting released later this month. Thank you for using our SDK, and hope this has addressed your concerns - I can also ping on this issue once the release is out if you'd like!

@simorenoh simorenoh added needs-author-feedback Workflow: More information is needed from author to address the issue. and removed needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team labels Oct 4, 2024

github-actions bot commented Oct 4, 2024

Hi @yanfang-ma. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.

@yanfang-ma
Author

Hi @simorenoh

Thanks for the information provided.

For accounts with multiple regions but only one write region, once the feature is available in a new SDK version, is any policy setting required in the code for the SDK to automatically retry the request in the new write region during a regional failover?

I'm asking because for multi-region write accounts, according to the documentation at https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/how-to-multi-master?tabs=api-async#python, a connection policy is required.
So I would like to double-check with you whether any policy setting is required for single-write-region accounts.

I think the difference between the two is that, during a primary region outage, a single-write-region account fails over on the server side (if service-managed failover is enabled), whereas a multi-region write account has no server-side failover.

@github-actions github-actions bot added needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team and removed needs-author-feedback Workflow: More information is needed from author to address the issue. labels Oct 7, 2024
@simorenoh
Member

Hi @yanfang-ma, no, there is no additional setting needed for a single write region account. While this functionality is currently missing in the Python SDK, it would be the default behavior, much as the initial document you linked says.

For an account with write region A and read region B for instance, upon a service managed failover region B will be made into a write region and requests will be automatically routed to it by the SDK client with no additional configurations needed.
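
As a toy illustration of the region A / region B example, the sketch below picks the write endpoint from the list of writable locations an account reports. It is a simplified model and not the SDK's routing code; `resolve_write_endpoint` is a hypothetical helper, and the account name and endpoint URLs are made up.

```python
def resolve_write_endpoint(writable_locations, default_endpoint):
    """Return the endpoint that writes should be sent to.

    `writable_locations` mimics the write regions reported for an account:
    a list of {"name": ..., "databaseAccountEndpoint": ...} dicts.
    """
    if writable_locations:
        # For a single-write-region account the list has exactly one entry.
        return writable_locations[0]["databaseAccountEndpoint"]
    # Nothing reported yet: fall back to the account's global endpoint.
    return default_endpoint


# Hypothetical account metadata before and after a service-managed failover.
DEFAULT_ENDPOINT = "https://myaccount.documents.azure.com:443/"
BEFORE_FAILOVER = [
    {"name": "Region A",
     "databaseAccountEndpoint": "https://myaccount-regiona.documents.azure.com:443/"},
]
AFTER_FAILOVER = [
    {"name": "Region B",
     "databaseAccountEndpoint": "https://myaccount-regionb.documents.azure.com:443/"},
]
```

In this model, a client that re-reads the account metadata after the failover resolves writes to region B with no change to application configuration, which is the behavior described above.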

@simorenoh simorenoh added needs-author-feedback Workflow: More information is needed from author to address the issue. and removed needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team labels Oct 7, 2024

@simorenoh simorenoh self-assigned this Oct 7, 2024
@simorenoh simorenoh removed the needs-author-feedback Workflow: More information is needed from author to address the issue. label Oct 7, 2024
@github-actions github-actions bot added the needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team label Oct 7, 2024
@yanfang-ma
Author

Hi @simorenoh

Thank you so much for the clarification. You mentioned that the new version with the improvements should be released later this month. May I know whether it will be a beta version or an official version?
I checked the changelog at https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/cosmos/azure-cosmos/CHANGELOG.md and I see there are beta versions.

Do we also have an ETA for the official version that includes the improvements?

@xiangyan99
Member

https://pypi.org/project/azure-cosmos/4.9.0/ has been released.

@xiangyan99 xiangyan99 added the issue-addressed Workflow: The Azure SDK team believes it to be addressed and ready to close. label Jan 8, 2025

github-actions bot commented Jan 8, 2025

Hi @yanfang-ma. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text "/unresolve" to remove the "issue-addressed" label and continue the conversation.

@github-actions github-actions bot removed the needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team label Jan 8, 2025
@simorenoh
Member

Version 4.8.0 made the relevant changes publicly available. I forgot to come back and close this issue, but these changes are live.
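
A quick way to check whether an environment already has these changes is to compare the installed azure-cosmos version against 4.8.0, the version named above. The sketch below is one possible check; the helper names are hypothetical, the version parsing is deliberately simple (three numeric parts plus an optional `a`/`b`/`rc` pre-release suffix), and it assumes that pre-releases of later versions also carry the change.

```python
import re
from importlib import metadata

FIXED_IN = (4, 8, 0)  # version named in the comment above


def includes_failover_fix(version_string, minimum=FIXED_IN):
    """Return True if `version_string` is at least the final `minimum` release."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)((?:a|b|rc)\d+)?", version_string)
    if match is None:
        raise ValueError(f"unrecognized version: {version_string!r}")
    release = tuple(int(part) for part in match.groups()[:3])
    is_prerelease = match.group(4) is not None
    if release == minimum and is_prerelease:
        # e.g. "4.8.0b1" sorts before the final 4.8.0 release
        return False
    return release >= minimum


def installed_sdk_has_fix():
    """Check the azure-cosmos version installed in the current environment."""
    # Raises importlib.metadata.PackageNotFoundError if azure-cosmos is absent.
    return includes_failover_fix(metadata.version("azure-cosmos"))
```

For the version in the original report, `includes_failover_fix("4.7.0")` returns `False`, so upgrading (for example with `pip install --upgrade azure-cosmos`) would be the next step.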
