Provide a pattern to wait for RBAC propagation #1567

Open
pakrym opened this issue Apr 21, 2021 · 13 comments
Labels
EngSys This issue is impacting the engineering system.

Comments

@pakrym
Contributor

pakrym commented Apr 21, 2021

Multiple services hit a problem where their live tests fail because RBAC permissions were not propagated in time. Each of them works around the problem in a different way:

By waiting a predefined amount of time: https://github.com/Azure/azure-sdk-for-net/blob/master/sdk/eventhub/test-resources-post.ps1?rgh-link-date=2021-04-21T00%3A02%3A41Z#L15-L16

Or by retrying the tests: Azure/azure-sdk-for-net#20559

To me, the goal of New-TestResources is to leave the test environment fully ready for the test run, and non-propagated RBAC is something we should solve in a centralized way.

@pakrym pakrym added the EngSys This issue is impacting the engineering system. label Apr 21, 2021
@pakrym
Contributor Author

pakrym commented Apr 21, 2021

cc @benbp @heaths

@heaths
Member

heaths commented Apr 21, 2021

There is a retry already, and the Az cmdlets have a backoff period (it doubles).
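For readers unfamiliar with a doubling backoff, here is a minimal Python sketch of the general idea. It is not the Az cmdlets' actual implementation; the function name, the starting delay, the retry count, and the cap are all hypothetical values chosen for illustration.

```python
from typing import Iterator


def doubling_delays(initial: float, attempts: int, cap: float) -> Iterator[float]:
    """Yield one delay per retry, doubling each time but never exceeding `cap`."""
    delay = initial
    for _ in range(attempts):
        yield min(delay, cap)
        delay *= 2


# Hypothetical values: start at 5 s, allow 6 retries, never sleep longer than 60 s.
print(list(doubling_delays(5.0, 6, 60.0)))
# prints [5.0, 10.0, 20.0, 40.0, 60.0, 60.0]
```

The cap keeps the last retries from waiting far longer than the propagation delay is likely to be.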

@pakrym
Contributor Author

pakrym commented Apr 21, 2021

It's not RBAC for the newly created resource group; it's for the resources created as part of the template deployment.

@heaths
Member

heaths commented Apr 21, 2021

I see. But why make everyone wait if it's not necessary? That was the whole reason for test-resources-post.ps1: to allow service-specific actions post-deployment. For Key Vault, for example, I have to activate Managed HSM, something that can't be done via ARM templates by design. Most services don't have this problem. In fact, RBAC assignments for Managed HSM don't take this long (they are data-plane RBAC that mimics control-plane RBAC, so perhaps that's why).

The purpose of the scripts that you want is still met. Users don't run test-resources-(pre|post).ps1 themselves. The scripts do that, so users just run the scripts for a service and it all works out in the end.

@pakrym
Contributor Author

pakrym commented Apr 21, 2021

I'm saying that unconditionally waiting is a bad solution and we need something better to recommend to partners.

@heaths
Member

heaths commented Apr 21, 2021

The test-resources-post.ps1 workaround seems good. That's why those scripts are used if present. I could update documentation to detail some example uses if you think that would help.

@pakrym
Contributor Author

pakrym commented Apr 21, 2021

> The test-resources-post.ps1 workaround seems good.

I disagree, it's slow and unreliable. It doesn't guarantee that RBAC propagates.

@weshaggard
Member

I agree we should at least come up with a pattern for this RBAC propagation issue so that folks can use that in their post scripts if needed. @pakrym do you know if there is any way to query the status?

@heaths
Member

heaths commented Apr 21, 2021

This is probably a better question for the ARM team. I'm not aware of any way to make sure RBAC has propagated, and it may well depend on some RPs' implementation. Still, I doubt this is something we could put in the New-TestResources.ps1 script unless there's some obvious way of detecting that RBAC permissions were even assigned - which can be done in test-resources-post.ps1, which Key Vault is doing for Managed HSM test resources.

@heaths
Member

heaths commented Apr 21, 2021

There is Get-AzRoleAssignment. A pattern in documentation, or even an inherited (via scope) function, could maybe loop on that. Initially, I'm in favor of documenting a pattern, since providing all the data necessary may be more verbose and cause compat issues later if we have to change it.

@pakrym
Contributor Author

pakrym commented Apr 21, 2021

> @pakrym do you know if there is any way to query the status?

It seems that you have to actually call the service to know. @kasobol-msft is working on a prototype that allows partners to define a way to check if the environment is ready as part of the .NET testing framework. We would make a simple call, see if it fails with an auth error, and keep retrying.
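The probe-and-retry idea described above could look roughly like the following Python sketch. This is not the prototype mentioned in the comment: `AuthorizationError`, `wait_until_authorized`, and `fake_probe` are hypothetical names, and the timeout and interval defaults are assumptions. A real probe would make one cheap call against the service and translate a 401/403 response into the exception.

```python
import time
from typing import Callable


class AuthorizationError(Exception):
    """Hypothetical stand-in for a service call rejected with 401/403."""


def wait_until_authorized(
    probe: Callable[[], None],
    timeout: float = 300.0,   # assumed upper bound on propagation time
    interval: float = 5.0,    # assumed pause between probes
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Call `probe` until it stops raising AuthorizationError.

    Returns the number of attempts made. Raises TimeoutError if another
    wait would pass the deadline. Any other exception propagates at once,
    so genuine failures are not hidden by the retry loop.
    """
    deadline = clock() + timeout
    attempts = 0
    while True:
        attempts += 1
        try:
            probe()
            return attempts
        except AuthorizationError:
            if clock() + interval > deadline:
                raise TimeoutError(f"still unauthorized after {attempts} attempts")
            sleep(interval)


# Usage with a fake probe that is rejected twice before succeeding.
calls = {"n": 0}


def fake_probe() -> None:
    calls["n"] += 1
    if calls["n"] < 3:
        raise AuthorizationError("403 Forbidden")


print(wait_until_authorized(fake_probe, interval=0.0))  # prints 3
```

Retrying only on the authorization failure is deliberate: a misconfigured resource should fail fast rather than burn the whole timeout.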

@kasobol-msft
Contributor

kasobol-msft commented Apr 21, 2021

@heaths The problem is that Get-AzRoleAssignment might give you the assignments, but they might still be in flight to the storage service side. If the storage backend hasn't yet consumed the change, then calls to storage fail with 403.

I don't like the idea of waiting for RBAC at the resource provisioning step. That time can be used to advance the pipeline (especially since the pessimistic case is 5 minutes). The other thing is that I'd need to build something that probes some API and checks for 403s, which is easier with the SDK in hand.
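One way to picture "use that time to advance the pipeline" is to start the readiness check in the background and block on it only right before the live tests need the permissions. The Python sketch below assumes that shape; `wait_for_rbac`, `build_tests`, and `run_pipeline` are hypothetical names, and the short sleeps stand in for the real propagation delay and build work.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Tuple


def wait_for_rbac() -> str:
    """Hypothetical readiness check; real code would probe the service for 403s."""
    time.sleep(0.2)  # stands in for the propagation delay
    return "ready"


def build_tests() -> str:
    """Hypothetical pipeline step that does not need the new permissions."""
    time.sleep(0.2)  # stands in for build or restore work
    return "built"


def run_pipeline() -> Tuple[str, str]:
    with ThreadPoolExecutor(max_workers=1) as pool:
        readiness = pool.submit(wait_for_rbac)  # probing starts right away
        built = build_tests()                   # overlaps with the wait
        # Block only here, just before the live tests would run.
        # 300 s mirrors the 5-minute pessimistic case mentioned above.
        return built, readiness.result(timeout=300)


print(run_pipeline())  # prints ('built', 'ready')
```

With this arrangement the wait costs extra time only when propagation outlasts the steps that run alongside it.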

4 participants