Support for identity based access to Azure using DefaultAzureCredential #18931

Closed
francesco086 opened this issue Sep 25, 2024 · 11 comments

@francesco086

Description

As pointed out in #11520, it is not possible to use anon=False in storage_options when reading data from the cloud.

I work with Azure, and as far as I know the only way to access the data using DefaultAzureCredential is the one described here: https://stackoverflow.com/questions/74136425/connecting-to-azure-storage-account-to-read-parquet-file-via-managed-identity-us
I could not find any information in the documentation about alternative ways, and AzureConfigKey does not seem to support this feature.

As identity-based access is the gold standard in security, I think it would be very important to support this feature.

@francesco086 francesco086 added the enhancement New feature or an improvement of an existing feature label Sep 25, 2024
@deanm0000
Collaborator

I'm 99% sure that you'll need to inquire with Object Store about this since that's what polars uses for cloud connectivity.

As an aside, have you tried this?

@francesco086
Author

@deanm0000 many thanks for the link to the Azure page, I didn't know about the possibility of a user delegation SAS!
It works :) It is a bit cumbersome and not so obvious though, so it would be nice to add this small hint to the polars documentation (I would be happy to contribute).

Let me first explore other scenarios and, if needed, get in touch with the Object Store devs. I will come back to this.

Or, if you think it's not worth it, you can close this ticket.
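
For readers who want to try the user delegation SAS route mentioned above, here is a minimal sketch, not an official recipe. It assumes the `azure-identity` and `azure-storage-blob` packages are installed and that the signed-in identity is allowed to request a user delegation key. The helper names `make_user_delegation_sas` and `sas_storage_options` are hypothetical, and passing the token under the `sas_token` key of storage_options is an assumption about the Object Store Azure configuration keys.

```python
from datetime import datetime, timedelta, timezone


def make_user_delegation_sas(account_name: str, container: str, hours: int = 1) -> str:
    """Return a read/list SAS token for `container`, signed with a user delegation key."""
    # Imported lazily so this module can be loaded without the Azure SDK installed.
    from azure.identity import DefaultAzureCredential
    from azure.storage.blob import (
        BlobServiceClient,
        ContainerSasPermissions,
        generate_container_sas,
    )

    start = datetime.now(timezone.utc)
    expiry = start + timedelta(hours=hours)

    service = BlobServiceClient(
        account_url=f"https://{account_name}.blob.core.windows.net",
        credential=DefaultAzureCredential(),
    )
    # The delegation key is issued to the signed-in identity, so no account key is needed.
    key = service.get_user_delegation_key(key_start_time=start, key_expiry_time=expiry)

    return generate_container_sas(
        account_name=account_name,
        container_name=container,
        user_delegation_key=key,
        permission=ContainerSasPermissions(read=True, list=True),
        expiry=expiry,
    )


def sas_storage_options(account_name: str, sas_token: str) -> dict[str, str]:
    """Build storage_options for polars; every value has to be a string."""
    # Assumption: "sas_token" is the key under which the SAS is accepted.
    return {"account_name": account_name, "sas_token": sas_token}


# Usage (placeholders, not real names):
# import polars as pl
# token = make_user_delegation_sas("<ACCOUNT_NAME>", "<CONTAINER>")
# lf = pl.scan_parquet(
#     "az://<CONTAINER>/path/to/file.parquet",
#     storage_options=sas_storage_options("<ACCOUNT_NAME>", token),
# )
```

The token is valid only until `expiry`, so long-running jobs would need to regenerate it.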

@deanm0000
Collaborator

I think it'd be good to put a note in the docs about it but I don't have any say on which PRs get accepted so you'll just have to try and see what happens.

@francesco086
Author

How-to summary

I'm leaving some notes for whoever ends up here with the same problem as mine:

  • behavior like adlfs's anon=False, which uses DefaultAzureCredential, is not possible with polars, because polars uses Object Store instead of adlfs, and Object Store does not support that feature (nor will it in the future)

  • if you want to authenticate using the identity of the Azure compute instance you are using (VM, kubernetes, etc.) then do something like this:

    pl.scan_parquet(source="az://your/path", storage_options={"account_name": "<ACCOUNT_NAME>"})

    In this way polars will source your credentials from IMDS.

  • if you want to authenticate using the identity that you used to log in from the CLI (az login), then do

    pl.scan_parquet(source="az://your/path", storage_options={"account_name": "<ACCOUNT_NAME>", "use_azure_cli": "True"})

Hope it will be helpful for someone
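
The two storage_options variants in the summary above differ only in the `use_azure_cli` entry, so they can be sketched as one small helper. This is an illustrative sketch; `azure_storage_options` is a hypothetical name, not part of polars.

```python
def azure_storage_options(account_name: str, use_cli: bool = False) -> dict[str, str]:
    """Build the storage_options dict from the notes above.

    With use_cli=False only the account name is passed, which per the notes
    lets credentials be sourced from IMDS on an Azure compute instance.
    With use_cli=True the `az login` identity is requested instead.
    """
    options = {"account_name": account_name}
    if use_cli:
        # The value is the string "True", not a bool, as in the example above.
        options["use_azure_cli"] = "True"
    return options


# Usage (placeholders, not real names):
# import polars as pl
# lf = pl.scan_parquet(
#     "az://your/path",
#     storage_options=azure_storage_options("<ACCOUNT_NAME>", use_cli=True),
# )
```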

@edgBR

edgBR commented Nov 6, 2024

Hi,

My two cents here, if this is being implemented: https://docs.pola.rs/api/python/dev/reference/api/polars.CredentialProvider.html

The easier way would be to retrieve the token via azure-identity and then pass it as the auth method in the storage options. Something like:

from azure.identity import (
    AzureCliCredential,
    ChainedTokenCredential,
    ManagedIdentityCredential,
)

# Placeholder: client ID of the user-assigned managed identity.
CLIENT_ID = "<CLIENT_ID>"


def get_chained_credentials():
    """Create and return a chained token credential for Azure authentication.

    The chain tries `AzureCliCredential` first and falls back to
    `ManagedIdentityCredential`.

    Returns
    -------
    chained_creds : ChainedTokenCredential
        A credential that authenticates with the Azure CLI login if available,
        otherwise with the managed identity identified by `CLIENT_ID`.
    """
    return ChainedTokenCredential(
        AzureCliCredential(),
        ManagedIdentityCredential(client_id=CLIENT_ID),
    )

With this we get a chained token credential that tries different auth methods, while always using token auth in polars.
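
One way to connect such a credential to polars is to wrap it in a function that returns a bearer token together with its expiry time, which is the shape discussed later in this thread. The sketch below is an assumption-laden illustration: `make_credential_provider` is a hypothetical helper, and the scope string is the one quoted further down in the thread as the polars default.

```python
STORAGE_SCOPE = "https://storage.azure.com/.default"


def make_credential_provider(credential, scope: str = STORAGE_SCOPE):
    """Wrap any object with a `get_token(*scopes)` method as a provider function."""

    def provider():
        # get_token() needs at least one scope as a positional argument.
        token = credential.get_token(scope)
        # Return the credentials dict plus the token's expiry time.
        return {"bearer_token": token.token}, token.expires_on

    return provider


# Usage (placeholders, not real names):
# import polars as pl
# from azure.identity import (
#     AzureCliCredential, ChainedTokenCredential, ManagedIdentityCredential,
# )
# chained = ChainedTokenCredential(
#     AzureCliCredential(),
#     ManagedIdentityCredential(client_id="<CLIENT_ID>"),
# )
# lf = pl.scan_parquet(
#     "az://container/table.parquet",
#     storage_options={"account_name": "myaccount"},
#     credential_provider=make_credential_provider(chained),
# )
```

The credential object is created once and reused, so each provider call only fetches a token.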

@daviewales

I have kind of the opposite problem.

On my local machine, storage_options = {"account_name": "mystorageaccount"} suffices to access a storage account using my existing az login session.
However, on an Azure VM with managed identity, Object Store prioritises the token retrieved from the IMDS.
Unfortunately in my case, the VM's managed identity does not have access to the resource in question.

I've tried extending the storage_options with "use_azure_cli": "True" to force it to use my az login session.
However, it doesn't appear to work.

@nameexhaustion
Collaborator

I have kind of the opposite problem.

On my local machine, storage_options = {"account_name": "mystorageaccount"} suffices to access a storage account using my existing az login session. However, on an Azure VM with managed identity, Object Store prioritises the token retrieved from the IMDS. Unfortunately in my case, the VM's managed identity does not have access to the resource in question.

I've tried extending the storage_options with "use_azure_cli": "True" to force it to use my az login session. However, it doesn't appear to work.

@daviewales could you ensure that the azure-identity Python package is installed? If this does not help, please open a new issue report.
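
A quick way to check whether azure-identity is installed in the environment that runs polars is to query the package metadata from the same interpreter. A minimal sketch using only the standard library; `installed_version` is a hypothetical helper name.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str = "azure-identity") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("azure-identity is not installed in this environment")
    else:
        print(f"azure-identity {version} is installed")
```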

@nameexhaustion
Collaborator

nameexhaustion commented Jan 29, 2025

I will close this issue, as the latest versions of polars now automatically use DefaultAzureCredential() if azure-identity is installed. Passing a custom credential_provider function is also an option for more advanced authentication flows.

Closed as completed via #20384

@nameexhaustion nameexhaustion changed the title Support for identity based access to cloud data Support for identity based access to Azure using DefaultAzureCredential Jan 29, 2025
@nameexhaustion nameexhaustion self-assigned this Jan 29, 2025
@daviewales

Thanks @nameexhaustion, I believe that the DefaultAzureCredential is working as intended. My issue is that DefaultAzureCredential prioritises the VM's Managed Identity over my az login identity.
It's possible to override this by creating my own credential as follows:

from azure.identity import DefaultAzureCredential
credential = DefaultAzureCredential(exclude_managed_identity_credential=True)

However, it's not clear to me how to tell Polars to use this.

I've tried passing credential_provider=credential (using the credential above), but this doesn't work.
In Pandas, I can do storage_options = {"account_name": "mystorageaccount", "anon": "true", "credential": credential}.
But in Polars, storage_options values must be String type.

@nameexhaustion
Collaborator

nameexhaustion commented Jan 29, 2025

@daviewales The credential provider given must also return an expiry time. It should work if you create a custom function to wrap it:

def credential_provider():
    credential = DefaultAzureCredential(exclude_managed_identity_credential=True)
    token = credential.get_token()

    return {
        "bearer_token": token.token,
    }, token.expires_on

q = pl.scan_parquet(..., credential_provider=credential_provider)

For reference, this is what we have internally:

token = credential.get_token(*self.scopes, tenant_id=self.tenant_id)
return {
    "bearer_token": token.token,
}, token.expires_on
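
The return shape described above (a dict of string credentials plus an expiry time) can be sanity-checked before a function is handed to polars. This is a rough sketch; `check_provider_result` is a hypothetical helper, and accepting `None` as the expiry is an assumption about what polars allows.

```python
def check_provider_result(result) -> None:
    """Raise if `result` does not look like (dict[str, str], expiry)."""
    if not isinstance(result, tuple) or len(result) != 2:
        raise TypeError("expected a (credentials, expiry) tuple")

    credentials, expiry = result
    if not isinstance(credentials, dict):
        raise TypeError("credentials must be a dict")
    for key, value in credentials.items():
        # Both keys and values are expected to be strings.
        if not isinstance(key, str) or not isinstance(value, str):
            raise TypeError(f"credential entry {key!r} must map str to str")

    # Assumption: the expiry is a UNIX timestamp in seconds, or None.
    if expiry is not None and not isinstance(expiry, int):
        raise TypeError("expiry must be an int timestamp or None")


# Usage: check_provider_result(credential_provider())
```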

@daviewales

Thanks @nameexhaustion, that got me most of the way there. The missing piece of the puzzle is that credential.get_token() requires scopes as positional arguments. Polars uses scopes = ["https://storage.azure.com/.default"] here:

scopes if scopes is not None else ["https://storage.azure.com/.default"]

So, the minimal working definition of a credential_provider which excludes managed identity credentials is as follows:

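from azure.identity import DefaultAzureCredential

def credential_provider():
    # Skip the VM's managed identity so the az login identity is used instead.
    credential = DefaultAzureCredential(exclude_managed_identity_credential=True)
    # get_token() requires at least one scope as a positional argument.
    token = credential.get_token("https://storage.azure.com/.default")

    return {"bearer_token": token.token}, token.expires_on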

You can then use it as expected:

pl.scan_parquet(
    'az://container/table.parquet',
    storage_options={'account_name': 'myaccount'},
    credential_provider=credential_provider
)
