Azure ADLS Gen2 with HNS: Unable to delta_scan FQDN abfss URL #71

Closed
gdubya opened this issue Aug 13, 2024 · 3 comments

Comments

@gdubya
Contributor

gdubya commented Aug 13, 2024

Using a fully qualified (FQDN) abfss URL for an HNS-enabled storage account causes delta_scan to fail.

For example (where the secret has already been configured for mystorageaccount), this works:

select count(*) from delta_scan('abfss://mycontainer/mydelta');

But this fails with a "bad request" error:

select count(*) from delta_scan('abfss://mycontainer@mystorageaccount.dfs.core.windows.net/mydelta');

I think the code for determining the bucket variable needs to be updated (around line 173)?

        // An abfss URL may contain the FQDN for the storage account,
        // e.g. "abfss://container@storage.dfs.core.windows.net/some/path".
        // 8 is the length of the "abfss://" prefix.
        auto end_of_fqdn_container = path.find('@', 8);
        if (StringUtil::StartsWith(path, "abfss://") && end_of_fqdn_container != string::npos) {
            // FQDN form: the container is everything between the prefix and '@'
            bucket = path.substr(8, end_of_fqdn_container - 8);
        } else {
            // Shorthand form: keep the existing behaviour
            bucket = path.substr(8, end_of_container - 8);
        }
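
For anyone who wants to try the idea outside the extension, here is a minimal standalone sketch of the same split, assuming only the two URL shapes shown in this issue. `AbfssParts` and `ParseAbfss` are hypothetical names made up for illustration; they are not part of duckdb_azure or the delta extension, and the real code may handle more cases.

```cpp
#include <optional>
#include <string>

// Hypothetical helper type for illustration only (not from duckdb_azure).
struct AbfssParts {
    std::string container;
    std::string account; // stays empty for the shorthand form
    std::string path;
};

// Splits "abfss://container@account.dfs.core.windows.net/some/path" (FQDN form)
// or "abfss://container/some/path" (shorthand form) into its parts.
// Returns std::nullopt when the URL does not start with "abfss://".
std::optional<AbfssParts> ParseAbfss(const std::string &url) {
    const std::string prefix = "abfss://";
    if (url.compare(0, prefix.size(), prefix) != 0) {
        return std::nullopt;
    }
    const auto start = prefix.size();
    const auto slash = url.find('/', start);
    const bool has_path = slash != std::string::npos;

    // Everything between the prefix and the first '/' is the authority part.
    const std::string authority =
        url.substr(start, has_path ? slash - start : std::string::npos);

    AbfssParts parts;
    parts.path = has_path ? url.substr(slash + 1) : "";

    const auto at = authority.find('@');
    if (at == std::string::npos) {
        // Shorthand form: the authority is just the container name.
        parts.container = authority;
        return parts;
    }
    // FQDN form: container before '@', account is the first label of the host.
    parts.container = authority.substr(0, at);
    const std::string host = authority.substr(at + 1);
    parts.account = host.substr(0, host.find('.'));
    return parts;
}
```

With the FQDN URL from this issue, the sketch yields container `mycontainer`, account `mystorageaccount` and path `mydelta`. With the shorthand URL, the account stays empty and would have to come from the configured secret.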

Also, the "azure_endpoint" is being hard-coded to "https://" + account_name + ".blob.core.windows.net/" instead of using .dfs. I guess the endpoint builder option does not need to be set when an FQDN is used? Or, if the endpoint is empty or null, fall back to the value of the DuckDB azure_endpoint setting (select current_setting('azure_endpoint');)?

I'm having trouble setting up my local environment for developing and testing the extension myself so this is currently just guesswork.

@gdubya
Contributor Author

gdubya commented Aug 14, 2024

Ah, with AZURE_LOG_LEVEL=verbose set, I can see that the first HEAD request sent (by the Delta kernel?) uses the wrong URL format when the abfss URL uses the FQDN:

[2024-08-14T13:14:25.3079187Z T: 140174094714560] INFO : HTTP Request : HEAD https://mycontainer@mystorage.blob.core.windows.net/mydelta/part-00000-997f2808-9277-4450-a24f-ba362d847604.c000.snappy.parquet

Using the "shorthand" abfss syntax, the HEAD request uses the correct format URL:

[2024-08-14T13:18:54.0109700Z T: 139716156116544] INFO : HTTP Request : HEAD https://mystorage.blob.core.windows.net/mycontainer/mydelta/part-00000-997f2808-9277-4450-a24f-ba362d847604.c000.snappy.parquet

So, now to figure out how that happened.
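
To make the difference between the two log lines concrete, here is a rough sketch of how the HTTPS URL would be assembled once the container and account have been separated. `BuildBlobUrl` is a hypothetical function written for this comment, not the actual code in duckdb_azure or the Delta kernel. It assumes the public `core.windows.net` endpoint suffix seen in the logs.

```cpp
#include <string>

// Hypothetical helper for illustration: builds the HTTPS URL for an object
// from already-separated parts. The "service" argument is "blob" or "dfs";
// the core.windows.net suffix is assumed from the logs in this issue.
std::string BuildBlobUrl(const std::string &account,
                         const std::string &container,
                         const std::string &path,
                         const std::string &service = "blob") {
    // The account belongs in the host name and the container in the path.
    // The failing request put "container@account" in the host instead.
    return "https://" + account + "." + service + ".core.windows.net/" +
           container + "/" + path;
}
```

Calling `BuildBlobUrl("mystorage", "mycontainer", "mydelta/file.parquet")` produces a URL of the same shape as the working request, with the container as the first path segment rather than as a `container@` prefix in the host.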

@gdubya
Contributor Author

gdubya commented Aug 26, 2024

After building the delta extension with my fork of the duckdb_azure extension it looks like this problem is fixed.

@samansmink
Collaborator

Closing this one, as it is fixed on nightlies now and scheduled for the upcoming DuckDB v1.1.0 release.
