Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Register in the client context a state to avoid reconnecting on Azure. #32

Merged
merged 3 commits into from
Feb 19, 2024

Conversation

quentingodeau
Copy link
Contributor

This commit refactors the code (mostly moves it) to ease readability. It also add an Azure context that will be kept between the files access in a query. Previously if you were using a credential_chain provider when you query multiple files for each of them, we will have initiate a new connection and identify at the Azure AD (now Entra). Now this will be only performed once by query.

@samansmink
Copy link
Collaborator

samansmink commented Feb 1, 2024

hey @quentingodeau thanks a lot for the PR!

The CI failure for the distribution pipeline ci can be fixed by bumping the version of the reusable workflow duckdb/duckdb/.github/workflows/_extension_distribution.yml
in .github/workflows/MainDistributionPipeline.yml to latest master of duckdb (currently: 64785543f882f8c9b2f81ae71fa43eadf1653573)

edit: and the Linux Release one seems some upstream server thats dead, i just ran into the same err on another pr

@quentingodeau
Copy link
Contributor Author

Oh I misunderstood your comment, I rollback!

@quentingodeau
Copy link
Contributor Author

Hi @samansmink, does the change on the workflow are the one you ask me to do ? I saw in your fork that you have update a sha, it that what I was suppose to do ?

This commit refactors the code (mostly moves it) to ease readability. It
also add an Azure context that will be kept between the files access in
a query. Previously if you were using a credential_chain provider
when you query multiple files for each of them, we will have initiate a
new connection and identify at the Azure AD (now Entra). Now this will
be only performed once by query.
@quentingodeau
Copy link
Contributor Author

Hi again @samansmink,
for info I have update the PR with the changes that you have done in the main branch :)
regards,
Quentin

@samansmink
Copy link
Collaborator

hi @quentingodeau sorry for the delay on all your PRs, i've been on holiday last week and very busy with the 0.10 release this week. I'll review this one and the rest asap!

@quentingodeau
Copy link
Contributor Author

quentingodeau commented Feb 13, 2024

No problem!
Take your time :)
I known that I have move a lot of things and sorry for that, but I saw that the way the extension/unit test was build I was causing issue when you add some headers (because the includes options where not propagate with CMake)

Copy link
Collaborator

@samansmink samansmink left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @quentingodeau thanks a lot for the PR, caching the BlobServiceClient for the duration of the query seems like a great idea!

Scanning a ~200 MB parquet file on a decently fast network I got a speedup from ~11.5s to ~9.4s with this PR. I would expect the difference to be even bigger for higher latency networks

I added 2 minor comments, the rest looks good!

AzureContextState *result = nullptr;

auto &storage_account = client_context->registered_state[DEFAULT_AZURE_STORAGE_ACCOUNT];
try {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im not a big fan of the try-catch statement here. Could you rewrite this to not use one? I think something like:

auto lookup = client_context->registered_state.find(DEFAULT_AZURE_STORAGE_ACCOUNT);
if (lookup == client_context->registered_state.end()) {
	// Create one and insert
} else {
	// Refresh if invalid, otherwise return 
}

is a lot cleaner, even though it might require 2 lookups in the registered_state map for AzureContextState creation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I wanted to avoid the two lookup, but you are right it an unordered map so it should be almost like O(1) complexity. I kept some old habit when only map was available is the std^^
Done in ec242c0

return {container, prefix, path};
}

std::shared_ptr<AzureContextState> AzureStorageFileSystem::GetOrCreateStorageAccountContext(FileOpener *opener,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we may be add an option for azure to enable/disable the caching feature? I think that would be nice to have as a workaround in case this causes issues for people.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sens done in ba56a50

@quentingodeau
Copy link
Contributor Author

Greetings @samansmink thx for the review!!

@samansmink
Copy link
Collaborator

Looks great now, thanks again!

@samansmink samansmink merged commit 923ff39 into duckdb:main Feb 19, 2024
18 checks passed
@quentingodeau quentingodeau deleted the feature/performance branch February 19, 2024 16:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants