-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Register in the client context a state to avoid reconnecting on Azure. #32
Conversation
hey @quentingodeau thanks a lot for the PR! The CI failure for the distribution pipeline ci can be fixed by bumping the version of the reusable workflow edit: and the Linux Release one seems some upstream server thats dead, i just ran into the same err on another pr |
Oh I misunderstood your comment, I rollback! |
e5bd770
to
bd92fdb
Compare
Hi @samansmink, does the change on the workflow are the one you ask me to do ? I saw in your fork that you have update a sha, it that what I was suppose to do ? |
This commit refactors the code (mostly moves it) to ease readability. It also add an Azure context that will be kept between the files access in a query. Previously if you were using a credential_chain provider when you query multiple files for each of them, we will have initiate a new connection and identify at the Azure AD (now Entra). Now this will be only performed once by query.
bd92fdb
to
7891cdb
Compare
Hi again @samansmink, |
hi @quentingodeau sorry for the delay on all your PRs, i've been on holiday last week and very busy with the 0.10 release this week. I'll review this one and the rest asap! |
No problem! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @quentingodeau thanks a lot for the PR, caching the BlobServiceClient for the duration of the query seems like a great idea!
Scanning a ~200 MB parquet file on a decently fast network I got a speedup from ~11.5s to ~9.4s with this PR. I would expect the difference to be even bigger for higher latency networks
I added 2 minor comments, the rest looks good!
src/azure_filesystem.cpp
Outdated
AzureContextState *result = nullptr; | ||
|
||
auto &storage_account = client_context->registered_state[DEFAULT_AZURE_STORAGE_ACCOUNT]; | ||
try { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Im not a big fan of the try-catch statement here. Could you rewrite this to not use one? I think something like:
auto lookup = client_context->registered_state.find(DEFAULT_AZURE_STORAGE_ACCOUNT);
if (lookup == client_context->registered_state.end()) {
// Create one and insert
} else {
// Refresh if invalid, otherwise return
}
is a lot cleaner, even though it might require 2 lookups in the registered_state map for AzureContextState
creation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I wanted to avoid the two lookup, but you are right it an unordered map so it should be almost like O(1) complexity. I kept some old habit when only map was available is the std^^
Done in ec242c0
return {container, prefix, path}; | ||
} | ||
|
||
std::shared_ptr<AzureContextState> AzureStorageFileSystem::GetOrCreateStorageAccountContext(FileOpener *opener, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we may be add an option for azure to enable/disable the caching feature? I think that would be nice to have as a workaround in case this causes issues for people.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sens done in ba56a50
Greetings @samansmink thx for the review!! |
Looks great now, thanks again! |
This commit refactors the code (mostly moves it) to ease readability. It also add an Azure context that will be kept between the files access in a query. Previously if you were using a credential_chain provider when you query multiple files for each of them, we will have initiate a new connection and identify at the Azure AD (now Entra). Now this will be only performed once by query.