-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
XrdAdaptor: Fix race condition leading to a crash when doing Prepare … #32763
Conversation
-code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-32763/20919
Code check has found code style and quality issues which could be resolved by applying following patch(s)
|
f634ec8
to
1b23289
Compare
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-32763/20920
|
A new Pull Request was created by @esindril (Elvin Sindrilaru) for master. It involves the following packages: Utilities/XrdAdaptor @makortel, @smuzaffar, @cmsbuild, @Dr15Jones can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
@cmsbuild, please test |
@Dr15Jones @bbockelm Could you take a look too? |
I can't make any reasonable comments on this since I have no knowledge of how xrootd handles its communication. |
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-5894ef/12586/summary.html Comparison SummarySummary:
|
@bbockelm I've pushed a new commit to use a smart pointer for handling the raw response buffer. Logically the file system object for which an async response handler is used should not be deleted until the async response handler is called (i.e. the response arrives) - this is a gist of the problem here and the patch fixes this race condition. |
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-32763/24391
|
Pull request #32763 was updated. @makortel, @smuzaffar, @cmsbuild, @Dr15Jones can you please check and sign again. |
please test |
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-5894ef/17459/summary.html Comparison SummaryThe workflows 140.53 have different files in step1_dasquery.log than the ones found in the baseline. You may want to check and retrigger the tests if necessary. You can check it in the "files" directory in the results of the comparisons Summary:
|
I spotted this problem while working on bumping xrootd to 5.3.x and fixed in basically the same way (without removing the handler class and unique_ptr), see #34700. I propose to close this one and go one with the other one ... I will apply the same additional changes that were done here, removal of handler class and the data member. The unique_ptr change is not correct in this PR ... one would need to do:
and then one doesn't need to do anything later ... I can fix it in the other PR if people agree. |
@osschar your suggested code will not compile as you are trying to bind an lvalue reference expected by the Prepare API to an rvalue returned by get on the unique pointer. This is why you need the extra code ... I think it's correct. |
Oh, sorry, there is no pointer& variant of get() ... then I don;t know what Brian meant. Isn't what is in now equivalent to just calling delete? If all this is correct, one would have to catch the exceptions, delete buffer and rethrow ... unless Prepare() would also have a signature with unique_ptr argument (which would be possible with 5.3 as it uses c++-14). Is it even possible Prepare() would throw an exception? It wouldn't be a major leak in any case :) |
As discussed in the Aug 17 core mtg and later confirmed by Brian, async call with discard of potential errors is the correct thing to do for Prepare as its most important effect is to warm up redirector caches (and potentially XCaches in the future). This will be implemented properly in #34700 ... please close this PR. |
-1 Thanks. |
Superseeded by the already merged #34700 |
PR description:
This pull request fixes a crash in the XrdAdaptor class where the async Prepare request can lead to a SEGV if the corresponding XrdCl::FileSystem is destroyed before the response is processed. This can be easily reproduced in every version of the CMSSW.
This race condition is triggered when the response to the Prepare request is slow and the memory occupied by the file XrdCl::FileSystem object is reused leading to a crash when the XrdCl::FileSystem destructor is called. The XrdCl::FileSystem goes out of scope in the
stagein
method and it is therefore destroyed, but the Prepare response still needs a valid XrdCl::FileSystem object.An example of a fully reproducible crash:
Also the crash is easier to reproduce if the Prepare request needs to be redirected to a different server as it is the case in EOS or if the Prepare response is slow.
Given the fact that one needs to wait for the prepare response in the
stagein
part of the code there is absolutely no reason to use the async Prepare interface from XrdCl - so I've modified it to use the synchronous Prepare call.PR validation:
A simple rerun of the test confirms that the fix is working properly.