Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

For async calls always store XrdCl::FileSystem with the response-handler. #35455

Conversation

osschar
Copy link
Contributor

@osschar osschar commented Sep 28, 2021

This is a continuation of #34700.

XrdCl's internal response handlers (before user-supplied handler is called) access URL string that is part of the FileSystem object. As FileSystemObject was created on the stack, this gets destroyed while async request are still pending.

This PR moves FileSystem object into our response-handlers so it's lifetime is the same as that of the response. All response handlers auto-destruct as needed.

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @osschar (Matevž Tadel) for CMSSW_12_1_DEVEL_X.

It involves the following packages:

  • Utilities/XrdAdaptor (core)

@makortel, @smuzaffar, @cmsbuild, @Dr15Jones can you please review it and eventually sign? Thanks.
@wddgit this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@makortel
Copy link
Contributor

test parameters:

  • enable_tests = threading

@makortel
Copy link
Contributor

@cmsbuild, please test

@makortel
Copy link
Contributor

Thanks @osschar!

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-7b09b3/19209/summary.html
COMMIT: 360ca4b
CMSSW: CMSSW_12_1_DEVEL_X_2021-09-27-2300/slc7_amd64_gcc900
Additional Tests: THREADING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/35455/19209/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 40
  • DQMHistoTests: Total histograms compared: 3211080
  • DQMHistoTests: Total failures: 0
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3211058
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 39 files compared)
  • Checked 169 log files, 37 edm output root files, 40 DQM output files
  • TriggerResults: no differences found

@makortel
Copy link
Contributor

+1

Base branch is already DEVEL.

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next CMSSW_12_1_DEVEL_X IBs (tests are also fine) and once validation in the development release cycle CMSSW_12_1_X is complete. This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@qliphy
Copy link
Contributor

qliphy commented Sep 29, 2021

+1

@cmsbuild cmsbuild merged commit 9ba0bf8 into cms-sw:CMSSW_12_1_DEVEL_X Sep 29, 2021
@smuzaffar
Copy link
Contributor

This has been integrated in DEVEL IBs and we now have two DEVEL IBs with out any failure. Thanks a lot @osschar for these changes

@smuzaffar
Copy link
Contributor

backport done
Successfully backported PR #35455 as #35481 for branch master

@osschar
Copy link
Contributor Author

osschar commented Sep 30, 2021

Yay, good news :)

Forgot to mention before, yesterday I also went over the dead-lock-like stack-traces Dan pointed to previously ... and I realized it was the same thing, that XrdCl response handler came back upon a released FileSystem object and what was supposed to be its internal mutex was in a presumed locked state ... and this blocked the destruction of XrdCl at the end of the job.

I also asked Michal about stack-allocated FileSystem object vs. xroot-5.3 and he says this should never have worked, even with 4.12, that the FS object needs to stay alive until the async handler completes. Now, since this (apparently) never gave us trouble, I believe the calls in question were really implemented as synchronous calls in 4.12.

Once this is merged in master (thank you!) I will also add an increased source-reselection timeout after server response is "max-number-of-redirections-reached" as it is unlikely any further open requests could result in new sources.

I am somewhat tempted to investigate if xrootd-5.3 could be backported to some older releases ... in particular those that will be used for analysis as the changes introduced here allow for a more error tolerant configuration of XCache clusters in view of errors encountered on individual cache servers. [ Alternatively, if storage.xml is (or can be made) CMSSW release-dependent, one could operate two cache redirectors with different settings for releases that have 5.3 and those that still have 4.12. I'll check how this works with computing guys and what releases are realistically expected to be used for physics in the near future (I suspect I know what the answer to this will be, sigh :) ). ]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants