Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-evaluate runtime override catalog mechanism (storage.xml vs storage.json) #11472

Closed
amaltaro opened this issue Feb 3, 2023 · 20 comments · Fixed by #11473
Closed

Re-evaluate runtime override catalog mechanism (storage.xml vs storage.json) #11472

amaltaro opened this issue Feb 3, 2023 · 20 comments · Fixed by #11473

Comments

@amaltaro
Copy link
Contributor

amaltaro commented Feb 3, 2023

Impact of the new feature
WMAgent and Tier0 agent

Is your feature request related to a problem? Please describe.
As briefly discussed in this CMSTalk link:
https://cms-talk.web.cern.ch/t/invalid-override-string-overriding-tfc-in-cmssw-12-6-2/19904/6

German found out that WMAgent creates an override XML catalog for jobs with multiple cmsRun steps. Likely to read the input file from the previous step, stored in the local scratch area.

Matti said that CMSSW can't support - without a recompilation/deployment of CMSSW - both storage.xml (File Catalog) and storage.json (Rucio Catalog). So we need to find a way to replace the current override mechanism, otherwise we will be forever bound to the CMSSW version and what is actually supported by it.

Describe the solution you'd like
There are a few tasks to be performed within this issue:

  1. first, understand why the override catalog is created and which information it provides;
  2. second, evaluate alternatives to having this override catalog (perhaps make a PSet tweak in the subsequent jobs)
  3. implement a new mechanism that is compatible with whatever CMSSW release the job is running with.

I acknowledge that technical details are still missing here and further discussion needs to happen with the CMSSW experts.

UPDATE: LogCollect jobs are also failing to execute edmCopyUtil. Here is one RelVal example using CMSSW13_3_0_pre3 release [1].

Describe alternatives you've considered
Releases older than CMSSW_12_6_2 already support storage.json. But in order to run real tests, we need to confirm whether a given release has been deployed with support to FileCatalog or Rucio Catalog.

Additional context
This - under development - pull request will add support to storage.json in WMAgent: #11440

One might ask, why didn't things fail in the Release Validation workflows. I think the reason for that is that we use TaskChain workflows for RelVals, thus always single cmsRun process within a job.

[1]

2023-01-31 05:14:26,505:INFO:LogCollect:Using edmCopyUtil: True
2023-01-31 05:14:26,505:INFO:LogCollect:Running scram
2023-01-31 05:14:32,326:INFO:LogCollect:Running edmCopyUtil
2023-01-31 05:14:32,326:INFO:Scram:Creating a subprocess to run the PSet setup.
2023-01-31 05:14:32,326:INFO:Scram:Also recording SCRAM command-line related output.
2023-01-31 05:14:32,336:INFO:Scram:    Invoking command: env X509_USER_PROXY=/srv/myproxy.pem edmCopyUtil /store/unmerged/logs/prod/2023/1/30/pdmvserv_RVCMSSW_13_0_0_pre3_G4VECGEOMBuMixing_14__G4VECGEOM_230130_125911_7667/DigiPU_2021PU/DigiPU_2021PUMergeFEVTDEBUGHLToutput/RecoNanoPU_2021PU/0000/0/2276bb14-5f50-4998-9b63-3511881e33da-0-0-logArchive.tar.gz /srv/job/WMTaskSpace/logCollect1
2023-01-31 05:14:32,536:INFO:Scram:Subprocess stdout was:
cms::Exception caught in EdmFileUtil
An exception of category 'SiteLocalConfigService' occurred while
   [0] edm::SiteLocalConfigService::dataCatalogs()
Exception Message:
Incomplete configuration. Valid site-local-config not found at /JobConfig/site-local-config.xml
@makortel
Copy link

makortel commented Feb 3, 2023

Thanks @amaltaro for opening the issue!

Matti said that CMSSW can't support - without a recompilation/deployment of CMSSW - both storage.xml (File Catalog) and storage.json (Rucio Catalog).

In principle we could look into making the use of storage.xml and storage.json configurable in the SiteLocalConfigService. But it is my understanding that the intention of the site support is to eventually drop storage.xml and move to storage.json only (@stlammel please confirm), and after thinking a while I became concerned that providing a configuration option would only make the situation more complex than it already is.

  1. first, understand why the override catalog is created and which information it provides;

As far as I can tell from reading the code in

def handleChainedProcessingTweak(self):
"""
_handleChainedProcessing_
In order to handle chained processing it's necessary to feed
output of one step/task (nomenclature ambiguous) to another.
This method creates particular mapping in a working Trivial
File Catalog (TFC).
"""
self.logger.info("Handling chained processing job")
# first, create an instance of TrivialFileCatalog to override
tfc = TrivialFileCatalog()
# check the jobs input files
inputFile = ("../%s/%s.root" % (self.step.data.input.inputStepName,
self.step.data.input.inputOutputModule))
tfc.addMapping("direct", inputFile, inputFile, mapping_type="lfn-to-pfn")
tfc.addMapping("direct", inputFile, inputFile, mapping_type="pfn-to-lfn")
self.tweak.addParameter('process.source.fileNames', "customTypeCms.untracked.vstring(%s)" % [inputFile])
self.tweak.addParameter("process.maxEvents", "customTypeCms.untracked.PSet(input=cms.untracked.int32(-1))")
tfcName = "override_catalog.xml"
tfcPath = os.path.join(os.getcwd(), tfcName)
self.logger.info("Creating override TFC and saving into '%s'", tfcPath)
tfcStr = tfc.getXML()
with open(tfcPath, 'w') as tfcFile:
tfcFile.write(tfcStr)
self.step.data.application.overrideCatalog = "trivialcatalog_file:" + tfcPath + "?protocol=direct"
return

does the following

  • craft a path to the input file (inputFile), which is local to the worker node (as far as CMSSW is concerned)
  • set up TFC that maps the input file to itself for both LFN->PFN and PFN->LFN mappings
  • tweaks PSet by setting process.source.fileNames to use inputFile
  • overrides the TFC for the job

Do I understand correctly that the intention is to tell for a cmsRun process in a chain to read a local file?

Is this function specific to Tier0? Or how is this case handled in StepChains? I'm just wondering because I've understood we have MC production workflows using CMSSW_12_6_X, but I haven't heard any (this kind of) complaints from there.

  • second, evaluate alternatives to having this override catalog (perhaps make a PSet tweak in the subsequent jobs)

If I understood the intention of the handleChainedProcessingTweak() code correctly, I think the function could simply do e.g.

        inputFile = ("file:../%s/%s.root" % (self.step.data.input.inputStepName,
                                        self.step.data.input.inputOutputModule))

        self.tweak.addParameter('process.source.fileNames', "customTypeCms.untracked.vstring(%s)" % [inputFile])

and drop the TFC handling. The file: "protocol" prefix tells cmsRun to use a local file directly, bypassing the file catalog altogether. This approach would work on any CMSSW release.

TODO: Evaluate how critical and how we should prioritize this development.

My core perspective it would be great to have the fix deployed so that we would not have to revert to TFC in 13_0_X. That would set the timeline to when Tier0 needs to start testing 13_0_X in light of the pp data taking.

@stlammel
Copy link

stlammel commented Feb 3, 2023

Yes, the plan is to phase out storage.xml in the long term. (With a auto-generation step from storage.json in the midterm as needed.)

  • Stephan

@amaltaro
Copy link
Contributor Author

amaltaro commented Feb 3, 2023

@makortel Hi Matti,

Do I understand correctly that the intention is to tell for a cmsRun process in a chain to read a local file?

yes, this is my understanding as well. It could be that not having the override file will have other unexpected effects down the chain (for staging out, or creating the job report), but I find that unlikely.

Is this function specific to Tier0? Or how is this case handled in StepChains? I'm just wondering because I've understood we have MC production workflows using CMSSW_12_6_X, but I haven't heard any (this kind of) complaints from there.

I just happened to find that we are missing log records for that SetupCMSSWPSet module. But looking at one StepChain workflow construction, I see it has the following set:

cmsunified_task_TSG-Run3Summer22EEGS-00087__v1_T_230126_153254_527.tasks.TSG-Run3Summer22EEGS-00087_0.steps.cmsRun1.tree.children.cmsRun2.tree.children.cmsRun3.input.chainedProcessing = True

which would cause it to execute the handleChainedProcessingTweak method.

and drop the TFC handling. The file: "protocol" prefix tells cmsRun to use a local file directly, bypassing the file catalog altogether. This approach would work on any CMSSW release.

Thank you for this suggestion. If this change is enough for chained requests, then it might be that we can get Tier0 working with that CMSSW release in the beginning of the next week (@germanfgv heads up that a patch might be coming up for a replay test :)).

My core perspective it would be great to have the fix deployed so that we would not have to revert to TFC in 13_0_X. That would set the timeline to when Tier0 needs to start testing 13_0_X in light of the pp data taking.

Thank you Matti!

@amaltaro amaltaro self-assigned this Feb 3, 2023
@amaltaro
Copy link
Contributor Author

amaltaro commented Feb 3, 2023

@makortel Matti, I meant to ask whether CMSSW will understand a relative path as the one you suggested above? Or should we tweak the PSet with the absolute path?

@makortel
Copy link

makortel commented Feb 3, 2023

I meant to ask whether CMSSW will understand a relative path as the one you suggested above? Or should we tweak the PSet with the absolute path?

Relative path works (and absolute path works too). The important part is the file: prefix.

@amaltaro
Copy link
Contributor Author

amaltaro commented Feb 7, 2023

@makortel Hi Matti, our initial tests with the proposed PR fixing this issue were successful and we can indeed stop creating the override catalog for multi step jobs.

However, German spotted issues with the LogCollect jobs, which rely on the edmCopyUtil to stage files in (copy files from the storage area over to the worker node scratch area), example:

2023-02-07 16:19:04,142:INFO:Scram:    Invoking command: env X509_USER_PROXY=/srv/serviceproxy-vocms001.pem edmCopyUtil -c trivialcatalog_file:/cvmfs/cms.cern.ch/SITECONF/T0_CH_CERN/PhEDEx/storage.xml?pr
otocol=eos

Do you know whether it is expected? In other words, have these utils also been converted to work only with the JSON catalog format?

@makortel
Copy link

makortel commented Feb 7, 2023

our initial tests with the proposed PR fixing this issue were successful and we can indeed stop creating the override catalog for multi step jobs

Very good!

Do you know whether it is expected? In other words, have these utils also been converted to work only with the JSON catalog format?

This behavior is, in fact, expected. The edmCopyUtil (and at least edmFileUtil) got implicitly converted to work only with the JSON catalog, meaning that while their code was not changed, the code underneath changed in a way that changes behavior.

I understood from https://cms-talk.web.cern.ch/t/invalid-override-string-overriding-tfc-in-cmssw-12-6-2/19904/6 that Tier-0 had to set the SITECONFIG_PATH to point to /cvmfs/cms.cern.ch/SITECONF/T0_CH_CERN. The same SITECONFIG_PATH setting should work for edmCopyUtil as well, i.e. the -c argument could likely be removed (assuming the protocol part does not need to be customized). Would that be feasible?

@amaltaro
Copy link
Contributor Author

amaltaro commented Feb 7, 2023

@makortel thank you for this extra piece of information. I can confirm that RelVals (using 13_3_0_pre3) are having 100% failure rate for their LogCollect jobs, and the reason is exactly the same. I have updated the initial description of this issue to consider that failure as well.

Matti, can you please confirm that the release CMSSW_12_6_3 only supports the JSON format? I found one workflow in production that I am considering to clone in my dev environment.

@amaltaro
Copy link
Contributor Author

amaltaro commented Feb 8, 2023

@makortel Matti, reading the error message once again, it looks like T0 and central production actually hit different errors. Where:
[^] T0 fail to properly apply the TFC override. Here is a more complete error message, provided from German [^]
[^^] Central production actually fails to locate the site local config. Error message in [^^]

[^]

2023-02-07 16:19:04,142:INFO:Scram:    Invoking command: env X509_USER_PROXY=/srv/serviceproxy-vocms001.pem edmCopyUtil -c trivialcatalog_file:/cvmfs/cms.cern.ch/SITECONF/T0_CH_CERN/PhEDEx/storage.xml?protocol=eos /store/unmerged/data/logs/prod/2023/2/7/Express_Run359473_StreamCalibration_Tier0_REPLAY_2023_ID230207145447_v7145447/Express/0000/0/Express-8b6df553-8e83-4a9b-9249-c22cb316be00-0-logArchive.tar.gz /store/unmerged/data/logs/prod/202
3/2/7/Express_Run359473_StreamCalibration_Tier0_REPLAY_2023_ID230207145447_v7145447/Express/0000/0/Express-61b52558-d03a-45c1-afc9-6739d2c0e3c3-0-logArchive.tar.gz /srv/job/WMTaskSpace/logCollect1
2023-02-07 16:19:04,242:INFO:Scram:Subprocess stdout was:
cms::Exception caught in EdmFileUtil
An exception of category 'TrivialFileCatalog' occurred.
Exception Message:
TrivialFileCatalog::connect: protocol was not supplied in the contact string 

[^^]

2023-01-31 03:22:18,401:INFO:Scram:    Invoking command: env X509_USER_PROXY=/srv/myproxy.pem edmCopyUtil /store/unmerged/logs/prod/2023/1/30/pdmvserv_RVCMSSW_13_0_0_pre3_LTOG4VECGEOMQCD_FlatPt_15_3000HS_14__LTOG4VECGEOM_230130_130649_6154/DigiPU_2021PU/DigiPU_2021PUMergeFEVTDEBUGHLToutput/RecoNanoPU_2021PU/0000/0/4c777ccb-084c-4c8f-b48a-3258348d0d39-1-0-logArchive.tar.gz /store/unmerged/logs/prod/2023/1/30/pdmvserv_RVCMSSW_13_0_0_pre3_LTOG4VECGEOMQCD_FlatPt_15_3000HS_14__LTOG4VECGEOM_230130_130649_6154/DigiPU_2021PU/DigiPU_2021PUMergeFEVTDEBUGHLToutput/RecoNanoPU_2021PU/0000/0/82ae41d7-335c-4e27-b2fa-ba0926c1df6f-0-0-logArchive.tar.gz /srv/job/WMTaskSpace/logCollect1
2023-01-31 03:22:18,641:INFO:Scram:Subprocess stdout was:
cms::Exception caught in EdmFileUtil
An exception of category 'SiteLocalConfigService' occurred while
   [0] edm::SiteLocalConfigService::dataCatalogs()
Exception Message:
Incomplete configuration. Valid site-local-config not found at /JobConfig/site-local-config.xml

For the first case, even though T0 provided the protocol string, I see that it still passes the storage.xml file and perhaps this is what causes the failure.

For the latter, we do not override anything. So I wonder how edmCopyUtils used to locate the correct catalog and how it is supposed to work with the new JSON catalog? Would you know these details?

@germanfgv
Copy link
Contributor

I understood from https://cms-talk.web.cern.ch/t/invalid-override-string-overriding-tfc-in-cmssw-12-6-2/19904/6 that Tier-0 had to set the SITECONFIG_PATH to point to /cvmfs/cms.cern.ch/SITECONF/T0_CH_CERN. The same SITECONFIG_PATH setting should work for edmCopyUtil as well, i.e. the -c argument could likely be removed (assuming the protocol part does not need to be customized). Would that be feasible?

Setting SITECONFIG_PATH was used because we were not being able to properly override the storage.json with 12_6_2. It turned out that the problem was that the override string was incorrect. I'm not sure this change will work with 12_6_3. I'll create a workflow changing SITECONFIG_PATH and without providing trivial file catalog. I'll let you know how it goes.

@stlammel
Copy link

stlammel commented Feb 8, 2023

Hallo German,
i don't think you need both. The override let's you specify one particular protocol, while the SITECONFIG_PATH setting switches to the data-access/stage-out in general (i.e. with primary and fallback protocols).
Thanks,

  • Stephan

@makortel
Copy link

makortel commented Feb 8, 2023

Matti, can you please confirm that the release CMSSW_12_6_3 only supports the JSON format?

No, 12_6_3 supports only the storage.xml.

The releases that use storage.json are everything after 12_6_0_pre2, with the exception of 12_6_X with X >= 3 (for now).

@makortel
Copy link

makortel commented Feb 8, 2023

[^] T0 fail to properly apply the TFC override. Here is a more complete error message, provided from German [^]
...
For the first case, even though T0 provided the protocol string, I see that it still passes the storage.xml file and perhaps this is what causes the failure.

Is the error with 12_6_3? It is strange, because everything should work just as before with storage.xml.

[^^] Central production actually fails to locate the site local config. Error message in [^^]
...

For the latter, we do not override anything. So I wonder how edmCopyUtils used to locate the correct catalog and how it is supposed to work with the new JSON catalog? Would you know these details?

Which release is this? (ok I see it's likely 13_0_0_pre3) In any case by default the edmCopyUtil uses the same site local config mechanism to figure out things as cmsRun does.

In the releases using storage.json, the edmCopyUtil uses the SITECONFIG_PATH environment variable to locate the site-local-config.xml, and from its location, walks upwards in the directory hierarchy to find the storage.json (number of ../ depending on whether the site is a subsite or not). The error looks like SITECONFIG_PATH would be unset (or set but empty).

@makortel
Copy link

makortel commented Feb 8, 2023

[^] T0 fail to properly apply the TFC override. Here is a more complete error message, provided from German [^]
...
For the first case, even though T0 provided the protocol string, I see that it still passes the storage.xml file and perhaps this is what causes the failure.

Is the error with 12_6_3? It is strange, because everything should work just as before with storage.xml.

I tried out the command edmCopyUtil -c trivialcatalog_file:/cvmfs/cms.cern.ch/SITECONF/T0_CH_CERN/PhEDEx/storage.xml?protocol=eos /store/unmerged/data/logs/prod/2023/2/7/Express_Run359473_StreamCalibration_Tier0_REPLAY_2023_ID230207145447_v7145447/Express/0000/0/Express-8b6df553-8e83-4a9b-9249-c22cb316be00-0-logArchive.tar.gz . locally in a 12_6_3 release, and the program worked beyond that point of error. The error message looks like the storage.xml?protocol=eos would not contain the word protocol spelled correctly, so I'm puzzled what could cause it.

In a storage.json-using release the error message would also be very different.

@amaltaro
Copy link
Contributor Author

amaltaro commented Feb 8, 2023

@makortel Matti, about the second error reported above, copied here again for clarity purposes:

Incomplete configuration. Valid site-local-config not found at /JobConfig/site-local-config.xml

I checked the job logs and we do have both old and new environment variables set (seen in the .out log file):

  CMS_PATH=/cvmfs/cms.cern.ch
  SITECONFIG_PATH=/cvmfs/cms.cern.ch/SITECONF/local

I also placed a full log tarball under this web directory:
https://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_11472/

Yes, it uses CMSSW_13_0_0_pre3 and this log belongs to a LogCollect job. Unfortunately, once again I find this logs poor and I cannot confirm what was the final setup at the worker node (in terms of CMSSW area).
Source code suggests that it was not changed though:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/Steps/Executors/LogCollect.py#L100-L115

@makortel
Copy link

makortel commented Feb 8, 2023

Thanks @amaltaro for the pointers. I poked around a bit in the WMCore code, and noticed that in Scram.__call__() the CMS_PATH environment variable is propagated explicitly to the sub-shell

if os.environ.get('CMS_PATH', None) is not None:
self.procWriter(proc, 'export CMS_PATH=%s\n' % os.environ['CMS_PATH'])

(among others).

Maybe SITECONFIG_PATH should be propagated in similar way? It is not clear to me why it would not be propagated from the parent process environment (as see in in the condor.*.out files), so I can only assume the explicit propagation of selected environment variables would be there for a reason.

@amaltaro
Copy link
Contributor Author

amaltaro commented Feb 8, 2023

Haa, good catch Matti.

The code claims (from many many years ago) that something doesn't work well and hence variables need to be explicitly passed to the subprocess. I never really investigated this any deeper though. Anyhow, I am going to make further modifications to our code to pass SITECONF_PATH instead of CMS_PATH. Thanks

@germanfgv
Copy link
Contributor

germanfgv commented Feb 8, 2023

Is the error with 12_6_3? It is strange, because everything should work just as before with storage.xml.

Yes, it is 12_6_3.

I tried out the command edmCopyUtil -c trivialcatalog_file:/cvmfs/cms.cern.ch/SITECONF/T0_CH_CERN/PhEDEx/storage.xml?protocol=eos /store/unmerged/data/logs/prod/2023/2/7/Express_Run359473_StreamCalibration_Tier0_REPLAY_2023_ID230207145447_v7145447/Express/0000/0/Express-8b6df553-8e83-4a9b-9249-c22cb316be00-0-logArchive.tar.gz . locally in a 12_6_3 release, and the program worked beyond that point of error. The error message looks like the storage.xml?protocol=eos would not contain the word protocol spelled correctly, so I'm puzzled what could cause it.

In a storage.json-using release the error message would also be very different.

I'm also puzzled by this. I get the same results as you.

Right now this is the only problem we have with 12_6_3. We are processing the MWGR runs fine, but LogCollect is failling. @amaltaro do you see this qith 12_6_3 in central production?

@makortel
Copy link

makortel commented Feb 8, 2023

Ha, with the understanding of #11472 (comment) I was able to reproduce the error of #11472 (comment) with

$ cmsrel CMSSW_12_6_3
$ cd CMSSW_12_6_3/src
$ cmsenv
$ unset SITECONFIG_PATH
$ edmCopyUtil -c trivialcatalog_file:/cvmfs/cms.cern.ch/SITECONF/T0_CH_CERN/PhEDEx/storage.xml?protocol=eos /store/unmerged/data/logs/prod/2023/2/7/Express_Run359473_StreamCalibration_Tier0_REPLAY_2023_ID230207145447_v7145447/Express/0000/0/Express-8b6df553-8e83-4a9b-9249-c22cb316be00-0-logArchive.tar.gz .
cms::Exception caught in EdmFileUtil
An exception of category 'TrivialFileCatalog' occurred.
Exception Message:
TrivialFileCatalog::connect: protocol was not supplied in the contact string 

(although I still don't understand why exactly edmCopyUtil fails like this)

So I was wrong in

because everything should work just as before with storage.xml.

because before 12_6_0_pre2 the location of site-local-config.xml is determined from CMS_PATH, whereas in 12_6_0_pre2 and after (including 12_6_3) the location is determined from SITECONFIG_PATH.

@germanfgv How big problem is this for Tier-0? The CMS_PATH behavior could be restored with a relatively simple emergency patch, but would require a new (patch) release.

@makortel
Copy link

makortel commented Feb 8, 2023

(although I still don't understand why exactly edmCopyUtil fails like this)

Found out the reason. Seems that the SiteLocalConfigService, when a valid site-local-config.xml file is not found, and when asked for the (TFC) data catalogs, it returns a dummy file:PoolFileCatalog.xml value as the path to the storage.xml. Now this URL has no protocols specified, which leads to the confusing exception message.

Digging from history, I see an earlier behavior of throwing an exception in case of missing site-local-config.xml was changed in 2006 to return the dummy value as a "workaround is probably sufficient for 0_6_1". The Rucio catalog code path throws a more meaningful exception.

germanfgv added a commit to dmwm/T0 that referenced this issue Feb 9, 2023
Avoids CMSSW versions not compatible with Trivial File Catalog.
dmwm/WMCore#11472
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

4 participants