2018 replay for PPS PCL test #4640

Closed
wants to merge 4 commits into from

tvami
Contributor

@tvami tvami commented Dec 17, 2021

Replay Request

Requestor

AlCaDB

Describe the configuration

Purpose of the test

To test the newly introduced PPS PCL workflows. PPS data is not available in 2021, so we need to run on the 2018 data.

T0 Operations HyperNews thread

https://hypernews.cern.ch/HyperNews/CMS/get/tier0-Ops/2326.html

@tvami
Contributor Author

tvami commented Dec 17, 2021

run replay please

@cmsdmwmbot

Replay testing PR '2018 replay for PPS PCL test'
An automatic replay has been requested by tvami.
Here is a brief description of the replay.
Deployment ID: 211217151408
Github PR: #4640
PR author: tvami
Requestor: AlCaDB
Injected runs: 324841
CMSSW release: CMSSW_12_2_0_pre3
Tier0 release: 3.0.1
ppScenario: ppEra_Run2_2018
Tier0 Config: https://cmst0.web.cern.ch/CMST0/tier0/offline_config/ReplayOfflineConfiguration_047.php
Contatiner ID: 1
Jenkins Build: https://cmssdt.cern.ch/dmwm-jenkins/job/DMWM-T0-PR-test-job/385/
Jira Issue : https://its.cern.ch/jira/browse/CMSTZDEV-712

@cmsdmwmbot

There are 309 filesets not closed.
There is 0 paused job in the replay.

@cmsdmwmbot

There are 17 repack workflows.
There are 5 express workflows.
There are 877 filesets not closed.
There is 1 paused job in the replay.

@cmsdmwmbot

There are 15 repack workflows.
There are 5 express workflows.
There are 2321 filesets not closed.
There are 1270 paused jobs in the replay.

@cmsdmwmbot

There are 7 repack workflows.
There are 3 express workflows.
There are 2960 filesets not closed.
There are 1272 paused jobs in the replay.

@cmsdmwmbot

There are 7 repack workflows.
There are 2 express workflows.
There are 4553 filesets not closed.
There are 6 paused jobs in the replay.
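
The bot status comments above follow a fixed sentence pattern, so the counts can be pulled out and compared across updates. The following is a minimal sketch, not part of the Tier0 tooling: the regular expression and the `parse_status` helper are hypothetical, and the input strings are copied from the five status comments of the first replay.

```python
import re

# Hypothetical pattern matching the sentences used in the bot comments.
STATUS_RE = re.compile(
    r"There (?:is|are) (\d+) "
    r"(repack workflows?|express workflows?|filesets not closed|paused jobs?)"
)

def parse_status(comment):
    """Return a dict such as {'filesets': 309, 'paused': 0} for one comment."""
    counts = {}
    for number, label in STATUS_RE.findall(comment):
        key = label.split()[0]  # 'repack', 'express', 'filesets' or 'paused'
        counts[key] = int(number)
    return counts

# Status comments of the first replay, copied from the thread.
comments = [
    "There are 309 filesets not closed.\n"
    "There is 0 paused job in the replay.",
    "There are 17 repack workflows.\nThere are 5 express workflows.\n"
    "There are 877 filesets not closed.\nThere is 1 paused job in the replay.",
    "There are 15 repack workflows.\nThere are 5 express workflows.\n"
    "There are 2321 filesets not closed.\nThere are 1270 paused jobs in the replay.",
    "There are 7 repack workflows.\nThere are 3 express workflows.\n"
    "There are 2960 filesets not closed.\nThere are 1272 paused jobs in the replay.",
    "There are 7 repack workflows.\nThere are 2 express workflows.\n"
    "There are 4553 filesets not closed.\nThere are 6 paused jobs in the replay.",
]

history = [parse_status(c) for c in comments]
paused_history = [status["paused"] for status in history]
print(paused_history)
```

Read this way, the jump from 1 to 1270 paused jobs between the second and third update is the point at which the replay started failing in bulk.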

@cmsdmwmbot

Replay testing PR '2018 replay for PPS PCL test'
An automatic replay has been requested by tvami.
Here is a brief description of the replay.
Deployment ID: 211217151408
Github PR: #4640
PR author: tvami
Requestor: AlCaDB
Injected runs: 324841
CMSSW release: CMSSW_12_2_0_pre3
Tier0 release: 3.0.1
ppScenario: ppEra_Run2_2018
Tier0 Config: https://cmst0.web.cern.ch/CMST0/tier0/offline_config/ReplayOfflineConfiguration_047.php
Contatiner ID: 1
Jenkins Build: https://cmssdt.cern.ch/dmwm-jenkins/job/DMWM-T0-PR-test-job/386/
Jira Issue : https://its.cern.ch/jira/browse/CMSTZDEV-713

@cmsdmwmbot

There are 2971 filesets not closed.
There are 3 paused jobs in the replay.(Memory error count: {})

@tvami
Contributor Author

tvami commented Dec 21, 2021

I'm closing this PR, with the comment that I made an issue here
cms-AlCaDB/AlCaTools#53
as a reminder that there is action needed from the PPS side for this to converge.

I'll reopen the PR in January when we have the 12_2_1 out.

@tvami tvami closed this Dec 21, 2021
@tvami tvami reopened this Jan 24, 2022
@tvami
Contributor Author

tvami commented Jan 24, 2022

I'm reopening this, as the new CMSSW is expected to be out very soon

@cmsdmwmbot

Replay testing PR '2018 replay for PPS PCL test'
An automatic replay has been requested by tvami.
Here is a brief description of the replay.
Deployment ID: 220124191653
Github PR: #4640
PR author: tvami
Requestor: AlCaDB
Injected runs: 324841
CMSSW release: CMSSW_12_2_0_pre3
Tier0 release: 3.0.1
ppScenario: ppEra_Run2_2018
Tier0 Config: https://cmst0.web.cern.ch/CMST0/tier0/offline_config/ReplayOfflineConfiguration_047.php
Contatiner ID: 1
Jenkins Build: https://cmssdt.cern.ch/dmwm-jenkins/job/DMWM-T0-PR-test-job/394/
Jira Issue : https://its.cern.ch/jira/browse/CMSTZDEV-716

@cmsdmwmbot

There are 313 filesets not closed.
There is 0 paused job in the replay.

@tvami
Contributor Author

tvami commented Jan 24, 2022

Hi @jhonatanamado @GermanGiraldo, I didn't actually mean to trigger the tests yet, only in a few days when 12_2_1 is out. Please have them aborted. Thanks!

@cmsdmwmbot

There are 17 repack workflows.
There are 5 express workflows.
There are 827 filesets not closed.
There is 1 paused job in the replay.

@jhonatanamado
Contributor

Hi @tvami, thanks for letting us know. We will fail all the workflows for this replay, and as soon as CMSSW_12_2_0_pre3 is ready you can trigger it again.

@tvami
Contributor Author

tvami commented Jan 25, 2022

New CMSSW is out, I'll submit a commit in a few minutes

@tvami
Contributor Author

tvami commented Jan 25, 2022

run replay please

@germanfgv
Contributor

run replay please

@cmsdmwmbot

Replay testing PR '2018 replay for PPS PCL test'
An automatic replay has been requested by germanfgv.
Here is a brief description of the replay.
Deployment ID: 220126071046
Github PR: #4640
PR author: tvami
Requestor: AlCaDB
Injected runs: 324841
CMSSW release: CMSSW_12_2_0_patch1
Tier0 release: 3.0.1
ppScenario: ppEra_Run2_2018
Tier0 Config: https://cmst0.web.cern.ch/CMST0/tier0/offline_config/ReplayOfflineConfiguration_047.php
Contatiner ID: 1
Jenkins Build: https://cmssdt.cern.ch/dmwm-jenkins/job/DMWM-T0-PR-test-job/399/
Jira Issue : https://its.cern.ch/jira/browse/CMSTZDEV-718

@cmsdmwmbot

There are 315 filesets not closed.
There is 0 paused job in the replay.

@cmsdmwmbot

There are 17 repack workflows.
There are 5 express workflows.
There are 854 filesets not closed.
There are 693 paused jobs in the replay.

@germanfgv
Contributor

germanfgv commented Jan 27, 2022

All jobs from StreamExpress workflow are paused due to the same issue

2022-01-26 08:15:54,557:INFO:Scram:Subprocess stdout was:
b'removing ENDJOB from steps since not compatible with DQMIO dataTier\
entry tobeoverwritten.xyz\
Step: RAW2DIGI Spec: \
Step: L1Reco Spec: \
Step: RECO Spec: \
Step: EI Spec: \
Step: ALCAPRODUCER Spec: ['SiStripPCLHistos', 'SiStripCalZeroBias', 'SiStripCalMinBias', 'SiStripCalMinBiasAAG', 'TkAlMinBias', 'LumiPixelsMinBias', 'SiPixelCalZeroBias', 'PPSCalTrackBasedSel', 'PPSTimingCalib', 'PPSAlignment']\
The following alcas could not be found ['PPSTimingCalib', 'PPSAlignment']\
available  ['TkAlMinBias', 'TkAlMuonIsolated', 'TkAlMuonIsolatedPA', 'TkAlZMuMu', 'TkAlDiMuonAndVertex', 'TkAlZMuMuPA', 'TkAlJpsiMuMu', 'TkAlUpsilonMuMu', 'TkAlUpsilonMuMuPA', 'SiPixelCalSingleMuon', 'SiPixelCalSingleMuonLoose', 'SiPixelCalSingleMuonTight', 'SiPixelCalCosmics', 'SiStripCalMinBias', 'SiStripCalSmallBiasScan', 'SiStripCalMinBiasAAG', 'SiStripCalCosmics', 'SiStripCalCosmicsNano', 'SiStripCalZeroBias', 'SiPixelCalZeroBias', 'LumiPixelsMinBias', 'AlCaPCCZeroBiasFromRECO', 'AlCaPCCRandomFromRECO', 'PPSCalTrackBasedSel', 'EcalCalZElectron', 'EcalCalWElectron', 'EcalUncalZElectron', 'EcalUncalWElectron', 'EcalESAlign', 'EcalTrg', 'EcalTestPulsesRaw', 'HcalCalDijets', 'HcalCalGammaJet', 'HcalCalHO', 'HcalCalHOCosmics', 'HcalCalIsoTrk', 'HcalCalIsoTrkFilter', 'HcalCalIsoTrkFilterNoHLT', 'HcalCalIsoTrkProducerFilter', 'HcalCalNoise', 'HcalCalIterativePhiSym', 'HcalCalIsolatedBunchFilter', 'HcalCalIsolatedBunchSelector', 'HcalCalHBHEMuonFilter', 'HcalCalHBHEMuonProducerFilter', 'HcalCalLowPUHBHEMuonFilter', 'HcalCalHEMuonFilter', 'HcalCalHEMuonProducerFilter', 'MuAlCalIsolatedMu', 'MuAlZMuMu', 'MuAlOverlaps', 'RpcCalHLT', 'TkAlCosmicsInCollisions', 'TkAlCosmics', 'TkAlCosmicsHLT', 'TkAlCosmics0T', 'TkAlCosmics0THLT', 'MuAlGlobalCosmics', 'MuAlGlobalCosmicsInCollisions', 'TkAlBeamHalo', 'MuAlBeamHalo', 'MuAlBeamHaloOverlaps', 'TkAlLAS', 'PromptCalibProdPPSTimingCalib', 'PromptCalibProdPPSDiamondSampicTimingCalib', 'PromptCalibProdPPSAlignment', 'PromptCalibProd', 'PromptCalibProdBeamSpotHP', 'PromptCalibProdBeamSpotHPLowPU', 'PromptCalibProdSiStrip', 'PromptCalibProdSiPixel', 'PromptCalibProdSiStripGains', 'PromptCalibProdSiStripGainsAAG', 'PromptCalibProdSiPixelLorentzAngle', 'PromptCalibProdSiPixelAli', 'SiStripPCLHistos', 'PromptCalibProdEcalPedestals', 'PromptCalibProdLumiPCC', 'Hotline', 'EcalCalEtaCalib', 'EcalCalPi0Calib', 'HcalCalMinBias', 'HcalCalPedestal', 'LumiPixels', 'AlCaPCCZeroBias', 'AlCaPCCRandom', 'RawPCCProducer']\
Failed to load process from Scenario ppEra_Run2_2018 (<Configuration.DataProcessing.Impl.ppEra_Run2_2018.ppEra_Run2_2018 object at 0x2ac484126580>).\

As can be seen, the issue is that the producers 'PPSTimingCalib' and 'PPSAlignment' cannot be found.

Logs can be found here:
/afs/cern.ch/user/c/cmst0/public/PausedJobs/PPSPCL/job_5201
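
The check reported in the log amounts to comparing the requested ALCAPRODUCER names against the list of available ones. The following is a minimal sketch of that comparison, not the actual CMSSW code: `find_missing` and `suggest` are hypothetical helpers, the names are copied from the log, and the available list is abbreviated to the entries relevant here.

```python
# Requested producers, copied from the "Step: ALCAPRODUCER Spec" log line.
requested = [
    "SiStripPCLHistos", "SiStripCalZeroBias", "SiStripCalMinBias",
    "SiStripCalMinBiasAAG", "TkAlMinBias", "LumiPixelsMinBias",
    "SiPixelCalZeroBias", "PPSCalTrackBasedSel", "PPSTimingCalib",
    "PPSAlignment",
]

# Abbreviated subset of the "available" list printed in the log.
available = [
    "TkAlMinBias", "SiStripCalMinBias", "SiStripCalMinBiasAAG",
    "SiStripCalZeroBias", "SiPixelCalZeroBias", "LumiPixelsMinBias",
    "PPSCalTrackBasedSel", "PromptCalibProdPPSTimingCalib",
    "PromptCalibProdPPSDiamondSampicTimingCalib",
    "PromptCalibProdPPSAlignment", "SiStripPCLHistos",
]

def find_missing(requested, available):
    """Return the requested names that are absent from the available list."""
    known = set(available)
    return [name for name in requested if name not in known]

def suggest(missing_name, available):
    """Return available names ending with the missing name (a lookup aid only)."""
    return [name for name in available if name.endswith(missing_name)]

missing = find_missing(requested, available)
suggestions = {name: suggest(name, available) for name in missing}
for name, candidates in suggestions.items():
    print(name, "->", candidates)
```

The suffix match only points at similarly named entries in the available list; whether those are the right replacements for this workflow is a separate question that the log does not answer.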

@germanfgv
Contributor

@tamas After your fixes, we hit a similar error. Now the system is unable to find PromptCalibProdPPS and PromptCalibProdPPSAlig:

The following alcas could not be found ['PromptCalibProdPPS', 'PromptCalibProdPPSAlig']

@tvami
Contributor Author

tvami commented Jan 31, 2022

@germanfgv ok I removed those, should we trigger tests again?

@germanfgv
Contributor

All but one of the jobs were successful. The failing job is from the StreamExpressAlignment workflow. The problem is increased memory consumption for a 1-core Express job, as can be seen here:

2022-02-01 03:58:25,423:ERROR:PerformanceMonitor:Error in CMSSW step cmsRun1
Number of Cores: 1
Job has exceeded maxPSS: 2355.2 MB
Job has PSS: 2553 MB

The tarball of the job can be found here:

/afs/cern.ch/user/c/cmst0/public/PausedJobs/PPSPCL/job_29139/tarball
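
To put the two PSS figures in proportion, the overshoot can be computed directly from the log excerpt. This is a minimal sketch: the `parse_mb` helper and its regular expressions are hypothetical and only match the three lines quoted above.

```python
import re

# Log lines copied from the PerformanceMonitor error above.
log_excerpt = """\
Number of Cores: 1
Job has exceeded maxPSS: 2355.2 MB
Job has PSS: 2553 MB
"""

def parse_mb(pattern, text):
    """Return the first number (in MB) captured by pattern, or raise."""
    match = re.search(pattern, text)
    if match is None:
        raise ValueError(f"pattern not found: {pattern}")
    return float(match.group(1))

max_pss = parse_mb(r"exceeded maxPSS:\s*([\d.]+)\s*MB", log_excerpt)
pss = parse_mb(r"Job has PSS:\s*([\d.]+)\s*MB", log_excerpt)

overshoot_mb = pss - max_pss
overshoot_pct = 100.0 * overshoot_mb / max_pss
print(f"{overshoot_mb:.1f} MB over the limit ({overshoot_pct:.1f}%)")
```

The job exceeded the limit by 197.8 MB, about 8.4% of maxPSS.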

@francescobrivio
Contributor

Hi @germanfgv thanks for reporting this!
Strange that the issue comes from StreamExpressAlignment since the producers for that stream were not touched in this PR. Could you specify if the failing job comes from TkAlMinBias or PromptCalibProdBeamSpotHP?

@tvami
Contributor Author

tvami commented Feb 2, 2022

Since the replay finished successfully for the PPS part, I'm closing this PR!

@boudoul

boudoul commented Feb 8, 2022

Hello, I was wondering whether the error reported above by @germanfgv and commented on by @francescobrivio is understood? Thanks

@malbouis
Contributor

malbouis commented Feb 9, 2022

https://hypernews.cern.ch/HyperNews/CMS/get/tier0-Ops/2326.html

Hi, since the stream with the error was not the main goal of this replay and runs without any problems in other replays, we did not give high priority to investigating the failure. But we have just asked @germanfgv which producer is raising the error, so that we can investigate it further.
