Chirp-based HTCondor updates for 7_4_X #10419
This commit provides a new default service, CondorStatusService, which automatically reports basic progress statistics (# events,

The service automatically detects whether it is running as part of an HTCondor job and probes whether HTCondor has user-level updates enabled (and is a sufficient version). If either check fails, the service does not register any callbacks with the framework and effectively becomes a no-op.

To see if it is within an HTCondor job, the service looks for the _CONDOR_CHIRP_CONFIG environment variable, which is a very cheap check. To see if HTCondor supports this feature, it spawns a new process and looks at the exit code (more expensive).

This uses the 'set_job_attr_delayed' mechanism of HTCondor, which causes these updates to 'tag along' with the existing updates for memory / disk / CPU usage. Hence, the extra cost is the few extra bytes in the update packet that goes out once every 5 minutes. The update only goes as far as the local daemon; condor_chirp does not wait for it to propagate to the remote host. Hence, it exits rapidly.

CMSSW will update at most once every updateIntervalSeconds (defaults to 15 minutes). The HTCondor worker node and central components all have additional rate limiting mechanisms to prevent overload.
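The two-stage availability check and the update interval described above can be sketched roughly as follows. This is an illustration in Python, not the service's actual C++ code: the function `chirp_usable`, the class `UpdateThrottle`, and `DEMO_PROBE` are hypothetical names, and the probe command is passed in by the caller because the commit message does not spell out the exact command that is spawned.

```python
import os
import subprocess
import sys
import time

CHIRP_ENV_VAR = "_CONDOR_CHIRP_CONFIG"  # variable named in the commit message

# Stand-in probe for experimentation only (assumption): a child process that
# exits with code 0. The real service spawns an HTCondor tool instead.
DEMO_PROBE = [sys.executable, "-c", "raise SystemExit(0)"]


def chirp_usable(probe_cmd, env=None):
    """Return True only if both checks pass; otherwise the caller stays a no-op."""
    env = os.environ if env is None else env
    # Cheap check: are we inside an HTCondor job at all?
    if CHIRP_ENV_VAR not in env:
        return False
    # Expensive check: spawn the probe and look only at its exit code.
    try:
        result = subprocess.run(
            probe_cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except OSError:
        # The probe binary is missing or not executable.
        return False
    return result.returncode == 0


class UpdateThrottle:
    """Allow at most one update per interval (15 minutes by default)."""

    def __init__(self, interval_seconds=15 * 60, clock=time.monotonic):
        self._interval = interval_seconds
        self._clock = clock
        self._last = None  # time of the last permitted update

    def should_update(self):
        now = self._clock()
        if self._last is not None and now - self._last < self._interval:
            return False
        self._last = now
        return True
```

Running the cheap environment check first means that jobs outside HTCondor never pay for the process spawn.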
This commit adds a few more reported attributes:
- ChirpCMSSWFinished: Unix timestamp of when the job finished.
- ChirpCMSSWLastUpdate: Unix timestamp of when the last update occurred.
- ChirpCMSSWMaxEvents: Maximum number of input events CMSSW is configured to process. From the process's maxEvents pset.
- ChirpCMSSWMaxLumis: Maximum number of lumis CMSSW is configured to process. From the process's maxLuminosityBlocks pset and some simple processing of the source.

If no max is configured, the attribute is not reported.

The motivation behind these attributes is that they:
- Simplify deadlock detection. Using the defaults, the LastUpdate attribute should never be more than 30 minutes old if Finished isn't set.
- Provide simple estimates of percent completion. When we can determine the number of events or lumis to process (something we can do for the majority of the grid use cases), we'll be able to determine the number of events/lumis left to process and an aggregate event/lumi processing rate.
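As a rough illustration of how a monitoring script might consume these attributes, here is a minimal Python sketch. It assumes the job's attributes have already been fetched into a plain dict; the function names are hypothetical, and `events_done` is passed in separately because the name of the processed-events attribute is not given above.

```python
def looks_stuck(ad, now, max_silence_seconds=30 * 60):
    """Flag a job whose last update is older than the allowed silence.

    `ad` is a dict of job attributes; `now` is a Unix timestamp.
    """
    if ad.get("ChirpCMSSWFinished") is not None:
        return False  # the job reported that it finished
    last_update = ad.get("ChirpCMSSWLastUpdate")
    if last_update is None:
        return False  # nothing reported yet, so there is nothing to judge
    return now - last_update > max_silence_seconds


def fraction_complete(events_done, ad):
    """Estimate completion, or return None when no maximum was reported."""
    max_events = ad.get("ChirpCMSSWMaxEvents")
    if max_events is None or max_events <= 0:
        return None
    return min(events_done / max_events, 1.0)
```

The 30-minute default matches the bound quoted in the commit message; a site could reasonably choose a looser threshold.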
If a single HTCondor job has multiple CMSSW processes run sequentially (e.g., a GEN-SIM step followed by DIGI-RECO in the same job), then attributes set at the end of one process will still be around when the next one starts up. Hence, we overwrite all attributes at startup, even if we don't have meaningful values yet (we don't know the maxEvents until BeginJob).

Notice we set ChirpCMSSWDone last; this is because the attribute setting is ordered but not atomic. Hence, we can write policies and prefix them with:

(ChirpCMSSWDone=!=true) && ...

and know that any further attribute references will belong to the currently running job.
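The ordering argument can be illustrated with a small Python model. It is a sketch under stated assumptions: the job's attributes are modeled as a dict, the placeholder values are invented for the example, and `startup_writes` and `guarded` are hypothetical helpers rather than anything in CMSSW or HTCondor. The guard mirrors the `=!=` comparison by treating an undefined ChirpCMSSWDone as "not done".

```python
DONE = "ChirpCMSSWDone"


def startup_writes(placeholders):
    """Return the ordered attribute writes for process startup, Done last."""
    writes = [(name, value) for name, value in placeholders.items() if name != DONE]
    writes.append((DONE, False))
    return writes


def guarded(ad, policy):
    """Evaluate `policy` only when Done is not exactly True."""
    return ad.get(DONE) is not True and policy(ad)


# Attributes left behind by a previous process in the same job.
stale_ad = {DONE: True, "ChirpCMSSWMaxEvents": 500}
# Placeholder values are assumptions for illustration only.
fresh = {"ChirpCMSSWMaxEvents": -1, "ChirpCMSSWLastUpdate": 0}


def policy(ad):
    # Example policy: fire when the stale maximum of 500 is still visible.
    return ad.get("ChirpCMSSWMaxEvents") == 500


observed = []
for name, value in startup_writes(fresh):
    stale_ad[name] = value
    observed.append(guarded(stale_ad, policy))
```

Because the Done flag left by the previous process stays true until the final write, the guard suppresses the policy at every intermediate step, and by the time it opens all other attributes have already been overwritten.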
A new Pull Request was created by @bbockelm (Brian Bockelman) for CMSSW_7_4_X.

Chirp cmssw updates 74x backport

It involves the following packages:

FWCore/Framework

@cmsbuild, @smuzaffar, @Dr15Jones can you please review it and eventually sign? Thanks.
please test

The tests are being triggered in jenkins.

@davidlange6 - what are your thoughts on this one? I'd like to get testing rolling (and Dirk suggested it would be easiest to test at the T0 on the 7_4_X branch).
Backport of #10056 to CMSSW_7_4_X.