Chirp-based HTCondor updates for 7_5_X #10420
This commit provides a new default service, CondorStatusService, which automatically reports basic progress statistics (# events,
The service automatically detects whether it is running as part of an HTCondor job and probes to see whether HTCondor has user-level updates enabled (and is a sufficiently recent version). If either check fails, the service does not register any callbacks with the framework and effectively becomes a no-op.
To see if it is within an HTCondor job, the service looks for the _CONDOR_CHIRP_CONFIG environment variable, which is a very cheap check. To see if HTCondor supports this feature, it spawns a new process and looks at the exit code (more expensive).
This uses the 'set_job_attr_delayed' mechanism of HTCondor, which causes these updates to 'tag along' with the existing updates for memory / disk / CPU usage. Hence, the extra cost is the few extra bytes in the update packet that goes out once every 5 minutes. The update only goes as far as the local daemon; condor_chirp does not wait for it to propagate to the remote host, so it exits rapidly.
CMSSW will only send an update once every updateIntervalSeconds (defaults to 15 minutes). The HTCondor worker node and central components all have additional rate limiting mechanisms to prevent overload.
(cherry picked from commit 17a9f62)
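The detection and rate-limiting behaviour described in this commit message can be sketched roughly as follows. This is a Python illustration only, not the C++ service itself. It assumes condor_chirp is on PATH, the probe call (writing ChirpCMSSWLastUpdate as a test attribute) is a guess at what the real probe does, and the names StatusReporter, run_chirp and ChirpCMSSWEvents are hypothetical.

```python
import os
import subprocess
import time

CHIRP_ENV_VAR = "_CONDOR_CHIRP_CONFIG"  # variable named in the commit message


def run_chirp(args, chirp_cmd="condor_chirp"):
    """Run condor_chirp and return its exit code (assumes it is on PATH)."""
    try:
        done = subprocess.run([chirp_cmd, *args],
                              stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL)
        return done.returncode
    except OSError:
        return 127  # binary not found: treat as "feature not supported"


class StatusReporter:
    """Hypothetical stand-in for the service's enable/rate-limit logic."""

    def __init__(self, environ=None, runner=run_chirp,
                 interval_seconds=15 * 60, clock=time.time):
        environ = os.environ if environ is None else environ
        self._runner = runner
        self._interval = interval_seconds
        self._clock = clock
        self._last_update = None
        # Cheap environment check first; the more expensive process
        # spawn only happens when that check passes.
        # The probe arguments are an assumption for illustration.
        self.enabled = (CHIRP_ENV_VAR in environ and
                        runner(["set_job_attr_delayed",
                                "ChirpCMSSWLastUpdate", "0"]) == 0)

    def update(self, attrs):
        """Send attrs unless disabled or the interval has not yet elapsed."""
        if not self.enabled:
            return False  # no-op outside a capable HTCondor job
        now = self._clock()
        if (self._last_update is not None
                and now - self._last_update < self._interval):
            return False  # rate limited
        for name, value in attrs.items():
            self._runner(["set_job_attr_delayed", name, str(value)])
        self._last_update = now
        return True
```

Injecting the runner and clock keeps the sketch testable without an HTCondor installation; a real caller would leave both at their defaults.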
This commit adds a few more reported attributes:
- ChirpCMSSWFinished: Unix timestamp of when the job finished.
- ChirpCMSSWLastUpdate: Unix timestamp of when the last update occurred.
- ChirpCMSSWMaxEvents: Maximum number of input events CMSSW is configured to process. From the process's maxEvents pset.
- ChirpCMSSWMaxLumis: Maximum number of lumis CMSSW is configured to process. From the process's maxLuminosityBlocks pset and some simple processing of the source. If no max is configured, the attribute is not reported.
The motivation behind these attributes is that they:
- Simplify deadlock detection. Using the defaults, the LastUpdate attribute should never be more than 30 minutes old if Finished isn't set.
- Provide simple estimates of percent completion. When we can determine the number of events or lumis to process (something we can do for the majority of the grid use cases), we'll be able to determine the number of events/lumis left to process and an aggregate event/lumi processing rate.
(cherry picked from commit ddf7626)
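As a rough illustration of how a monitoring script might consume these attributes, here is a minimal Python sketch. It assumes the job ad has already been fetched into a plain dict, and the attribute holding the processed-event count (ChirpCMSSWEvents here) is a hypothetical name that this commit message does not give.

```python
def looks_deadlocked(ad, now, max_silence_seconds=30 * 60):
    """True if the job has not finished and its last update is too old."""
    if ad.get("ChirpCMSSWFinished") is not None:
        return False  # job reported completion
    last = ad.get("ChirpCMSSWLastUpdate")
    if last is None:
        return False  # never reported, so staleness cannot be judged
    return now - last > max_silence_seconds


def percent_complete(ad, events_key="ChirpCMSSWEvents"):
    """Estimate completion from processed vs. maximum events.

    events_key is an assumed attribute name. Returns None when no
    maximum is reported, matching the 'attribute is not reported' case.
    """
    max_events = ad.get("ChirpCMSSWMaxEvents")
    processed = ad.get(events_key)
    if max_events is None or max_events <= 0 or processed is None:
        return None
    return min(100.0, 100.0 * processed / max_events)
```

The 30 minute threshold is the figure quoted in the commit message for the default settings; a site using a different updateIntervalSeconds would presumably scale it accordingly.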
(cherry picked from commit dfb92ed)
If a single HTCondor job runs multiple CMSSW processes sequentially (e.g., a GEN-SIM step followed by DIGI-RECO in the same job), then attributes set at the end of one process will still be around when the next one starts up. Hence, we overwrite all attributes at startup, even if we don't have meaningful values yet (we don't know the maxEvents until BeginJob).
Notice we set ChirpCMSSWDone last; this is because the attribute setting is ordered but not atomic. Hence, we can write policies and prefix them with: (ChirpCMSSWDone=!=true) && ... and know that any further attribute references will belong to the currently running job.
(cherry picked from commit afb5fd0)
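The "set ChirpCMSSWDone last" ordering and the policy prefix can be sketched as below. This is a Python illustration under assumptions: the helper names are hypothetical, and writing the literal values true/false for ChirpCMSSWDone is a guess about what the service sends.

```python
DONE_ATTR = "ChirpCMSSWDone"


def ordered_updates(attrs, done):
    """Return (name, value) pairs with ChirpCMSSWDone always written last.

    Updates are ordered but not atomic, so a reader that sees the new
    Done value can assume the attributes written before it are current.
    """
    updates = [(name, value) for name, value in attrs.items()
               if name != DONE_ATTR]
    updates.append((DONE_ATTR, "true" if done else "false"))
    return updates


def guard_policy(expr):
    """Prefix a policy expression with the Done guard from the commit message."""
    return "(%s =!= true) && (%s)" % (DONE_ATTR, expr)
```

For example, guard_policy("ChirpCMSSWMaxEvents > 0") yields an expression that only evaluates the attribute test while the current process has not marked itself done.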
(cherry picked from commit a23fc31)
(cherry picked from commit 044401f)
(cherry picked from commit cdf7513)
(cherry picked from commit acd0eea)
A new Pull Request was created by @bbockelm (Brian Bockelman) for CMSSW_7_5_X.
Chirp-based HTCondor updates for 7_5_X
It involves the following packages:
FWCore/Framework
@cmsbuild, @smuzaffar, @Dr15Jones can you please review it and eventually sign? Thanks.
please test
The tests are being triggered in jenkins.
Comparison is ready. The workflows 140.53 have different files in step1_dasquery.log than the ones found in the baseline. You may want to check and retrigger the tests if necessary. You can check it in the "files" directory in the results of the comparisons.
Backport of #10056 for CMSSW_7_5_X.