Chirp-based HTCondor updates for 7_4_X #10419

Merged

bbockelm

Backport of #10056 CMSSW_7_4_X.

This commit provides a new default service, CondorStatusService,
which automatically reports basic progress statistics (# events,

The service automatically detects whether it is running as part of
an HTCondor job and probes whether HTCondor has user-level updates
enabled (and is a sufficiently recent version).  If either check fails,
the service does not register any callbacks with the framework and
effectively becomes a no-op.

To see if it is within an HTCondor job, the service looks for the
_CONDOR_CHIRP_CONFIG environment variable - a very cheap check.
To see if HTCondor supports this feature, it spawns a new process
and looks at the exit code (more expensive).
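
The two-stage check described above can be sketched in Python (the service itself is part of CMSSW, so this is only an illustration, not the code in this PR). The helper names and the probe command are assumptions; the description does not say which command the service spawns.

```python
import os
import subprocess


def in_condor_job(env=None):
    """Cheap check: is the chirp configuration variable present?"""
    env = os.environ if env is None else env
    return "_CONDOR_CHIRP_CONFIG" in env


def chirp_supported(probe_cmd):
    """Expensive check: spawn a process and look only at its exit code."""
    try:
        result = subprocess.run(
            probe_cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # The probe binary is missing or cannot be executed.
        return False
    return result.returncode == 0


def should_enable(env=None,
                  probe_cmd=("condor_chirp", "set_job_attr_delayed",
                             "ChirpCMSSWProbe", "0")):
    """Run the cheap check first; only spawn a process if it passes.

    The default probe command and the ChirpCMSSWProbe attribute name
    are hypothetical placeholders.
    """
    return in_condor_job(env) and chirp_supported(list(probe_cmd))
```

Because `and` short-circuits, the process is never spawned outside an HTCondor job, which keeps the common non-grid case as cheap as an environment lookup.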

This uses the 'set_job_attr_delayed' mechanism of HTCondor,
which causes these updates to 'tag along' with the existing updates
for memory / disk / CPU usage.  Hence, the extra cost is the few
extra bytes in the update packet that goes out once every 5 minutes.
The update only goes as far as the local daemon; condor_chirp does
not wait for it to propagate to the remote host.  Hence, it exits
rapidly.
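
As a rough illustration of what one such update looks like, the sketch below builds a `condor_chirp set_job_attr_delayed` invocation for a single attribute. The helper names, the value formatting, and the injectable `runner` are assumptions made for the example; only the `ChirpCMSSW` attribute prefix and the chirp sub-command come from the description above.

```python
import subprocess


def chirp_command(name, value, prefix="ChirpCMSSW"):
    """Build the argument list for one delayed job-attribute update."""
    # bool must be tested before int because bool is a subclass of int.
    if isinstance(value, bool):
        text = "true" if value else "false"
    else:
        text = str(int(value))
    return ["condor_chirp", "set_job_attr_delayed", prefix + name, text]


def send_update(name, value, runner=subprocess.call):
    """Hand the update to the local daemon; True if the call exited with 0.

    `runner` is injectable so the function can be exercised without
    HTCondor installed.
    """
    return runner(chirp_command(name, value)) == 0
```

For example, `chirp_command("LastUpdate", 1438387200)` yields `["condor_chirp", "set_job_attr_delayed", "ChirpCMSSWLastUpdate", "1438387200"]`.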

CMSSW will only update once every updateIntervalSeconds (defaults
to 15 minutes).  The HTCondor worker node and central components
all have additional rate limiting mechanisms to prevent overload.
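
The client-side rate limit can be pictured as a simple time gate. This is a minimal sketch, not the service's code; the class name and the injectable clock are assumptions, and only the 15-minute default comes from the text.

```python
import time


class UpdateThrottle:
    """Allow at most one update per `interval_seconds`."""

    def __init__(self, interval_seconds=15 * 60, clock=time.monotonic):
        self.interval = interval_seconds
        self.clock = clock
        self.last = None  # time of the last permitted update

    def ready(self):
        """Return True and start a new interval if an update is allowed."""
        now = self.clock()
        if self.last is None or now - self.last >= self.interval:
            self.last = now
            return True
        return False
```

A caller would check `throttle.ready()` on each progress callback and skip the chirp call whenever it returns `False`.
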

This commit adds a few more reported attributes:

- ChirpCMSSWFinished: Unix timestamp of when the job has finished.
- ChirpCMSSWLastUpdate: Unix timestamp of when the last update occurred.
- ChirpCMSSWMaxEvents: Maximum number of input events CMSSW is
  configured to process.  From the process's maxEvents pset.
- ChirpCMSSWMaxLumis: Maximum number of lumis CMSSW is configured
  to process.  From process's maxLuminosityBlocks pset and some
  simple processing of the source.

If no max is configured, the attribute is not reported.

The motivation behind these attributes is that they:
- Simplify deadlock detection.  Using the defaults, the LastUpdate
  attribute should never be more than 30 minutes old if Finished isn't set.
- Provide simple estimates of percent completion.  When we can determine
  the number of events or lumis to process (something we can for the
  majority of the grid use cases), we'll be able to determine the
  number of events/lumis left to process and an aggregate event/lumi
  processing rate.
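
Both uses can be sketched on the monitoring side, treating a job ad as a plain dictionary. The function names and the `events_done` argument are assumptions for illustration (the attribute carrying the processed-event count is not named above); the three attribute names and the 30-minute threshold come from the text.

```python
def looks_deadlocked(ad, now, threshold_seconds=30 * 60):
    """True if the job is unfinished and its last update is too old."""
    if "ChirpCMSSWFinished" in ad:
        return False
    last = ad.get("ChirpCMSSWLastUpdate")
    if last is None:
        # No update reported yet, so there is nothing to judge.
        return False
    return now - last > threshold_seconds


def fraction_done(ad, events_done):
    """Estimate completion as a fraction, or None if no max is reported."""
    max_events = ad.get("ChirpCMSSWMaxEvents")
    if not max_events or max_events <= 0:
        return None
    return min(1.0, events_done / max_events)
```

Returning `None` when `ChirpCMSSWMaxEvents` is absent mirrors the rule that the attribute is not reported if no maximum is configured.
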

If a single HTCondor job has multiple CMSSW processes run sequentially
(e.g., GEN-SIM step followed by DIGI-RECO in the same job), then
attributes set at the end of one process will still be around when
the next one starts up.  Hence, we overwrite all attributes at
startup, even if we don't have meaningful values yet (we don't know
the maxEvents until BeginJob).

Notice we set ChirpCMSSWDone last; this is because the attribute
setting is ordered but not atomic.  Hence, we can write policies
and prefix them with:

(ChirpCMSSWDone=!=true) && ...

and know that any further attribute references will belong to the
current running job.
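
To show why the ordering matters, here is a small simulation with the job ad as a dictionary. The `-1` placeholder values and the function names are assumptions for the example; the point is only that the guard stays false until every stale attribute has been overwritten.

```python
def startup_sequence(now):
    """Ordered startup updates; ChirpCMSSWDone goes last.

    The -1 placeholders are hypothetical stand-ins for "no meaningful
    value yet".
    """
    return [
        ("ChirpCMSSWLastUpdate", now),
        ("ChirpCMSSWMaxEvents", -1),
        ("ChirpCMSSWMaxLumis", -1),
        ("ChirpCMSSWDone", False),
    ]


def belongs_to_running_process(ad):
    """Python analogue of the ClassAd guard (ChirpCMSSWDone =!= true).

    A missing attribute counts as "not true", as it does with =!=.
    """
    return ad.get("ChirpCMSSWDone") is not True


def replay(ad, updates):
    """Apply updates one by one, recording the guard after each step."""
    guards = []
    for name, value in updates:
        ad[name] = value
        guards.append(belongs_to_running_process(ad))
    return guards
```

Starting from a stale ad left by a previous process (`ChirpCMSSWDone` still true), `replay` reports the guard as false after each of the first three updates and true only after the last one, so a policy behind the guard never sees a mix of old and new values.
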
@cmsbuild

A new Pull Request was created by @bbockelm (Brian Bockelman) for CMSSW_7_4_X.

Chirp cmssw updates 74x backport

It involves the following packages:

FWCore/Framework
FWCore/Services

@cmsbuild, @smuzaffar, @Dr15Jones can you please review it and eventually sign? Thanks.
@Martin-Grunewald, @wddgit, @wmtan this is something you requested to watch as well.
You can sign-off by replying to this message having '+1' in the first line of your reply.
You can reject by replying to this message having '-1' in the first line of your reply.
If you are a L2 or a release manager you can ask for tests by saying 'please test' in the first line of a comment.
@Degano you are the release manager for this.
You can merge this pull request by typing 'merge' in the first line of your comment.

@Dr15Jones

please test

@cmsbuild

The tests are being triggered in jenkins.

@bbockelm

@davidlange6 - what are your thoughts on this one? I'd like to get testing rolling (and Dirk suggested it would be easiest to test at the T0 on the 7_4_X branch).

davidlange6 added a commit that referenced this pull request Aug 1, 2015
@davidlange6 davidlange6 merged commit 339f484 into cms-sw:CMSSW_7_4_X Aug 1, 2015