-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DQM: Remove more DQMStore features #28247
Changes from all commits
8798dec
24e4f03
3ad05b8
2b915c0
0eaf103
df779ac
8776387
fbe46ff
b9eced0
3ccb946
0d81dab
aceab97
65dda95
fe8e852
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -98,7 +98,6 @@ | |
# 2 provide more detailed output | ||
Frequency = cms.untracked.int32(50), | ||
MEPathToSave = cms.untracked.string('AlCaReco/SiStripGainsAAG'), | ||
deleteAfterCopy = cms.untracked.bool(True) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @tocheng @christopheralanwest @mmusich please suggest how to test this ALCAPROMPT, using
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Did you change the ALCA producer? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @fabiocos, the command is correct but the input dataset is not. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @mmusich with which looks to me comparable to your test at the file crossing boundaries |
||
) | ||
|
||
# The actual sequence | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -99,7 +99,6 @@ | |
# 2 provide more detailed output | ||
Frequency = cms.untracked.int32(50), | ||
MEPathToSave = cms.untracked.string('AlCaReco/SiStripGains'), | ||
deleteAfterCopy = cms.untracked.bool(True) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. on a different topic, is this supposed to be useless with the new DQM infrastructure? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. cf: #14198 (comment) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am not exactly sure what this workflow does, but in general the memory concerns at Tier0 are pretty much gone since we no longer use per-stream MEs. If the copy in this module (which hopefully does run in a HARVESTING step, and is not affected at all by the stream story) consumes too much memory, we need to find a different solution (like, book things in the right place, or read them from the right place, or rename things after the fact), but moving around stuff in the DQMStore is not something we can support for the future. Actually, the comment previous to the one you linked mentioned that this is a end-of-job spike, which from legends I know was one of the reasons to move towards the DQMIO format. I wonder if we should finally stop pretending to this code that it is still run 1 and move it to current infrastructure... Sadly, I have no real clue how these workflows work and it seems nobody else has either, so that will be hard. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I am not sure how relevant it is, as to the best of my knowledge Tier0 AlCaSkim and AlCaHarvest jobs are single-threaded.
I would be happy to take a look to achieve this. Can you explain what you mean by "current" infrastructure? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. "current": Not There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. so, I took the liberty of doing a bit of profiling.
and profiled the RSS memory: The spike at the end of the job is increased indeed, but I cannot comment if that is going to be a problem for Tier-0.
in order to be forced to produce a payload via: process.alcaSiStripGainsHarvester.GoodFracForTagProd = cms.untracked.double(0.0) #<- fraction of good fits (default is 98%)
process.alcaSiStripGainsHarvester.NClustersForTagProd = cms.untracked.double(100000) #<- total number of clusters (default used is 8e8) and the good news is that the payload hash is the same and I don't see differences in the harvested output (full diff here ): There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks @mmusich for the test! The observation is, I'd say, exactly compatible with what is done here: During processing, the memory use is the same, but the spike while That the produced payload is identical is good to know. |
||
) | ||
|
||
# The actual sequence | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -9,7 +9,6 @@ | |
# 2 provide more detailed output | ||
Frequency = cms.untracked.int32(50), | ||
MEPathToSave = cms.untracked.string('AlCaReco/SiStrip'), | ||
deleteAfterCopy = cms.untracked.bool(False) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @fabiocos, I am sorry to be picky, but I don't think
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @mmusich as far as I can see in also the Tracer dump it do runs both There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What about There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Or your test was limited to this particular file change? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @mmusich I believe that we are clearly missing a test that is realistically mimicking what happens nowadays at the T0 , not bits and pieces. My understanding from a discussion with @tocheng is that There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for the additional tests. I agree about the lack of a suitable relval test testing all the AlCa aspects of Express. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||
) | ||
|
||
seqALCARECOSiStripPCLHistos = cms.Sequence(MEtoEDMConvertSiStrip) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tocheng @christopheralanwest @schneiml this can be tested using wf 1010.0 . In its step3 the log file is flooded with
Please comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Repeating the test of @mmusich with wf 1010 step3 running 10000 events:
ecal.pdf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re the message see the comment I made in another location: the code that used to produce them is deleted in this PR.
Re the RSS graph: We see the usual MEtoEDM endJob spike, before as after the PR, doubling the memory use as all histograms are copied into a EDM product. Removing the delete-calls makes it an exact doubling, before it might have been a bit less, as some histos where deleted before saving.
This spike is why "normal" reco jobs switched to DQMIO saving in LS1 (which removes this spike alltogether), but for the ALCA workflows memory use was not considered a problem and they where never migrated. (Edit: This is just for context, this PR does not affect anything about the MEtoEDM procedure.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@schneiml the message is in the default version, this PR removes it, sorry for the possible confusion.
Concerning the RSS plot I do not see anything worrying, the behaviour old vs new looks quite similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fabiocos re the message: ok, that is expected, there is nothing deleted any more and ofc there is no reason to warn about deleting non-existing things then.
This might explain why the memory use is effectively unaffected: The delete calls were not doing anything to begin with.