-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactoring gen weight storage in EDM + Nano integration #32167
Conversation
-code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-32167/19876 ERROR: Build errors found during clang-tidy run.
|
-code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-32167/19879 ERROR: Build errors found during clang-tidy run.
|
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-32167/19884
|
A new Pull Request was created by @kdlong (Kenneth Long) for master. It involves the following packages: DataFormats/NanoAOD @SiewYan, @perrotta, @gouskos, @mkirsano, @cmsbuild, @jpata, @fgolf, @slava77, @alberto-sanchez, @agrohsje, @mariadalfonso, @santocch, @GurpreetSinghChahal can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
please test |
The tests are being triggered in jenkins.
|
-1 Tested at: 45323bb CMSSW: CMSSW_11_2_X_2020-11-17-2300 I found follow errors while testing this PR Failed tests: HeaderConsistency ClangBuild
I found compilation warning while trying to compile with clang. Command used:
See details on the summary page. |
Comparison job queued. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @kdlong, I was looking at this again for private use (since the gen weight situation in UL is even messier than pre-L...) and noticed a few bugs and typos. Just leaving these notes here for whenever you have time to get back to it.
{ErrorType::SwapHeader, "Header info out of order"}, | ||
{ErrorType::HTMLStyle, "Header is invalid HTML"}, | ||
{ErrorType::TrailingStr, "Header has extraneous info"}, | ||
{ErrorType::Unknown, "Unregonized error"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: Unregonized -> Unrecognized
if (debug_) | ||
std::cout << "Weight type is unknown\n"; | ||
|
||
std::cout << "Group name is " << weight.groupname << std::endl; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be behind the debug_
flag?
std::string labels = "PS weights (w_var / w_nominal); "; | ||
if (pswV.isWellFormed()) { | ||
double baseline = psWeights.at(pswV.weightIndexFromLabel("Baseline")); | ||
psTosave.emplace_back(psWeights.at(pswV.variationIndex(true, false, gen::PSVarType::def)) / baseline); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this line and the next variationIndex()
call are incorrect according to
int isrCombinedDownIndex(PSVarType variationType) const { return variationIndex(true, false, variationType); } |
true, false
is isrCombinedDownIndex()
and false, false
is fsrCombinedDownIndex()
). instead, isr/fsrCombinedUpIndex()
should be used on these lines.
desc.addUntracked<bool>("saveProvenance", true) | ||
->setComment("Save process provenance information, e.g. for edmProvDump"); | ||
desc.addUntracked<bool>("fakeNameForCrab", false) | ||
->setComment( | ||
"Change the OutputModule name in the fwk job report to fake PoolOutputModule. This is needed to run on cran " | ||
"Change the OutputModule name in the fwk job report to fake " | ||
"PoolOutputModule. This is needed to run on cran " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: cran -> crab
void WeightHelper::cleanupOrphanCentralWeight(WeightGroupInfoContainer& weightGroups) const { | ||
auto centralIt = std::find_if(std::begin(weightGroups), std::end(weightGroups), [](auto& entry) { | ||
return entry->weightType() == gen::WeightType::kScaleWeights && | ||
static_cast<ScaleWeightGroupInfo*>(entry.get())->containsCentralWeight(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the logic here is incorrect: this should select for !containsCentralWeight()
, in order to find the scale weight group that is missing the central weight (because it was orphaned). (I had to change this when testing privately, or the resulting scale weight group did not qualify as well formed.)
int close = -1; | ||
for (size_t idx = 0; idx < headerLines.size(); idx++) { | ||
std::string& line = headerLines[idx]; | ||
std::cout << "Line is " << line << std::endl; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be behind debug_
flag
-reconstruction cleaning reco queue |
Milestone for this pull request has been moved to CMSSW_14_0_X.Please open a backport if it should also go in to CMSSW_13_3_X. |
please close |
@@ -299,41 +327,42 @@ void NanoAODOutputModule::openFile(edm::FileBlock const&) { | |||
m_file->SetCompressionAlgorithm(ROOT::kZLIB); | |||
} else if (m_compressionAlgorithm == std::string("LZMA")) { | |||
m_file->SetCompressionAlgorithm(ROOT::kLZMA); | |||
} else if (m_compressionAlgorithm == std::string("ZSTD")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this an artifact of some merge history issue or do we want to remove these compression algos?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That must be a merge mistake, thanks for spotting!
PR description:
This PR introduces new weight products to hold the Meta Info of the weights, and the weights themselves, organized into groups of a given type (Scale, Pdf, MEParam, etc.)
This has been presented several times in GEN and X-POG, such as https://indico.cern.ch/event/944653/#5-status-of-new-data-structure
The structure of the products is in place. We are still finalizing the NanoAOD integration and testing for parsing corner cases. We expect these tests to be wrapped up in a couple of weeks, but because of the scale of the changes, it would be useful to have feedback already (I'm sure there will be lost of coding suggestions).
This builds on the work of several people, including @dteague @sroychow @guitargeek @cericeci
@agrohsje and @kpedro88 have contributed to the design and review.
PR validation:
We have validated MiniAOD --> NanoAOD, adding products to the event, and GEN-->NanoAOD instantiating products to files. A large testing campaign will take place before NanoAOD v9 and we will report the status here.
NanoAOD and MiniAOD file type size changes
The size of the Nano will in general not be changed, by default the same number of PDF and scale weights are stored. The main change is to make the selection of weights to be stored much more convenient, significantly reduce the number of samples where the weights are not found and not stored (makes those individual samples larger but not a big impact on the average), and provides features for keeping more weights in private production.
The change in size of the MiniAOD will depend on the strategy we adopt, which I'd like a bit more feedback on. If we duplicate the event weight information, it is a 10% increase. I assume this is not under consideration. If we zero the weights in the existing product (produce a new LHEEventProduct with only the central weight for example), the change will be negligible. If we only ever produce the product on the fly the size change will be 0.
Perhaps a good option would be to keep the duplicate product at the gen data tier, and zero it out at the MiniAOD step. This requires fixing code that depends on the weights in the LHEEventProduct (Rivet, for example, https://github.com/cms-sw/cmssw/blob/master/GeneratorInterface/RivetInterface/plugins/RivetAnalyzer.cc, but the change is straightforward) and in updating the name of the LHEEventProduct anywhere it is read if it is produced by a new producer (copy previous with zeroed weights). I guess this could be tedious.