Bad sizes in Popularity #36349
A new Issue was created by @joseflix Josep Flix, PhD. @Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
I think the issue is related to the info sent by CMSSW jobs via UDP to our collector. I suggest that somebody check the information provided by the jobs. |
assign core |
New categories assigned: core @Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks |
Thanks @vkuznet for the clarification on where the information originates from. While addressing #34873 we realized that the file size reported in the UDP packets was set in a non-thread-safe way (#34873 (comment)), which could cause problems like this. The overhaul of StatisticsSenderService in #35505 should have fixed this issue as well. That PR was merged in 12_1_0_pre5. Would it be useful in earlier (active) release cycles as well? |
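For illustration, the kind of race described above can be sketched as follows. This is a minimal Python sketch of the problematic pattern, not the actual CMSSW code (the real service is the C++ StatisticsSenderService): a single size field shared between concurrently open files can be overwritten by another thread before the report for the first file is assembled, so one file's record ends up carrying another file's size.

import threading
import time

# Problematic pattern (sketch): one shared "current file size" for all open files.
class SharedStats:
    def __init__(self):
        self.file_size = 0

shared = SharedStats()
reports = []

def open_and_report(lfn, size):
    shared.file_size = size   # may be overwritten by another thread...
    time.sleep(0.01)          # widen the window so the race is easy to see
    # ...before this thread assembles its report:
    reports.append((lfn, shared.file_size))

threads = [
    threading.Thread(target=open_and_report, args=("/store/fileA.root", 4191351315)),
    threading.Thread(target=open_and_report, args=("/store/fileB.root", 123456789)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(reports)  # at least one record now carries the other file's size

# Per-file bookkeeping removes the cross-talk between files:
per_file_size = {}
def open_and_report_fixed(lfn, size):
    per_file_size[lfn] = size
    reports.append((lfn, per_file_size[lfn]))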
Presumably these jobs are using older releases, given that they are Run 2 data. I agree it would be worth backporting this. (Not that backporting it will solve all incorrect results, but it would help.) |
@perrotta asked in #36355 (comment)
New releases will be needed for these changes to take effect. The changes themselves wouldn't necessarily warrant new (patch) releases, but perhaps we should in order to get the updated code to be run by WM? |
Backports down to 9_4_X have been done. |
The backport PRs have been merged. |
+1 |
This issue is fully signed and ready to be closed. |
Hi there, sorry, but is this already solved? I still see values which are not consistent in the popularity Kibana views, for example the snapshot I attach. In Popularity we don't have the CMSSW version used, so I am not sure whether this is because those jobs are using a CMSSW release where the backported fix was not applied. |
Old releases will not get this fix. So production/analysis work that has not moved forward to the fixed releases (done for all release cycles 2016+, if I understand correctly) will still give bad values. |
Hi @davidlange6, thanks for the answer. How can I know which release the jobs I report here were using? Maybe we should add the CMSSW version to the popularity records? |
Yeah, I don't see anything in https://github.com/cms-sw/cmssw/blob/master/Utilities/StorageFactory/src/StatisticsSenderService.cc from which this can be derived, unless there is something useful in the app_info field. |
For a limited number of cases DAS/DBS should tell the release a file/dataset was produced with, so at least for the example case you should be able to check that. For the longer term, adding a CMSSW release field to the popularity information looks straightforward. |
In DAS, getting the release info is as simple as a one-line query. In DBS it is also trivial, and it yields the release as well, though with three identical entries (I need to check why; most likely a loose constraint on the SQL query JOINs). |
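For illustration, a lookup along those lines can be done from the command line with dasgoclient. This is a hedged sketch, assuming a CMS software environment with dasgoclient on the PATH and a valid grid proxy; the exact query syntax may differ from what was used above. It first resolves the dataset from the LFN quoted in this issue, then asks for the release that dataset was produced with.

import subprocess

lfn = ("/store/hidata/HIRun2018A/HIMinimumBias16/MINIAOD/"
       "PbPb18_MiniAODv1-v1/30000/413a6fdc-4752-41ee-b53e-f221f854ad5e.root")

def das(query):
    # Thin wrapper around dasgoclient; assumes the tool exists and the proxy is valid.
    out = subprocess.run(["dasgoclient", "-query", query],
                         capture_output=True, text=True, check=True)
    return out.stdout.split()

dataset = das("dataset file=" + lfn)[0]        # dataset the file belongs to
releases = das("release dataset=" + dataset)   # CMSSW release(s) it was produced with
print(dataset, releases)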
You want the release used by the cmsRun job, not the release used to create the input dataset, no?
(But right, if you have the output dataset you could look it up that way.) |
Yes, I want the release used by cmsRun to open those files. If I am not mistaken, we don't have any information about the output files in the popularity DB, only the input files... |
Ah, good point. Ok, we can add the CMSSW version in the popularity data. Unfortunately, as David mentioned above, that will be available only in new releases (of old release cycles). |
#37220 adds CMSSW version to the UDP packets. |
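Conceptually the change amounts to adding one more field to the message sent over UDP. The snippet below is only a Python sketch of that idea, with assumed field names and an assumed collector endpoint; the real implementation lives in the C++ StatisticsSenderService, and #37220 is the reference for the actual schema.

import json
import os
import socket

# Illustrative payload; the field names and the collector host/port are
# assumptions for this sketch, not the real popularity schema.
payload = {
    "file_lfn": "/store/hidata/HIRun2018A/HIMinimumBias16/MINIAOD/PbPb18_MiniAODv1-v1/30000/413a6fdc-4752-41ee-b53e-f221f854ad5e.root",
    "file_size": 4191351315,
    "read_bytes": 123456789,
    # CMSSW jobs have CMSSW_VERSION set in their environment.
    "cmssw_version": os.environ.get("CMSSW_VERSION", "unknown"),
}

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(json.dumps(payload).encode(), ("udp-collector.example.org", 9331))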
We have detected that sometimes in the monit_prod_cmssw_pop_* records the files have a bad file size associated with them. For example [1]:
The actual file size according to DAS [2] is 4191351315 (4.2 GB). Sometimes there are accesses that show fewer bytes, always the same value. What could be the reason for this?
This particular file is on PIC Tier-1 storage, and there it has the correct value:
[root@dtn01 ~]# ls -ltr /pnfs/pic.es/data/cms/disk/store/hidata/HIRun2018A/HIMinimumBias16/MINIAOD/PbPb18_MiniAODv1-v1/30000/413a6fdc-4752-41ee-b53e-f221f854ad5e.root
-rw-r--r-- 1 cms001 cms 4191351315 Apr 5 2021 /pnfs/pic.es/data/cms/disk/store/hidata/HIRun2018A/HIMinimumBias16/MINIAOD/PbPb18_MiniAODv1-v1/30000/413a6fdc-4752-41ee-b53e-f221f854ad5e.root
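For comparison, the DAS size in [2] can also be cross-checked from the command line. A hedged sketch using dasgoclient (assuming a CMS environment with dasgoclient and a valid grid proxy; the filter follows the DAS query language):

import subprocess

lfn = ("/store/hidata/HIRun2018A/HIMinimumBias16/MINIAOD/"
       "PbPb18_MiniAODv1-v1/30000/413a6fdc-4752-41ee-b53e-f221f854ad5e.root")

# "file file=<lfn>" returns the file record; the grep filter keeps only the size.
out = subprocess.run(
    ["dasgoclient", "-query", "file file=" + lfn + " | grep file.size"],
    capture_output=True, text=True, check=True)
print(out.stdout.strip())  # should print 4191351315 for this file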
But we see this happening for many other files...
Valentin suggested to me that we should open a ticket here. This is part of the CMSSW popularity reporting, so it might need a deeper look.
[1] https://monit-kibana.cern.ch/kibana/app/kibana#/discover?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now%2Fy,to:now%2Fy))&_a=(columns:!(data.file_lfn,data.file_size,data.read_bytes,data.client_host),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'79cbcb40-e78f-11ea-966a-e1c0a7950cea',key:data.file_lfn,negate:!f,params:(query:%2Fstore%2Fhidata%2FHIRun2018A%2FHIMinimumBias16%2FMINIAOD%2FPbPb18_MiniAODv1-v1%2F30000%2F413a6fdc-4752-41ee-b53e-f221f854ad5e.root),type:phrase,value:%2Fstore%2Fhidata%2FHIRun2018A%2FHIMinimumBias16%2FMINIAOD%2FPbPb18_MiniAODv1-v1%2F30000%2F413a6fdc-4752-41ee-b53e-f221f854ad5e.root),query:(match:(data.file_lfn:(query:%2Fstore%2Fhidata%2FHIRun2018A%2FHIMinimumBias16%2FMINIAOD%2FPbPb18_MiniAODv1-v1%2F30000%2F413a6fdc-4752-41ee-b53e-f221f854ad5e.root,type:phrase))))),index:'79cbcb40-e78f-11ea-966a-e1c0a7950cea',interval:auto,query:(language:kuery,query:''),sort:!(metadata.timestamp,desc))
[2] https://cmsweb.cern.ch/das/request?view=list&limit=50&instance=prod%2Fglobal&input=%2Fstore%2Fhidata%2FHIRun2018A%2FHIMinimumBias16%2FMINIAOD%2FPbPb18_MiniAODv1-v1%2F30000%2F413a6fdc-4752-41ee-b53e-f221f854ad5e.root