Incorrect thresholds printed from stat_analysis #1506
-
Hi John, and thank you for your question. Unfortunately, an issue like the one you've described can be very elusive: a problem that can't be reproduced is hard to track down. While there may well be a fix for this in the MET code, the time and cost of pinning down the exact cause are prohibitive at the moment. It's quite possible that the task load is causing it, but we can't say so with certainty. If this becomes a consistent, repeatable issue, we will look at the environment and the commands being run and try to find a solution. In the meantime, if you find that a lower task count doesn't trigger these issues, I would suggest continuing at that lower task load.
-
@johnlwagner I did talk with @j-opatz about this topic and am obviously concerned. I’ve been thinking about it for a few days but haven’t come up with a great way of approaching the problem yet. Two potential causes that might manifest themselves this way are uninitialized memory and array overflow, but I don’t have great confidence in that “diagnosis”. It’s also possible that MET 10.1.0, released yesterday, fixes this behavior, but that may just be wishful thinking. At this point, I’d like to grab some of the data you’re using as input to Stat-Analysis, along with the job or config file you’re running. Can you please point me to those details on WCOSS? I can recompile MET locally with flags that initialize memory to unusual values, which sometimes helps make memory issues more repeatable.
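A complementary check that needs no recompiling, offered only as a rough sketch: on Linux systems using glibc, the `MALLOC_PERTURB_` environment variable fills heap memory with a nonzero byte pattern on allocation and free, which can sometimes turn an intermittent bad read into a consistent one. The harness below reruns a command under several perturbation values and checks whether the output changes between runs. The names `run_repeatedly` and `is_repeatable` are hypothetical, and it assumes the job is safe to rerun and writes its results either to stdout or to a single output file.

```python
import hashlib
import os
import subprocess


def run_repeatedly(cmd, perturb_values=(1, 85, 170, 255), output_file=None):
    """Run cmd once per MALLOC_PERTURB_ value; return {value: sha256 digest}.

    If output_file is given, the digest covers that file's bytes after each
    run; otherwise it covers the command's stdout.
    """
    digests = {}
    for value in perturb_values:
        # glibc reads MALLOC_PERTURB_ (1-255) at startup; 0 disables it.
        env = dict(os.environ, MALLOC_PERTURB_=str(value))
        result = subprocess.run(cmd, env=env, capture_output=True, check=True)
        if output_file is not None:
            with open(output_file, "rb") as f:
                payload = f.read()
        else:
            payload = result.stdout
        digests[value] = hashlib.sha256(payload).hexdigest()
    return digests


def is_repeatable(digests):
    """True when every run produced byte-identical output."""
    return len(set(digests.values())) == 1
```

To use it, pass your own stat_analysis command line as a list of arguments and set `output_file` to the file the job writes. If the digests differ across perturbation values, that would point toward a memory initialization problem rather than node resources. Identical digests do not rule one out.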
-
Sorry for the delay on this. We continue to reprocess our data, but we
have not been able to repeat the error yet.
I did set up a test case to show how we are running stat_analysis. On
venus (WCOSS), please look in
/gpfs/dell3/mdl/mdlverif/noscrub/usr/John.L.Wagner/mdl-verification/MET_test_case.
In there, you will find a test script, a month's worth of data, the
stat_analysis config file, and the output of stat_analysis.
If you need any additional information, please let me know.
Thanks
John
-
Thanks John. I would appreciate your help in going over our stat-analysis
config files. Let me know what days/times work for you. My calendar is
mostly open Thursday-Friday this week and Tuesday-Friday next week.
I don't foresee the first issue that you reported below being a problem in
our normal processing. We copy the data to a run-specific directory in
ptmp on WCOSS, so there would be no previous runs for us to worry
about. We will start testing the rest of your suggestions this week.
-
Yes, we can mark this as answered. Thanks for your help!
-
Greetings
We have had a handful of cases now where the FCST_THRESH printed in the stat output file from stat_analysis is incorrect. We're running stat_analysis on WCOSS (venus in this case) using MET V10.0 from /gpfs/dell2/emc/verification/noscrub/emc.metplus/modulefiles. This problem has occurred when running stat_analysis to aggregate station scores for elements with multiple thresholds (qpf06 and sky cover). In the qpf06 example below, the top FCST_THRESH printed is >=6.35, when >=0.254 should have been printed, as it was for the 2 lines below it.
Rerunning this case seems to solve the issue, so it doesn't seem to be an issue with stat_analysis itself (at least it's not a repeatable problem). I'm wondering if the issue is related to the resources given to stat_analysis. We run in the dev2 queue, typically with 40 tasks running per node.
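Since a rerun produces the correct value, one way to catch affected files is to compare the FCST_THRESH column between the original output and a rerun of the same job. This is a minimal sketch, assuming the output is a whitespace-delimited file whose first row is a header containing a `FCST_THRESH` column, and that both runs write the same rows in the same order. The function names are hypothetical.

```python
def read_column(path, column="FCST_THRESH"):
    """Return one column's values from a whitespace-delimited file with a header row."""
    with open(path) as f:
        rows = [line.split() for line in f if line.strip()]
    header, data = rows[0], rows[1:]
    idx = header.index(column)  # raises ValueError if the column is missing
    return [row[idx] for row in data]


def diff_column(path_a, path_b, column="FCST_THRESH"):
    """List (data_row_number, value_a, value_b) for rows where the two runs disagree."""
    a = read_column(path_a, column)
    b = read_column(path_b, column)
    if len(a) != len(b):
        raise ValueError(f"row counts differ: {len(a)} vs {len(b)}")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b), start=1) if x != y]
```

An empty list means the two runs agree on every threshold. Any entries give the data row number and both values, which makes it easier to record when and where the bad threshold appears.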
Has anyone else running on WCOSS run into similar issues? If so, do you have any recommendations for setting the resources on a node? Have you run with fewer than 40 tasks in order to make more memory available for each task? Any recommendations would be appreciated.
Thanks
John