Task timeSeriesSeaIceAreaVol appears to have trouble dealing with data beyond a certain length #981

Closed
wlin7 opened this issue Feb 4, 2024 · 6 comments · Fixed by #982

wlin7 commented Feb 4, 2024

mpas-analysis for v3.LR.piControl-spinup fails to complete when the full time series is longer than about 1350 years. The problem started with the generation of ts_0051-1400_climo_1351-1400, with the following error message:

Unexpected status from in task timeSeriesSeaIceAreaVol.  This may be a bug.
Traceback (most recent call last):
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2_chrysalis/bin/mpas_analysis", line 10, in <module>
    sys.exit(main())
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2_chrysalis/lib/python3.10/site-packages/mpas_analysis/__main__.py", line 1015, in main
    run_analysis(config, analyses)
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2_chrysalis/lib/python3.10/site-packages/mpas_analysis/__main__.py", line 637, in run_analysis
    assert(runningProcessCount > 0)
AssertionError

Generating mpas_analysis_ts_0001-1350_climo_1251-1350 still worked, and so did the shorter ts_0101-1400_climo_1351-1400. However, the same error occurs for ts_0051-1400_climo_1351-1400.o465258 and ts_0101-1450_climo_1401-1450.o465657.

The full zppy logs from running mpas-analysis for these periods can be viewed at:

/lcrc/group/e3sm2/ac.wlin/E3SMv3/20231209.v3.LR.piControl-spinup.chrysalis/post/scripts/mpas_analysis_ts_0001-1400_climo_1351-1400.o46465073
/lcrc/group/e3sm2/ac.wlin/E3SMv3/20231209.v3.LR.piControl-spinup.chrysalis/post/scripts/mpas_analysis_ts_0051-1400_climo_1351-1400.o465258
/lcrc/group/e3sm2/ac.wlin/E3SMv3/20231209.v3.LR.piControl-spinup.chrysalis/post/scripts/mpas_analysis_ts_0101-1450_climo_1401-1450.o465657
Collaborator

xylar commented Feb 4, 2024

@wlin7, thanks for reporting this. I'm sorry you're having trouble with MPAS-Analysis. I will take a look tomorrow.

The error you're seeing may indeed indicate that a process was killed with an out-of-memory issue (perhaps not timeSeriesSeaIceAreaVol even though it got flagged). I will take a more careful look at the logs you pointed me to and find out what's going wrong.
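As a generic illustration of the out-of-memory hypothesis above (this is not MPAS-Analysis code), the sketch below shows how a parent process on a POSIX system can tell that a child was killed by a signal rather than exiting on its own. The Linux out-of-memory killer terminates processes with SIGKILL, and Python's `subprocess` module reports death by signal as a negative return code. The helper names `run_child` and `describe_exit` are hypothetical, and the child here kills itself only to simulate the situation.

```python
import signal
import subprocess
import sys


def run_child(code: str) -> int:
    """Run a Python snippet in a child process and return its exit status."""
    return subprocess.run([sys.executable, "-c", code]).returncode


def describe_exit(returncode: int) -> str:
    """Translate a POSIX child exit status into a readable message."""
    if returncode < 0:
        # On POSIX, subprocess reports death by signal N as -N.
        return f"killed by {signal.Signals(-returncode).name}"
    if returncode == 0:
        return "completed normally"
    return f"exited with error code {returncode}"


# Hypothetical child that sends itself SIGKILL, the signal used by the
# Linux OOM killer; a real task would be killed mid-computation instead.
oom_like = "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"

killed_status = describe_exit(run_child(oom_like))
ok_status = describe_exit(run_child("pass"))
print(killed_status)
print(ok_status)
```

A killed child leaves no Python traceback of its own, which is consistent with the parent only noticing later that a task never reported a normal status.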

Author

wlin7 commented Feb 4, 2024

@xylar, thank you for looking into the problem. It makes sense that it is an out-of-memory issue, which could occur during some later step of the task. The mpasTimeSeriesSeaIce.nc file appears to have been generated properly in terms of the record dimension size and the file size.

Collaborator

xylar commented Feb 5, 2024

@wlin7, I'm just now getting around to investigating this issue. I had to try to clean up some space because /lcrc/group/e3sm seemed to be completely full.

From the logs of individual tasks, it does look like timeSeriesSeaIceAreaVol is the only task that didn't complete so I'm trying to run that on its own to see if I can reproduce the problem in isolation. That will make it a lot easier to debug.

xylar self-assigned this Feb 5, 2024
Collaborator

xylar commented Feb 5, 2024

Yes, I can reproduce the problem. We are trying to load a 59 GB data set and then operate on it. I will fix this as soon as I can, hopefully tomorrow. I also hope to have a test deployment of e3sm-unified 1.9.3rc1 ready by tomorrow for you to see if the fix works. Today, I have 4 hours of meetings starting soon so I won't be able to make a lot of headway.
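For readers unfamiliar with this failure mode, here is a minimal, generic sketch of the usual alternative to loading an entire data set before operating on it: process it in fixed-size chunks and keep only running totals in memory. This is not the fix that went into #982 and it does not use the MPAS-Analysis API. The functions `chunks` and `chunked_mean` and the synthetic `monthly` series are hypothetical stand-ins chosen for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunks(values: Iterable[float], size: int) -> Iterator[List[float]]:
    """Yield successive lists of at most `size` items from `values`."""
    iterator = iter(values)
    while True:
        block = list(islice(iterator, size))
        if not block:
            return
        yield block


def chunked_mean(values: Iterable[float], size: int) -> float:
    """Compute the mean while holding only one chunk in memory at a time."""
    total = 0.0
    count = 0
    for block in chunks(values, size):
        total += sum(block)
        count += len(block)
    if count == 0:
        raise ValueError("cannot take the mean of an empty series")
    return total / count


# Hypothetical stand-in for a long monthly time series (1400 years of
# 12 months each), produced lazily so it is never fully materialized.
monthly = (float(i % 12) for i in range(1400 * 12))

mean = chunked_mean(monthly, size=120)  # 120 months = 10 years per chunk
print(mean)
```

Peak memory then scales with the chunk size rather than with the length of the time series, so extending the series changes the run time but not the memory footprint.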

Author

wlin7 commented Feb 5, 2024

@xylar, great that you have isolated the problem. Thanks a lot. While we definitely need this fix, there is no hurry for the current piControl-spinup. I also use a shorter time series (a later anomaly reference year) to better display the trend in the latest simulation period. So please take your time.

Collaborator

xylar commented Feb 5, 2024

Thanks @wlin7. It's helpful to know that this bug isn't keeping you from making immediate progress. That should give me time to fix the problem in a better way, rather than just trying to do it quickly.
