
Memory problems when running Autosubmit experiments on ClimateDT VM #2122

Open · kinow opened this issue Feb 11, 2025 · 1 comment
Labels
bug (Something isn't working) · destine (DestinE related) · discussion (created to keep track of a discussion) · under investigation (currently under investigation or debugging in order to find the source of the problem)
Milestone
4.1.12

Comments

@kinow (Member) commented Feb 11, 2025

I saw the messages on Mattermost about the memory consumption on the ClimateDT VM while I was running the testing suite, but I forgot to create the issue; apologies.

The oldest message I found on Mattermost, from @tiggi, is from February 3rd, about an experiment from Jeisson (a20a) using AS 4.1.10 (username replaced in the output below).

USER  PID      %CPU  %MEM  VSZ      RSS      TTY    STAT  START  TIME  COMMAND
user  2545476  0.2   5.3   4583040  1758484  pts/7  Sl+   Feb02  1:42  autosubmit log a20a recovery marenostrum5
user  2545519  0.0   5.3   2944660  1752176  pts/7  Sl+   Feb02  0:30  autosubmit log a20a recovery marenostrum5-login
user  2545675  0.2   5.6   4741672  1832304  pts/8  Sl+   Feb02  1:46  autosubmit log a209 recovery marenostrum5
user  2545705  0.0   5.5   3037756  1821712  pts/8  Sl+   Feb02  0:30  autosubmit log a209 recovery marenostrum5-login
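
For reference, the same numbers can be gathered with a small psutil sketch like the one below (psutil is an assumption here for diagnosis; it is not part of Autosubmit itself):

import psutil  # third-party library; an assumption for this sketch

# Print PID, resident memory, and command line of every running
# "autosubmit log ..." process, mirroring the `ps` columns above.
for proc in psutil.process_iter(["pid", "cmdline", "memory_info"]):
    cmdline = proc.info["cmdline"] or []
    if "log" in cmdline and any("autosubmit" in part for part in cmdline):
        rss_mib = proc.info["memory_info"].rss / 1024**2
        print(f"{proc.info['pid']:>8}  {rss_mib:8.1f} MiB  {' '.join(cmdline)}")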

This is not limited to 4.1.10. On Thursday, February 6th, Ulf reported that the experiment a24r was also using a lot of memory. That experiment is from @franra9 and was using 4.1.12-dev-e326af, so the issue is not an AS 4.1.10-only problem. It also appears Francesc had two processes running a24r; he probably hit the #2112 bug, which causes one process to fail but stay hanging around sleeping (its three children are zombies until they time out).

"the two processes are now up to 5 gigs of resident memory in total."
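
A quick way to confirm the #2112 pattern described above (a sleeping parent with defunct children) would be something like the sketch below; psutil and the function name are assumptions for illustration:

import psutil  # third-party; assumed available for diagnosis only

def zombie_children(pid):
    # Return the children of `pid` that are defunct, i.e. the "Z" state
    # `ps` would show for the hung #2112 processes described above.
    parent = psutil.Process(pid)
    return [child for child in parent.children()
            if child.status() == psutil.STATUS_ZOMBIE]

For example, running zombie_children(2545476) against one of the PIDs above should return a non-empty list if that process is in the #2112 state.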

Then it happened again today, in a209 with 4.1.10 (reported by @LuiggiTenorioK in our autosubmit-dev Slack room).

Setting the milestone to 4.1.12 in case we figure out what's going on quickly and it turns out to be a simple fix (e.g. reading the YAML files too often or unnecessarily). Otherwise this can move to 4.1.13 (it probably must, to make sure we fix it while they run the d-suite).
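
If re-reading the YAML files does turn out to be the culprit, the fix could be as small as a modification-time cache. A minimal sketch, using PyYAML and hypothetical names (whether Autosubmit actually re-parses the files, and which YAML library it uses, are assumptions here):

import os
import yaml  # PyYAML for the sketch; the real parser may differ

_cache = {}  # hypothetical module-level cache: path -> (mtime, parsed document)

def load_yaml_cached(path):
    # Re-parse a YAML file only when it changed on disk, instead of on
    # every access; replacing the entry keeps the cache size bounded.
    mtime = os.path.getmtime(path)
    cached = _cache.get(path)
    if cached is None or cached[0] != mtime:
        with open(path) as handle:
            _cache[path] = (mtime, yaml.safe_load(handle))
    return _cache[path][1]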

@kinow added the bug, destine, discussion, and under investigation labels on Feb 11, 2025
@kinow added this to the 4.1.12 milestone on Feb 11, 2025
@dbeltrankyl (Contributor) commented Feb 11, 2025

It seems more like the memory is not being freed properly somehow; I'm not sure whether the 4.1.12 case is related to Francesc having two experiments running and accessing the same data.

At worst, and as a temporary fix, we could make the process self-aware of its own memory consumption so that it kills itself when it grows too large, with Autosubmit then spawning a new one. But I'll try to check this properly this week.
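
For what it's worth, that temporary fix could look roughly like this; a sketch only, with psutil, the threshold, the hook, and the exit-code convention all being assumptions:

import os
import sys
import psutil  # third-party; an assumption for this sketch

RSS_LIMIT_BYTES = 2 * 1024**3  # hypothetical 2 GiB ceiling per log process

def exit_if_over_limit():
    # Called periodically from the log recovery loop (hypothetical hook):
    # if our resident memory passed the ceiling, exit with a distinctive
    # code so the parent Autosubmit process knows to spawn a replacement.
    rss = psutil.Process(os.getpid()).memory_info().rss
    if rss > RSS_LIMIT_BYTES:
        print(f"RSS {rss / 1024**2:.0f} MiB over limit; exiting for respawn",
              file=sys.stderr)
        sys.exit(75)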
