Calculation job can be submitted twice if first submission succeeds but communication of result from scheduler to AiiDA times out #3404
So just to be sure: you are using the standard `slurm` scheduler plugin? And are all the files you expect present in the retrieved folder?
Yes, I'm using the default slurm plugin. All files in the retrieved node are present, but they contain only partial output. When I use `verdi calcjob gotocomputer` to check the remote directory, the files there contain the full output; the retrieved files, however, are incomplete because they were retrieved before the calculation had finished. Here is the output of the `verdi node attributes` command:
Is there anything peculiar in the content of the scheduler output files? Does this happen for all calculations of all types? Or only of this type? Or even only for some calculations of this type? Once you have no calculations running anymore, it would also be good to set the logging level to DEBUG and post the output when the problem occurs.
The scheduler output files are empty. We have seen this happening with one calculation type only and with one remote computer. However, this doesn't have to mean anything, since the vast majority of calculations we have done so far were with this calculation type and remote computer. It seems to happen randomly, with most calculations having no issue. If I repeat exactly the same calculation where the problem occurred, it will usually finish fine. I will try to turn on the debugging and will post the output once the problem occurs.
I have managed to reproduce the issue again, so I can provide some more info. The issue also happens with a different remote computer, with the pbspro scheduler and with a different calculation plugin. We used a slightly modified version of the pbspro scheduler, but I don't think the issue is related to the scheduler, since it seems to be exactly the same as with the slurm scheduler: AiiDA at some point starts to retrieve the calculation even though it is still running or in the queue.

I was unable to reproduce the issue at first, since it seems to be quite random, but now I've run hundreds of calculations and can see the issue in some of them. Because of this the whole log file is massive, so I'm uploading only the parts relevant to some of the calculations where the issue occurred. I can provide the whole file if necessary, but it is over 1 GB. Even the cut parts are still quite large, so I put them in Dropbox. The files are named according to the pk of the process where the issue occurred. I often see the following error in the logs:
but it seems to be unrelated, since it also happens with calculations that finish fine, and not all calculations that have the issue show this error.
Are there some things I could try to fix this issue or to investigate it further? It seems that we only see this issue on one particular computer (though it's hard to be sure because of the randomness), which is running Debian 10, whereas otherwise we have been using computers running Ubuntu. Is it possible that the issue is related to the operating system, or to the version of some package like RabbitMQ? I can try to experiment with this, but it's quite time consuming, because it's necessary to run a lot of calculations to verify whether the issue is there.
Sorry for not responding earlier @Zeleznyj, I have been swamped with the recent release.
@danieleongari at LSMO seems to have run into the same issue, i.e. this seems confirmed.
@Zeleznyj thanks for your report.
Just to be absolutely sure (since this is important): you are definitely positive that the calculations were still running (also shown as such in slurm), rather than the problem being that the calculations finished correctly but then AiiDA, for some reason, retrieved the files incompletely?
If the issue turns out to be related to the scheduler receiving too many requests, and AiiDA misinterpreting its output (or the scheduler hanging for some reason), it may help to increase the minimum job poll interval. Of course, this would just be a workaround, not a solution of the underlying problem.
The issue is definitely that the calculations are still running and are shown as such in slurm or pbspro when aiida retrieves them. I've seen it hundreds of times now. I have in fact increased the safe interval to 60s for one of the computers we use. This didn't seem to have any influence on the issue as far as I can tell; it is definitely still present. This is a computer that uses pbspro, and the qstat command sometimes responds very slowly. Interestingly, it seems that qstat is only slow when I have aiida calculations running on this computer, though this may be a coincidence. Is it possible that even with the safe_interval increased to 60s, aiida is flooding the pbspro server with too many requests? It seems strange to me, since this is a large supercomputer with many users and hundreds of jobs in the queue. I can try to increase the safe_interval even more.

The other computer we use is a smaller cluster, which uses slurm. There slurm is very responsive, but sometimes the ssh connection to the cluster is unstable and can drop out suddenly. Perhaps I should also mention that the issue persists after upgrading to aiida v1.0.0 and that I'm using python2.
Thanks @Zeleznyj, this is really helpful to know and we absolutely need to look into this.
Hm... can you try what I suggested above (to increase the minimum job poll interval) instead? Increase it e.g. to 120s, or some value that is reasonable given the typical duration of your calculations. @sphuber I wonder: what currently happens when you have a very low minimum job poll interval? Also, is the status obtained from the scheduler shared between workers?
No, there is no sharing of information between workers on this level. Each worker has its own instance of the job manager, which polls the scheduler independently.
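For reference, since each worker polls independently, the per-computer minimum job poll interval mentioned above can presumably be raised from a `verdi shell` along these lines (a sketch: the computer label is a placeholder and it assumes the `set_minimum_job_poll_interval` setter available in this AiiDA version):

```python
# Sketch: raise the per-worker scheduler polling interval for a given computer.
from aiida.orm import load_computer

computer = load_computer('my-cluster')  # placeholder label
computer.set_minimum_job_poll_interval(120.0)  # seconds between scheduler polls, per daemon worker
```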
I have tried to increase the minimum_job_poll_interval to 120 s and I still see the issue. I have also reproduced the issue on a different computer, which is running Ubuntu 18.04 instead of Debian.
Could you please run
I will try this, but I'm sure that the retrieval starts before the output files are completely written, since I routinely have to kill jobs that are still running on the remote computer, or are even still in the queue, although they are already finished within aiida.
As an additional data point, I think I've had this happen when the input given was "wrong" in some sense (in my case, an empty k-points mesh for a QE calculation). @Zeleznyj are these calculations retrieved immediately after submission, or only after some time?
@Zeleznyj I am working on a branch that will improve the accounting with SLURM and add logging that hopefully will help in pinpointing the problem. You are using SLURM, correct? I will hopefully be able to send it to you tomorrow for you to try out.
@Zeleznyj please check out this branch from my fork, restart the daemon and continue launching. When you spot a problem again, please report here the output of `verdi node report` and `verdi node attributes` for the affected job.
I will try. I'm using two remote computers, one using SLURM and the other PBSPRO. I see the issue on both, so I can test now with SLURM. @greschd The calculations are retrieved seemingly randomly after some time, and in my case this is not connected to wrong input.
@sphuber I tried running calculations on your branch, but all of the calculations I run get stuck at 'Waiting for transport task: accounting stage'; this is the corresponding error from the daemon log:
Sorry about that, please go into
Here is the info for a job where the issue occurred. verdi node report:
verdi node attributes:
I also attach the relevant part of the daemon log. I don't understand what the scheduler error means. We see this sometimes; I'm not sure if it is related, but I saw the same issue also in cases where this error was not present. Despite this error, the calculation is running fine on the remote. At this time it is still running, so it is definitely not the case that aiida would just be retrieving incomplete files.
The scheduler error should not pose a problem, especially given that you have seen this with other jobs where the retrieval is fine. The engine simply tried to submit the job by calling the scheduler's submit command. Just to be sure, the message we see in the scheduler stderr:
do you also see this in jobs that run just fine? As in, we should not have to worry about this? I take it it comes from the code itself that you are running.

The only thing I can think of now to further debug this is to add something to the prepend and append texts of this code to echo the job id, for example `echo $SLURM_JOB_ID >> prepend.jobid` and `echo $SLURM_JOB_ID >> append.jobid` in the prepend and append texts, respectively. This will echo the job id, when the job actually runs, to the two files, which we can then later check for a job with this failure. This is just to cross-verify that the job id that gets stored in the node's attributes after submission is the same one that actually runs in the end. Maybe a switch happens for some reason, such that the engine actually checks the wrong job id. You can add these lines for the prepend and append text on a per-calculation basis through the options (so you don't have to configure a new code), but that choice is up to you.

As a last thing, I would maybe add one line of debugging in the job manager to print the actual response of the scheduler when we request the state of the jobs. I will see if I can add this to the branch tomorrow and let you know so you can try it out. Thanks a lot for all the detailed reports so far and your patience 👍
I think that's crucial to debugging this issue (and perhaps we could even make it possible to get this at DEBUG level in aiida-core).
I added another commit to my branch
Hopefully we can correlate the job id of a failed calculation with the output we see here, i.e. whether it is there before and after retrieval started. Also please launch your calculations with the job-id echo lines suggested above in the prepend and append text options of the inputs.
I will try this out. Regarding the error in stderr, I now think it may be related. I did not think it was related before because the error is not present in all calculations which have the issue. However, I have now gone through all of the previous calculations and found that none of the calculations that finish correctly have the error. So maybe there are two issues, or two different origins for the issue. I will continue testing.
It seems something is amiss with the jobids. I've encountered another calculation where there is an error in stderr, but the calculation finishes correctly. The jobid in prepend.jobid is 191160, which agrees with the jobid in the attributes, whereas the jobid in append.jobid is 191147. I've now checked that in the previous case I mentioned, the jobid of the running job was likewise different from the one in the attributes. Here are the attributes:
Here is the report:
I forgot to change the logging level to DEBUG, so I don't have the additional debug output.
Right, let's see whether we can get this started.
Yes, I had a situation when using a custom scheduler plugin, related to issue #2977.
In case we are not able to solve the 'submitted twice' part due to submission failures, one possibility to make the calculations at least work fine is to ensure a different running directory for every submitted calcjob. I.e. if aiida tries to resubmit something, the directory where the job is actually executed is changed, ensuring that no two running jobs ever use the same running directory and, of course, that aiida knows from which one to retrieve the files. For this one could introduce a subdirectory structure on the remote machine, e.g. one subdirectory per submission attempt.
Coming back to the original issue:
I wouldn't be so sure of this. I spent a lot of time debugging this, and once I could really trace what happened, the problem was very clear. We can verify this if you add the following to your inputs:
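Something along these lines (a sketch, assuming the inputs are assembled with a `builder`; the file names match those used elsewhere in this thread):

```python
# Echo the SLURM job id into marker files from the prepend and append text, so a
# double submission shows up as two different job ids for the same directory.
builder.metadata.options.prepend_text = 'echo $SLURM_JOB_ID >> prepend.jobid'
builder.metadata.options.append_text = 'echo $SLURM_JOB_ID >> append.jobid'
```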
If my suspicion is correct, you will see different jobids in the prepend.jobid and append.jobid files.
This is almost certainly unrelated and just due to a bug that I fixed in PR #3889, which has since been released.
In principle not, but these were really added just because we didn't even know where to look in the beginning. I don't feel like merging those kinds of ad-hoc changes. If you really do feel it is important, then I would first clean it up and apply it to all schedulers as well, so that at least the changes are consistent across the board.
Ah ok, so you think it's really not due to parsing of the scheduler status but to the return code when submitting the job. I just did some simple tests on fidis, submitting many jobs in short succession [1,2].

Is perhaps instead the network connection the issue? Note also @zhubonan's comment, who pointed out he ran into this type of issue when the internet connection dropped. The only other idea that comes to my mind would be that

In order to continue these tests, I guess the best would be to now move to testing from my machine, and to set up machinery that is more and more similar to what happens inside AiiDA.

P.S. Perhaps unrelated, but there are python libraries that seem to be performance-oriented replacements for paramiko, see #3929; they may be worth investigating.

[1] Fidis has a limit of 5000 maximum submitted jobs in running/pending state per user.

[2] job script:

```bash
#!/bin/bash -l
#SBATCH --job-name=aiida-submission-debug
#SBATCH --nodes=1              #max 2 for debug
#SBATCH --ntasks=1             #28 cpus per node on fidis
#SBATCH --ntasks-per-node=1    #28 cpus per node on fidis
#SBATCH --time=0:00:10         #max 1h for debug
##SBATCH --partition=debug     #debug partition
#SBATCH --partition=parallel

echo "SLURM ID $SLURM_JOB_ID; TEST_ID $TEST_ID"
```

test script:

```bash
#!/bin/bash
set -e
#set -x
date +"%T.%3N"
for i in {1..500}; do
    export TEST_ID=$i
    sbatch test.slurm
    echo "TEST $TEST_ID : exit code $?"
done
date +"%T.%3N"
```
Well, this is the case we fully confirmed with the OP of this thread. To summarize, the following seems to be happening:

1. AiiDA submits the job to the scheduler over SSH.
2. The scheduler accepts the job and will eventually run it.
3. The communication of the submission result back to AiiDA fails or times out.
4. AiiDA therefore considers the submission failed and retries it, so a second job ends up running in the same working directory.
The problem clearly originates in point 3. The question is: whose fault is it? Is the failure to communicate due to SLURM being overloaded and not responding in time? Or are there connection issues, with the SSH connection being killed before AiiDA can receive the message in full? Here I am not fully sure; it could be both, but given that they say this happens under heavy load of the scheduler (which is on a relatively small machine), it might rather be SLURM that is overloaded.
Yes, like any other transport task, if this fails it will hit the exponential backoff mechanism (EBM) and try again. Since this is a transient problem when checking the status, this should not be an issue.
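As an aside, the general shape of such a backoff mechanism is roughly the following (a generic illustration only, not AiiDA's actual EBM code; names and defaults are made up):

```python
# Generic exponential-backoff retry loop: wait 10 s, 20 s, 40 s, ... between attempts.
import time

def run_with_backoff(task, max_attempts=5, initial_wait=10):
    for attempt in range(max_attempts):
        try:
            return task()
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            wait = initial_wait * 2 ** attempt
            print('attempt {} failed ({}); retrying in {} s'.format(attempt + 1, exc, wait))
            time.sleep(wait)
```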
Honestly, the best way is to actually use AiiDA. You can even use a dummy script and calculation class.
Just as a brief update: I've submitted 1000 calculations in one go on fidis using the script below (using 3 daemon workers and 500 slots per worker) and all finished fine.

Note: since the "calculation" I'm running would not crash even if the previous one had started running, I'm simply appending the job IDs to the prepend.jobid and append.jobid files.

```python
from aiida import orm
from aiida.plugins import DataFactory, CalculationFactory
from aiida import engine
import os

diff_code = orm.Code.get_from_string('diff@fidis')

DiffParameters = DataFactory('diff')
parameters = DiffParameters({'ignore-case': True})

SinglefileData = DataFactory('singlefile')
file1 = SinglefileData(file=os.path.abspath('f1'))
file2 = SinglefileData(file=os.path.abspath('f2'))

SFC = CalculationFactory('diff')

for i in range(1000):
    builder = SFC.get_builder()
    builder.code = diff_code
    builder.parameters = parameters
    builder.file1 = file1
    builder.file2 = file2
    builder.metadata.description = "job {:03d}".format(i)
    builder.metadata.options.prepend_text = 'echo $SLURM_JOB_ID >> prepend.jobid'
    builder.metadata.options.append_text = 'echo $SLURM_JOB_ID >> append.jobid'
    builder.metadata.options.max_wallclock_seconds = 10
    builder.metadata.options.resources = {
        'num_machines': 1,
        'num_mpiprocs_per_machine': 1,
    }
    result = engine.submit(SFC, **builder)
    print(result)
```
The scenario described by @sphuber is indeed something I saw happening, and I think we clearly confirmed that, but it is not the only case where I see this issue. We also see that calculations are sometimes incorrectly retrieved even when there is no submission error. This happens sometimes after a long time, and sometimes even while the job is still in the queue, so it really doesn't seem to be the first issue. For me, this second issue is much more common than the first. I saw the second issue mainly with a remote that uses PBSPRO, and this is where I tested it, whereas the first issue I saw mainly with a remote that uses SLURM. In this case the problem seems to be that aiida receives an incomplete output from qstat. I'm attaching the relevant part of the log.

I think the problem is somehow related to this PBSPRO server being often overloaded and very slow to respond. I'm not running any calculations there right now, so I cannot test, but I should be able to test in a month or so.
Thanks @Zeleznyj for the report!
EDIT: I think the problem is that the scheduler fails with the following error code, i.e. the return value (error code) is not zero, but 141. Is your scheduler PBSPRO, and which version? I couldn't find this error documented, even when checking related error-code numbers. Probably the cause is this:

aiida-core/aiida/schedulers/plugins/pbsbaseclasses.py, lines 362 to 369 (at e8e7e46)
However, as mentioned in the comment, this was done because if AiiDA passes an explicit list of jobs, as it does, you would get a non-zero error code (and this is very common: it happens exactly when you ask for a job that has already finished). I don't have access to PBSpro anymore. Also, you could try to run a "working" qstat many times in a loop and check whether you randomly get a 141 error (or some other error). Finally, the best would be if we could find some documentation of what 141 exactly means. I see a few options:
A final option is that 141 does not come from qstat itself, but e.g. from the shell or the connection: exit codes of the form 128+N typically indicate termination by signal N, and 141 would correspond to SIGPIPE.
By the way, it might be comforting to see that there is a distinguishable error message.
The PBSPRO version is 19.2.4. I've tested it, and when a job is missing the error is indeed 153. One possibility is that the 141 is related to some time-out of the request. It seems that in this particular case the response took a very long time: from the log file, the qstat command was issued at 12:09 and the response was received only at 12:25.
This slow responsiveness happens quite often with this server. I will now try to see if I can reproduce the 141 error. It may be difficult to reproduce, though, since I strongly suspect that the overloading of the server is in fact caused by aiida itself and I'm not running any calculations now. I've increased the minimum_job_poll_interval to 180, but I still see that when I run calculations with aiida on this server, the responsiveness drops dramatically. The problem might be that I was normally running 5 daemon workers, and I suppose each one is sending requests independently.
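One way to try to reproduce the 141 error outside of AiiDA would be a loop along these lines (a sketch: the job id is a placeholder and the qstat arguments may need adjusting for this PBSpro setup):

```python
# Repeatedly query an existing job and log any non-zero qstat exit codes.
import subprocess
import time

JOB_ID = '12345.pbs-server'  # placeholder: use a real job id on the cluster

for i in range(1000):
    result = subprocess.run(['qstat', '-f', JOB_ID],
                            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    if result.returncode != 0:
        print('iteration {}: qstat exited with code {}'.format(i, result.returncode))
    time.sleep(1)
```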
Hi @Zeleznyj, was this by any chance at IT4Innovations?
Hi @pzarabadip, yes this was at IT4Innovations.
Yes, this is correct, the connection and polling limits are guaranteed per daemon worker. So if you have a minimum polling interval of 180 seconds with 5 active workers, you can expect 5 poll requests per 180 seconds. The same goes for the SSH transport minimum interval. What have you set that to? You can see it with
Great. I'd like to share a recent experience there which may help with debugging this issue, or at least rule out one possible source of it. I am running loads of calculations there. I've investigated this from my side.
I cannot see the safe_interval there. I have done some testing of the error codes now with PBSPRO.

@pzarabadip I don't think I have this issue; for me the connection to IT4I has been very stable. It's interesting, though, that you haven't seen the same issue as I have (with jobs being retrieved while still running).
Yeah, this is in
I just ran into this issue on Piz Daint, which uses a SLURM scheduler. The parsing of the output file failed, and after some investigation it is clear that this is because two processes were writing to it at the same time. Sure enough, looking at the process report, I saw that the submit task failed once. The scheduler stdout also contained the output of two SLURM jobs. So this is a clear example where the scheduler accepts the first submission but then the communication of the result to AiiDA times out, so AiiDA submits again, resulting in two processes running in the same folder.

I notice that this was at a particularly busy moment on the cluster. Yesterday was the end of the allocation period and the queue was extraordinarily busy. Calling the scheduler commands took a long time.

I don't think it will be easy or even possible to detect this and fully prevent the second submission, but at least we can adapt the submission scripts to write a lock file when executed and, if it is already present, abort. This can then be picked up by the scheduler-output parsing that I have implemented (still not merged but in an open PR), which can at least fail the calculation with a well-defined exit code, and we won't have to search as long.
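A minimal sketch of that lock-file idea, expressed here through the prepend_text option (the file name and the `builder` are illustrative; this is not the actual implementation from the open PR):

```python
# Abort the submission script if it has already run once in this working directory.
# 'aiida_submission.lock' is an illustrative name for the lock file.
lock_guard = '\n'.join([
    'if [ -f aiida_submission.lock ]; then',
    '    echo "submission script executed twice in this directory" >&2',
    '    exit 1',
    'fi',
    'touch aiida_submission.lock',
])
builder.metadata.options.prepend_text = lock_guard
```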
Thanks for the writeup @ltalirz. I agree with the analysis.
Sure, opening #4326 and moving my comments there (I'll hide them here to not make this thread longer than it needs to be).
I've encountered an issue where sometimes a calculation will show as finished in aiida, but the actual calculation on the remote computer is still running. Aiida will retrieve the files and run the parser without showing any error. This happened with the ssh transport and the slurm scheduler. I'm not sure the problem is necessarily related to slurm, though, since we are not using other schedulers much at the moment. The calculations use our own FPLO calculation plugins. It is possible that the issue is somehow related to some problem in the plugins, but to me it seems like a problem with aiida, since everything on our side is working fine: the calculation is submitted correctly and finishes correctly, the only problem is that the results are retrieved before the remote calculation is finished. This thus looks like a problem with parsing the queue status. The problem happens randomly; when we resubmit a calculation, it will usually finish fine.
I noticed the problem after checking out the develop branch a couple of days ago, but most likely the problem existed before as well, when I was using the 1.0.0b6 version.
I can try to include more details, but I'm not sure where to start with debugging this.