
Calculation job can be submitted twice if first submission succeeds but communication of result from scheduler to AiiDA times out #3404

Open
Zeleznyj opened this issue Oct 8, 2019 · 78 comments


@Zeleznyj

Zeleznyj commented Oct 8, 2019

I've encountered an issue where sometimes a calculation will show as finished in aiida, but the actual calculation on the remote computer is still running. Aiida will retrieve the files and run the parser without showing any error. This happened with the ssh transport and the slurm scheduler. I'm not sure if the problem is necessarily related to slurm though, since we are not using other schedulers much at the moment. The calculations use our own FPLO calculation plugins. It is possible that the issue is somehow related to some problem in the plugins, but to me it seems like a problem with aiida, since everything on our side is working fine. The calculation is submitted correctly and finishes correctly; the only problem is that the results are retrieved before the remote calculation is finished. This thus looks like a problem with parsing the queue status. The problem happens randomly; when we resubmit a calculation, it will usually finish fine.

I noticed the problem after checking out the develop branch a couple of days ago, but most likely the problem also existed before, when I was using the 1.0.0b6 version.

I can try to include more details, but I'm not sure where to start with debugging this.

@sphuber
Contributor

sphuber commented Oct 10, 2019

So just to be sure, you are using the standard slurm plugin that ships with aiida-core?

Are all the files you expect present in the retrieved folder? You can check by first getting the corresponding pk from verdi process show. The retrieved node should be listed under outputs. Then with verdi node repo ls you can show its contents. Are the files in there complete, or do they contain only partial output? Could you also share the output of verdi node attributes <calcjob_pk>?

@Zeleznyj
Author

Zeleznyj commented Oct 10, 2019

Yes, I'm using the default slurm plugin.

All files in the retrieved node are present, but they contain only partial output. When I use verdi calcjob gotocomputer to check the remote, the files there contain the full output; however, the retrieved files are incomplete since they were retrieved before the calculation finished.

Here is the output of verdi node attributes command:

{
    "append_text": "",
    "custom_scheduler_commands": "",
    "environment_variables": {},
    "exit_status": 0,
    "import_sys_environment": true,
    "job_id": "176740",
    "last_jobinfo": "{\"job_id\": \"176740\", \"wallclock_time_seconds\": 1456, \"title\": \"aiida-5941\", \"num_machines\": 1, \"detailedJobinfo\": null, \"queue_name\": \"newnodes\", \"num_mpiprocs\": 1, \"allocated_machines_raw\": \"node98\", \"submission_time\": {\"date\": \"2019-10-09T14:05:39.000000\", \"timezone\": null}, \"job_owner\": \"zeleznyj\", \"annotation\": \"None\", \"requested_wallclock_time_seconds\": 259200, \"raw_data\": [\"176740\", \"R\", \"None\", \"node98\", \"zeleznyj\", \"1\", \"1\", \"node98\", \"newnodes\", \"3-00:00:00\", \"24:16\", \"2019-10-09T14:05:40\", \"aiida-5941\", \"2019-10-09T14:05:39\"], \"job_state\": \"running\", \"dispatch_time\": {\"date\": \"2019-10-09T14:05:40.000000\", \"timezone\": null}}",
    "max_memory_kb": 3000000,
    "max_wallclock_seconds": 259200,
    "mpirun_extra_params": [],
    "output_filename": "fplo_out",
    "parser_name": "fplo",
    "prepend_text": "",
    "process_label": "FploCalculation",
    "process_state": "finished",
    "queue_name": "newnodes",
    "remote_workdir": "/zfs_data/zeleznyj/aiida_pc109/5f/1b/8ab9-7abe-4680-8b1b-7ba964bb5946",
    "resources": {
        "default_mpiprocs_per_machine": 24,
        "num_machines": 1,
        "num_mpiprocs_per_machine": 1
    },
    "retrieve_list": [
        "fplo_out",
        "=.dens",
        "=.in",
        "_scheduler-stdout.txt",
        "_scheduler-stderr.txt"
    ],
    "retrieve_temporary_list": [],
    "scheduler_lastchecktime": "2019-10-09T12:30:12.371370+00:00",
    "scheduler_state": "done",
    "scheduler_stderr": "_scheduler-stderr.txt",
    "scheduler_stdout": "_scheduler-stdout.txt",
    "sealed": true,
    "version": {
        "core": "1.0.0b6",
        "plugin": "0.5.5"
    },
    "withmpi": false
}

@sphuber
Contributor

sphuber commented Oct 10, 2019

Is there anything peculiar in the content of the scheduler output files?
You can display those with verdi calcjob outputcat <PK> _scheduler-stderr.txt and verdi calcjob outputcat <PK> _scheduler-stdout.txt.

Does this happen for all calculations of all types? Or only of this type? Or even only for some calculations of this type?

It would be good, once you have no calculations running anymore, to set verdi config logging.aiida_loglevel DEBUG followed by verdi daemon restart --reset. Then submit one calculation, run verdi daemon logshow, copy the output printed there to a file while that calculation is running, and paste it here.

@Zeleznyj
Author

The scheduler output files are empty.

We have seen this happening with one calculation type only and with one remote computer. However, this doesn't have to mean anything, since the vast majority of calculations we have done so far were with this calculation type and remote computer.

It seems to be happening randomly, with most calculations having no issue. If I repeat exactly the same calculation where the problem occurred, it will usually finish fine.

I will try to turn on the debugging and will post the output once the problem occurs.

@Zeleznyj
Author

I have managed to reproduce the issue again, so I can provide some more info. The issue also happens with a different remote computer, with the pbspro scheduler, and with a different calculation plugin. We used a slightly modified version of the pbspro scheduler, but I don't think the issue is related to the scheduler, since it seems to be exactly the same as with the slurm scheduler: AiiDA at some point starts to retrieve the calculation even though the calculation is still running or in the queue.

I was unable to reproduce the issue at first, since it seems to be quite random; however, I have now run hundreds of calculations and can see the issue in some of them. Because of this, the whole log file is massive, so I'm uploading only the parts relevant to some of the calculations where the issue occurred. I can provide the whole file if necessary, but it is over 1 GB. Even the cut parts are still quite large, so I put them in Dropbox. The files are named according to the process pk for which the issue occurred.

I often see the following error in verdi process report:

+-> ERROR at 2019-10-17 22:42:48.875613+00:00
 | Traceback (most recent call last):
 |   File "/jakzel/aiida/aiida_code/aiida-core/aiida/engine/utils.py", line 176, in exponential_backoff_retry
 |     result = yield coro()
 |   File "/jakzel/aiida/virtualenvs/aiida/lib/python2.7/site-packages/tornado/gen.py", line 1055, in run
 |     value = future.result()
 |   File "/jakzel/aiida/virtualenvs/aiida/lib/python2.7/site-packages/tornado/concurrent.py", line 238, in result
 |     raise_exc_info(self._exc_info)
 |   File "/jakzel/aiida/virtualenvs/aiida/lib/python2.7/site-packages/tornado/gen.py", line 1063, in run
 |     yielded = self.gen.throw(*exc_info)
 |   File "/jakzel/aiida/aiida_code/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 175, in do_update
 |     job_info = yield cancellable.with_interrupt(update_request)
 |   File "/jakzel/aiida/virtualenvs/aiida/lib/python2.7/site-packages/tornado/gen.py", line 1055, in run
 |     value = future.result()
 |   File "/jakzel/aiida/virtualenvs/aiida/lib/python2.7/site-packages/tornado/concurrent.py", line 238, in result
 |     raise_exc_info(self._exc_info)
 |   File "/jakzel/aiida/virtualenvs/aiida/lib/python2.7/site-packages/tornado/gen.py", line 1063, in run
 |     yielded = self.gen.throw(*exc_info)
 |   File "/jakzel/aiida/aiida_code/aiida-core/aiida/engine/utils.py", line 93, in with_interrupt
 |     result = yield wait_iterator.next()  # pylint: disable=stop-iteration-return
 |   File "/jakzel/aiida/virtualenvs/aiida/lib/python2.7/site-packages/tornado/gen.py", line 1055, in run
 |     value = future.result()
 |   File "/jakzel/aiida/virtualenvs/aiida/lib/python2.7/site-packages/tornado/concurrent.py", line 238, in result
 |     raise_exc_info(self._exc_info)
 |   File "<string>", line 3, in raise_exc_info
 | SchedulerError: Error during squeue parsing (_parse_joblist_output function)

but it seems to be unrelated since it also happens with calculations that finish fine, and not all calculations that have the issue have this error.

@Zeleznyj
Author

Zeleznyj commented Oct 25, 2019

Are there some things I could try to fix this issue or to investigate more?

It seems that we only see this issue on one particular computer (though it's hard to be sure because of the randomness), which is running Debian 10, whereas otherwise we have been using computers running Ubuntu. Is it possible that the issue is related to the operating system, or to the version of some package like RabbitMQ? I can try to experiment with this, but it's quite time consuming, because it's necessary to run a lot of calculations to verify whether the issue is there.

@sphuber
Contributor

sphuber commented Oct 25, 2019

Sorry for not responding earlier @Zeleznyj, I have been swamped with the release of aiida-core==1.0.0 next week. I will try to have a look after that, but I am afraid that as it is already difficult for you to reproduce, it will be even more so for me. I will have to base a potential cause on speculation and then see if there is any way to probe/test this. I will think about it and come back to you with hopefully something that you can test.

@ltalirz
Member

ltalirz commented Nov 13, 2019

@danieleongari at LSMO seems to have run into the same issue, i.e. this seems confirmed.
We're investigating to get an idea of what could be going wrong...

@ltalirz
Member

ltalirz commented Nov 13, 2019

@Zeleznyj thanks for your report.

Aiida at some points starts to retrieve the calculation even though the calculation is still running or in the queue.

Just to be absolutely sure (since this is important): you are definitely positive that the calculations were still running (also shown as such in slurm), rather than the problem being that the calculations finished correctly but AiiDA, for some reason, retrieved the files incompletely?
That would indeed be very helpful information.

@ltalirz
Member

ltalirz commented Nov 13, 2019

If the issue turns out to be related to the scheduler receiving too many requests and AiiDA misinterpreting its output (or it hangs for some reason), it may help to increase the safe_interval using verdi computer configure ssh <computername> (see the current value in verdi computer configure show <computername>).
The default value is 5s (per daemon worker), which could lead to issues if you have a lot of workers active.

Of course, this would just be a workaround, not a solution of the underlying problem.

P.S. The safe_interval limits the overall rate of SSH connections per worker, which is not strictly necessary here.
It would be enough to just increase the interval between polls of the scheduler - there is no CLI for it yet (we should add this, see #3541) but you can do it from the verdi shell:

In [1]: c = load_computer(1)

In [2]: c.get_minimum_job_poll_interval()
Out[2]: 10.0

In [3]: c.set_minimum_job_poll_interval(30)

@Zeleznyj
Author

The issue is definitely that the calculations are still running and are shown as such in slurm or pbspro when aiida retrieves them. I've seen it hundreds of times now.

I have in fact increased the safe interval to 60s for one of the computers we use. This didn't seem to have any influence on the issue as far as I can say; it is definitely still present. This is a computer that uses pbspro, and the qstat command is sometimes responding very slowly. Interestingly, it seems that the qstat command is only slow when I have aiida calculations running on this computer, though this may be a coincidence. Is it possible that even with increasing the safe_interval to 60s, aiida is flooding the pbspro server with too many requests? It seems strange to me since this is a large supercomputer with many users and hundreds of jobs in the queue.

I can try to increase the safe_interval even more.

The other computer we use is a smaller cluster, which uses slurm. There slurm is very responsive, but sometimes the ssh connection to the cluster is unstable and can drop out suddenly.

Perhaps I should also mention that the issue persists after upgrading to aiida v1.0.0 and that I'm using Python 2.

@ltalirz
Member

ltalirz commented Nov 14, 2019

Thanks @Zeleznyj, this is really helpful to know and we absolutely need to look into this.

I have in fact increased the safe interval to 60s for one of the computers we use. This didn't seem to have any influence on the issue as far as I can say.

Hm... can you try what I suggested above (to increase the minimum job poll interval) instead (increase e.g. to 120s - or some value that is reasonable given the typical duration of your calculations)?
I happen to know that @sphuber , who has not been experiencing this issue in ~5000 calculations, has increased this value.

@sphuber I wonder: What does currently happen when you have a very low minimum_job_poll_interval and a high safety_interval (say, 10x higher).
When the next SSH connection is allowed, is it possible that AiiDA tries to poll the scheduler 10 times within the same connection?

Also, is the status obtained from the scheduler shared between workers?

@sphuber
Contributor

sphuber commented Nov 14, 2019

@sphuber I wonder: What does currently happen when you have a very low minimum_job_poll_interval and a high safety_interval (say, 10x higher).
When the next SSH connection is allowed, is it possible that AiiDA tries to poll the scheduler 10 times within the same connection?
Also, is the status obtained from the scheduler shared between workers?

No, there is no sharing of information between workers on this level. Each worker has an instance of a JobManager which takes care of polling the scheduler. The processes run by a daemon worker will request an update of their status from this JobManager, which will poll the scheduler, respecting the minimum_job_poll_interval set on the given computer. To do so, it will first request a transport from the TransportQueue. The latter ensures that the safe_interval is respected. So in the worst case, a new scheduler update will occur every safe_interval + minimum_job_poll_interval seconds.
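
A rough back-of-the-envelope sketch of that worst-case cadence; the numbers below are assumptions for illustration, not values read from any real profile:

# Illustrative worst-case polling cadence per the explanation above.
safe_interval = 30.0               # seconds between SSH connections (TransportQueue)
minimum_job_poll_interval = 120.0  # seconds between scheduler polls (JobManager)
num_workers = 5                    # each daemon worker polls independently

worst_case_gap = safe_interval + minimum_job_poll_interval
print('per worker: at most one scheduler update every ~{:.0f} s'.format(worst_case_gap))
print('cluster-wide: up to ~{:.0f} polls per hour'.format(num_workers * 3600 / minimum_job_poll_interval))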

@Zeleznyj
Author

I have tried to increase the minimum_job_poll_interval to 120 s and I still see the issue.

I have also reproduced the issue on a different computer, which is running on Ubuntu 18.04 instead of Debian.

@sphuber
Contributor

sphuber commented Nov 19, 2019

Could you please run verdi config logging.aiida_loglevel INFO and restart the daemon. This should increase the logging level and print messages in the daemon log when the engine starts the retrieval of a certain calculation job. When you encounter another problematic run, try to find its pk in the log. There should be messages stating that it is starting to retrieve. Please compare the timestamps of those log messages with the output of ls on the remote working directory to confirm that retrieval starts before the output files are completely written.

@Zeleznyj
Author

I will try this, but I'm sure that the retrieval starts before the output files are completely written, since I routinely have to kill jobs that are still running on the remote computer, or are even still in the queue, even though they are finished within aiida.

@greschd
Member

greschd commented Nov 19, 2019

As an additional data point, I think I've had this happen when the input given was "wrong" in some sense (in my case, an empty k-points mesh for a QE pw.x calculation). I didn't follow up on the error because it was just wrong input, but maybe this means that such a state can happen if there is a curious exception while submitting the calculation?

@Zeleznyj are these calculations retrieved immediately after submission, or only after some time?

@sphuber
Contributor

sphuber commented Nov 19, 2019

@Zeleznyj I am working on a branch that will improve the accounting with SLURM and add logging that hopefully will help in pinpointing the problem. You are using SLURM, correct? I will hopefully be able to send it to you tomorrow for you to try out.

@sphuber
Contributor

sphuber commented Nov 19, 2019

@Zeleznyj please check out this branch from my fork, restart the daemon and continue launching. When you spot a problem again, please report here the output of verdi process report, verdi node attributes and the contents of the daemon log for the pk of the calculation that was retrieved too early. Thanks a lot

@Zeleznyj
Author

I will try.

I'm using two remote computers; one uses SLURM and the other PBSPRO. I see the issue on both, so I can test now with SLURM.

@greschd The calculations are retrieved seemingly randomly after some time, and in my case this is not connected to wrong input.

@Zeleznyj
Author

@sphuber I tried running calculations on your branch, but all of the calculations I run get stuck at 'Waiting for transport task: accounting stage'. This is the corresponding error from the daemon log:

Traceback (most recent call last):
  File "/home/kuba/bin/aiida/aiida/engine/utils.py", line 176, in exponential_backoff_retry
    result = yield coro()
  File "/home/kuba/virtualenvs/aiidapy/local/lib/python2.7/site-packages/tornado-4.5.3-py2.7-linux-x86_64.egg/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/kuba/virtualenvs/aiidapy/local/lib/python2.7/site-packages/tornado-4.5.3-py2.7-linux-x86_64.egg/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "/home/kuba/virtualenvs/aiidapy/local/lib/python2.7/site-packages/tornado-4.5.3-py2.7-linux-x86_64.egg/tornado/gen.py", line 1069, in run
    yielded = self.gen.send(value)
  File "/home/kuba/bin/aiida/aiida/engine/processes/calcjobs/tasks.py", line 239, in do_accounting
    job_accounting = scheduler.get_job_accounting(job_id)
  File "/ho  File "/home/kuba/bin/aiida/aiida/engine/processes/calcjobs/tasks.py", line 239, in do_accounting
    job_accounting = scheduler.get_job_accounting(job_id)
  File "/home/kuba/bin/aiida/aiida/schedulers/plugins/slurm.py", line 757, in get_job_accounting
    return self._parse_detailed_jobinfo(stdout)
  File "/home/kuba/bin/aiida/aiida/schedulers/plugins/slurm.py", line 773, in _parse_detailed_jobinfo
    self.logger.warning('scheduler job accounting details is empty so probably accounting is not enabled.')
NameError: global name 'self' is not defined

@sphuber
Contributor

sphuber commented Nov 20, 2019

Sorry about that, please go into aiida.schedulers.plugins.slurm and on line 759 remove the @staticmethod decorator and then add self as a method argument, i.e. it should become def _parse_detailed_jobinfo(self, jobinfo_details). Then restart the daemon and try again. This already shows me though that unfortunately the sacct command on your SLURM scheduler is not enabled. This makes it difficult to get the additional information we need.
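
To illustrate why that change is needed, here is a minimal, self-contained reproduction of the failure mode (this is not the actual plugin code): a @staticmethod has no self, so the self.logger call inside it raises the NameError shown in the traceback above.

import logging

class Scheduler(object):
    logger = logging.getLogger(__name__)

    # Buggy variant (sketch): referencing `self` inside a @staticmethod fails
    # at call time with "NameError: ... 'self' is not defined".
    #
    # @staticmethod
    # def _parse_detailed_jobinfo(jobinfo_details):
    #     self.logger.warning('...')
    #     return {}

    # Fixed variant, matching the signature suggested above.
    def _parse_detailed_jobinfo(self, jobinfo_details):
        if not jobinfo_details.strip():
            self.logger.warning('scheduler job accounting details is empty so probably accounting is not enabled.')
        return {}

Scheduler()._parse_detailed_jobinfo('')  # logs the warning instead of raising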

@Zeleznyj
Author

Here is the info for a job where the issue occurred:

verdi process report:

*** 14720: None
*** (empty scheduler output file)
*** Scheduler errors:
forrtl: severe (24): end-of-file during read, unit -4, file /proc/16945/fd/0
Image              PC                Routine            Line        Source             
fplo18.00-53-x86_  00000000010FE725  Unknown               Unknown  Unknown
fplo18.00-53-x86_  0000000000929CB2  Unknown               Unknown  Unknown
fplo18.00-53-x86_  0000000000B0F4BF  Unknown               Unknown  Unknown
fplo18.00-53-x86_  0000000000D4EAD1  Unknown               Unknown  Unknown
fplo18.00-53-x86_  0000000000D5762A  Unknown               Unknown  Unknown
fplo18.00-53-x86_  0000000000406896  Unknown               Unknown  Unknown
fplo18.00-53-x86_  00000000011EFAB0  Unknown               Unknown  Unknown
fplo18.00-53-x86_  0000000000406757  Unknown               Unknown  Unknown

*** 2 LOG MESSAGES:
+-> WARNING at 2019-11-21 15:48:17.173206+00:00
 | Job info not present, assuming it was completed
+-> ERROR at 2019-11-21 15:47:35.221711+00:00
 | Traceback (most recent call last):
 |   File "/home/kuba/bin/aiida/aiida/engine/utils.py", line 176, in exponential_backoff_retry
 |     result = yield coro()
 |   File "/home/kuba/virtualenvs/aiidapy/local/lib/python2.7/site-packages/tornado-4.5.3-py2.7-linux-x86_64.egg/tornado/gen.py", line 1055, in run
 |     value = future.result()
 |   File "/home/kuba/virtualenvs/aiidapy/local/lib/python2.7/site-packages/tornado-4.5.3-py2.7-linux-x86_64.egg/tornado/concurrent.py", line 238, in result
 |     raise_exc_info(self._exc_info)
 |   File "/home/kuba/virtualenvs/aiidapy/local/lib/python2.7/site-packages/tornado-4.5.3-py2.7-linux-x86_64.egg/tornado/gen.py", line 1069, in run
 |     yielded = self.gen.send(value)
 |   File "/home/kuba/bin/aiida/aiida/engine/processes/calcjobs/tasks.py", line 125, in do_submit
 |     raise Return(execmanager.submit_calculation(node, transport, calc_info, script_filename))
 |   File "/home/kuba/bin/aiida/aiida/engine/daemon/execmanager.py", line 284, in submit_calculation
 |     job_id = scheduler.submit_from_script(workdir, script_filename)
 |   File "/home/kuba/bin/aiida/aiida/schedulers/scheduler.py", line 408, in submit_from_script
 |     return self._parse_submit_output(retval, stdout, stderr)
 |   File "/home/kuba/bin/aiida/aiida/schedulers/plugins/slurm.py", line 473, in _parse_submit_output
 |     'stdout={}\nstderr={}'.format(retval, stdout, stderr))
 | SchedulerError: Error during submission, retval=1
 | stdout=
 | stderr=sbatch: error: slurm_receive_msg: Socket timed out on send/recv operation
 | sbatch: error: Batch job submission failed: Socket timed out on send/recv operation
 | 

verdi node attributes:

{
    "append_text": "",
    "custom_scheduler_commands": "",
    "environment_variables": {},
    "exit_status": 0,
    "import_sys_environment": true,
    "job_accounting": {},
    "job_id": "190074",
    "max_memory_kb": 3000000,
    "max_wallclock_seconds": 259200,
    "mpirun_extra_params": [],
    "output_filename": "fplo_out",
    "parser_name": "fplo",
    "prepend_text": "",
    "process_label": "FploCalculation",
    "process_state": "finished",
    "queue_name": "bignodes",
    "remote_workdir": "/zfs_data/zeleznyj/aiida/9f/15/8c21-ce91-4189-a831-a49806bf0423",
    "resources": {
        "default_mpiprocs_per_machine": 24,
        "num_machines": 1,
        "num_mpiprocs_per_machine": 1
    },
    "retrieve_list": [
        "fplo_out",
        "=.dens",
        "=.in",
        "_scheduler-stdout.txt",
        "_scheduler-stderr.txt"
    ],
    "retrieve_temporary_list": [],
    "scheduler_lastchecktime": "2019-11-21T15:48:17.185719+00:00",
    "scheduler_state": "done",
    "scheduler_stderr": "_scheduler-stderr.txt",
    "scheduler_stdout": "_scheduler-stdout.txt",
    "sealed": true,
    "version": {
        "core": "1.0.0",
        "plugin": "0.7.3"
    },
    "withmpi": false
}

I also attach the relevant part of daemon log.
14720.log

I don't understand what the scheduler error means. We see this sometimes; I'm not sure if it is related, but I saw the same issue also in cases where this error was not present. Despite this error, the calculation is running fine on the remote. At this time it is still running, so it is definitely not the case that aiida would just be retrieving incomplete files.

@sphuber
Copy link
Contributor

sphuber commented Nov 21, 2019

The scheduler error should not pose a problem, especially given that you have seen this with other jobs where the retrieval is fine. The engine simply tried to submit the job by calling sbatch but it failed. However, the engine caught the problem and simply rescheduled it, and the second try was successful. The problem then comes afterwards. You can see the warning message that I put in the report. It says that when the engine queried the status of all jobs, this particular one was missing. This is not necessarily problematic, as it is also what happens for calculations that actually complete, and it is when we assume that the job must have completed. The problem here, then, seems to be that the job id for this job was missing from the scheduler update response, but for the wrong reasons.

Just to be sure, the message we see in the scheduler stderr:

forrtl: severe (24): end-of-file during read, unit -4, file /proc/16945/fd/0

do you also see this in jobs that run just fine? As in, we should not have to worry about this? I take it it comes from the code itself that you are running.

The only thing I can think of now to further debug this is to add something in the prepend and append texts of this code to echo the job id. For example

echo $SLURM_JOBID > prepend.job_id.txt

and

echo $SLURM_JOBID > append.job_id.txt

In the prepend and append texts, respectively. This will echo the job id to the two files when the job actually runs, which we can then later check for a job with this failure. This is just to cross-verify that the job_id that gets stored in the node's attributes after submission is the same one that actually ends up running. Maybe a switch happens for some reason such that the engine actually checks the wrong job id. You can add these lines for the append and prepend text on a per-calculation basis through the options (so you don't have to configure a new code), but that choice is up to you.
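
For example, one possible way to set this per calculation through the options with the process builder (the code label below is a placeholder, and the exact option values are up to you):

from aiida.orm import load_code

builder = load_code('fplo@mycluster').get_builder()  # hypothetical code label
builder.metadata.options.prepend_text = 'echo $SLURM_JOB_ID > prepend.jobid'
builder.metadata.options.append_text = 'echo $SLURM_JOB_ID > append.jobid'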

As a last thing, I would maybe add one line of debugging in the job manager to print the actual response of the scheduler when we request the state of the jobs. I will see if I can add this to the branch tomorrow and let you know so you can try it out.

Thanks a lot for all the detailed reports so far and your patience 👍

@ltalirz
Member

ltalirz commented Nov 21, 2019

I would maybe add one line of debugging in the job manager to print the actual response of the scheduler when we request the state of the jobs.

I think that's crucial to debugging this issue (and perhaps we could even make it possible to get this at DEBUG level in aiida-core)

@sphuber
Contributor

sphuber commented Nov 22, 2019

I added another commit to my branch fix_2431_scheduler_parsing with a DEBUG log message containing the raw output returned by SLURM when calling squeue. Could you please check it out, set logging to DEBUG, restart the daemon (ideally with a single worker) and attempt to reproduce it? The daemon log should now contain rows like:

11/22/2019 04:02:11 PM <22393> aiida.scheduler.slurm: [DEBUG] Results of `_get_joblist_command`:
retval: 0
stderr: 
stdout: 
18300045^^^PD^^^Priority^^^n/a^^^sphuber^^^1^^^12^^^(Priority)^^^normal^^^10:00^^^0:00^^^N/A^^^aiida-116464^^^2019-11-22T16:00:44
18300091^^^PD^^^Priority^^^n/a^^^sphuber^^^1^^^12^^^(Priority)^^^normal^^^10:00^^^0:00^^^N/A^^^aiida-116468^^^2019-11-22T16:01:58

Hopefully we can correlate the job id of a failed calculation with the output we see here, i.e. whether it is there before and after retrieval started.

Also please launch your calculations with the following option in the inputs:

inputs = {
....
    'metadata': {
        'options': {
            ...
            'prepend_text': 'echo $SLURM_JOB_ID > prepend.jobid',
            'append_text': 'echo $SLURM_JOB_ID > append.jobid',
        }
    }
}

@Zeleznyj
Author

I will try this out.

Regarding the error in stderr, I now think it may be related. I did not think it was related before because the error is not present in all calculations that have the issue. However, I have now gone through all of the previous calculations and found that none of the calculations that finish correctly have the error. So maybe there are two issues, or two different origins for the issue. I will continue testing.

@Zeleznyj
Author

It seems something is amiss with the jobids. I've encountered another calculation where there's an error in stderr, but the calculation finishes correctly. The jobid in prepend.jobid is 191160, which agrees with the jobid in the attributes, whereas the jobid in append.jobid is 191147.

I've now checked that in the previous case I mentioned, the jobid of the running job was also different from the one in the attributes.

Here are the attributes:

{
    "append_text": "echo $SLURM_JOB_ID > append.jobid",
    "custom_scheduler_commands": "",
    "environment_variables": {},
    "exit_message": "Calculation did not produce all expected output files.",
    "exit_status": 100,
    "import_sys_environment": true,
    "job_accounting": {},
    "job_id": "191160",
    "max_memory_kb": 3000000,
    "max_wallclock_seconds": 259200,
    "mpirun_extra_params": [],
    "output_filename": "fplo_out",
    "parser_name": "fplowannier",
    "prepend_text": "echo $SLURM_JOB_ID > prepend.jobid",
    "process_label": "FploWannierCalculation",
    "process_state": "finished",
    "queue_name": "bignodes",
    "remote_workdir": "/zfs_data/zeleznyj/aiida/43/9b/417d-5312-44d2-9e88-aca2a05edd20",
    "resources": {
        "default_mpiprocs_per_machine": 24,
        "num_machines": 1,
        "num_mpiprocs_per_machine": 1
    },
    "retrieve_list": [
        "fplo_out",
        "+hamdata",
        "+band",
        "+wanband",
        "+wanbandtb",
        "+points",
        "_scheduler-stdout.txt",
        "_scheduler-stderr.txt"
    ],
    "retrieve_temporary_list": [],
    "scheduler_lastchecktime": "2019-11-25T11:53:17.083456+00:00",
    "scheduler_state": "done",
    "scheduler_stderr": "_scheduler-stderr.txt",
    "scheduler_stdout": "_scheduler-stdout.txt",
    "sealed": true,
    "version": {
        "core": "1.0.0",
        "plugin": "0.7.3"
    },
    "withmpi": false
}

Here is the report:

*** 17432: None
*** (empty scheduler output file)
*** Scheduler errors:
forrtl: severe (24): end-of-file during read, unit 57, file /zfs_data/zeleznyj/aiida/43/9b/417d-5312-44d2-9e88-aca2a05edd20/+wancoeff
Image              PC                Routine            Line        Source
fplo18.00-53-x86_  00000000010FE725  Unknown               Unknown  Unknown
fplo18.00-53-x86_  0000000000CB328A  Unknown               Unknown  Unknown
fplo18.00-53-x86_  0000000000C613FA  Unknown               Unknown  Unknown
fplo18.00-53-x86_  0000000000D50362  Unknown               Unknown  Unknown
fplo18.00-53-x86_  0000000000D5762A  Unknown               Unknown  Unknown
fplo18.00-53-x86_  0000000000406896  Unknown               Unknown  Unknown
fplo18.00-53-x86_  00000000011EFAB0  Unknown               Unknown  Unknown
fplo18.00-53-x86_  0000000000406757  Unknown               Unknown  Unknown
forrtl: severe (24): end-of-file during read, unit 57, file /zfs_data/zeleznyj/aiida/43/9b/417d-5312-44d2-9e88-aca2a05edd20/+wancoeff
Image              PC                Routine            Line        Source
fplo18.00-53-x86_  00000000010FE725  Unknown               Unknown  Unknown
fplo18.00-53-x86_  0000000000CB328A  Unknown               Unknown  Unknown
fplo18.00-53-x86_  0000000000C613FA  Unknown               Unknown  Unknown
fplo18.00-53-x86_  0000000000D50362  Unknown               Unknown  Unknown
fplo18.00-53-x86_  0000000000D5762A  Unknown               Unknown  Unknown
fplo18.00-53-x86_  0000000000406896  Unknown               Unknown  Unknown
fplo18.00-53-x86_  00000000011EFAB0  Unknown               Unknown  Unknown
fplo18.00-53-x86_  0000000000406757  Unknown               Unknown  Unknown

*** 4 LOG MESSAGES:
+-> ERROR at 2019-11-25 11:53:33.786617+00:00
 | parser returned exit code<100>: Calculation did not produce all expected output files.
+-> ERROR at 2019-11-25 11:53:33.782464+00:00
 | Found files '[u'+points', u'_scheduler-stderr.txt', u'_scheduler-stdout.txt', u'fplo_out']', expected to find '[u'fplo_out', '+hamdata']'
+-> WARNING at 2019-11-25 11:53:17.075818+00:00
 | Job info not present, assuming it was completed
+-> ERROR at 2019-11-25 11:51:43.678288+00:00
 | Traceback (most recent call last):
 |   File "/home/kuba/bin/aiida/aiida/engine/utils.py", line 176, in exponential_backoff_retry
 |     result = yield coro()
 |   File "/home/kuba/virtualenvs/aiidapy/local/lib/python2.7/site-packages/tornado-4.5.3-py2.7-linux-x86_64.egg/tornado/gen.py", line 1055, in run
 |     value = future.result()
 |   File "/home/kuba/virtualenvs/aiidapy/local/lib/python2.7/site-packages/tornado-4.5.3-py2.7-linux-x86_64.egg/tornado/concurrent.py", line 238, in result
 |     raise_exc_info(self._exc_info)
 |   File "/home/kuba/virtualenvs/aiidapy/local/lib/python2.7/site-packages/tornado-4.5.3-py2.7-linux-x86_64.egg/tornado/gen.py", line 1069, in run
 |     yielded = self.gen.send(value)
 |   File "/home/kuba/bin/aiida/aiida/engine/processes/calcjobs/tasks.py", line 125, in do_submit
 |     raise Return(execmanager.submit_calculation(node, transport, calc_info, script_filename))
 |   File "/home/kuba/bin/aiida/aiida/engine/daemon/execmanager.py", line 284, in submit_calculation
 |     job_id = scheduler.submit_from_script(workdir, script_filename)
 |   File "/home/kuba/bin/aiida/aiida/schedulers/scheduler.py", line 408, in submit_from_script
 |     return self._parse_submit_output(retval, stdout, stderr)
 |   File "/home/kuba/bin/aiida/aiida/schedulers/plugins/slurm.py", line 473, in _parse_submit_output
 |     'stdout={}\nstderr={}'.format(retval, stdout, stderr))
 | SchedulerError: Error during submission, retval=1
 | stdout=
 | stderr=sbatch: error: slurm_receive_msg: Socket timed out on send/recv operation
 | sbatch: error: Batch job submission failed: Socket timed out on send/recv operation
 |

I forgot to change the log level to DEBUG, so I don't have the additional debug output.

@ltalirz
Member

ltalirz commented Mar 23, 2020

Right, let's see whether we can get this started.
Pinging also @pzarabadip for info, since he had a use case as well (have you already made a scheduler plugin out of it or not?).

@ezpzbz
Member

ezpzbz commented Mar 23, 2020

Yes, I had a situation using a custom scheduler plugin, related to issue #2977.
At the time, I made a copy of the pbspro plugin and added a separate entry point in aiida-core itself (ezpzbz@3ace86e), which has not been a convenient way of doing it.
I still have not moved it to my plugins based on @ltalirz's nice suggestion. I will let you know once I do (which should be soon). I can then help update the documentation about implementing a custom scheduler plugin in a plugin package, which could somehow address issue #3853.

@broeder-j
Member

In case we are not able to solve the 'submit several times' part due to submission failures:

One possibility to at least make the calculations work fine is to ensure a different running directory for every submitted calcjob, i.e. if AiiDA tries to resubmit something, the directory where the job is actually executed is changed, ensuring that two running jobs never use the same running directory and, of course, that AiiDA knows which one to retrieve the files from. For this one could introduce a subdirectory structure on the remote machine like <aiida-run_dir>/<uuid>/<submission_try_number> (see the sketch below). I do not know how easy that would be: currently upload and submit are two separate tasks, and the file upload and preparation of the files have already happened before.
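
A purely illustrative sketch of the proposed layout; the submission-try counter is a hypothetical input that aiida-core does not currently track:

import os
import uuid

aiida_run_dir = '/scratch/aiida/run'   # placeholder for the computer's work directory
calc_uuid = str(uuid.uuid4())          # the calculation node's UUID
submission_try = 2                     # hypothetical retry counter

workdir = os.path.join(aiida_run_dir, calc_uuid, 'try_{:02d}'.format(submission_try))
print(workdir)  # e.g. /scratch/aiida/run/<uuid>/try_02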

@ltalirz
Member

ltalirz commented Apr 18, 2020

Coming back to the original issue:

  • We seem to be experiencing similar issues with the slurm scheduler on fidis.epfl.ch (only when submitting large numbers of jobs). This does make me question whether the issue is really on the scheduler side - in any case it means we need to find a workaround

  • @Zeleznyj mentions here that the DEBUG output of the _parse_joblist command is in the log twice: do you mean AiiDA prints things twice, or is there some duplication in the pbspro output?
    It would be great to have a look at the log, but the dropbox link you posted no longer works. Could you perhaps attach the log file (or at least a substantial section around the relevant part) directly to the comment?

  • @sphuber About the changes in your branch - do you think there is an issue with merging them into develop until this issue is resolved?
    It might make it easier for us to collect input on this issue. I've opened a PR of your rebased branch here Additional logging of job status #3942

@sphuber
Contributor

sphuber commented Apr 18, 2020

We seem to be experiencing similar issues with the slurm scheduler on fidis.epfl.ch (only, when submitting large numbers of jobs). This does make me question whether the issue is really on the scheduler side - in any case it means we need to find a workaround

I wouldn't be so sure of this. I spent a lot of time debugging this and once I could really trace what happened, the problem was really clear. We can verify this if you add the following to your inputs:

inputs = {
....
    'metadata': {
        'options': {
            ...
            'prepend_text': 'echo $SLURM_JOB_ID > prepend.jobid',
            'append_text': 'echo $SLURM_JOB_ID > append.jobid',
        }
    }
}

If my suspicion is correct, you will see different jobids in the prepend.jobid and append.jobid files. One of those will correspond to the jobid that is stored in the attributes. Please give that a go first.

@Zeleznyj mentions here that the DEBUG output of the _parse_joblist command is in the log twice: do you mean AiiDA prints things twice, or is there some duplication in the pbspro output?
It would be great to have a look at the log, but the dropbox link you posted no longer works. Could you perhaps attach the log file (or at least a substantial section around the relevant part) directly to the comment?

This is almost certainly unrelated and just due to a bug that I fixed in PR #3889, which is released in v1.2.0.

@sphuber About the changes in your branch (https://github.com/sphuber/aiida_core/commits/fix_2431_scheduler_parsing) - do you think there is an issue with merging them into develop until this issue is resolved?
It might make it easier for us to collect input on this issue. I've opened a PR of your rebased branch here: #3942

In principle not, but these were really added just because we didn't even know where to look in the beginning. I don't feel like merging those kinds of ad-hoc changes. If you really do feel it is important, then I would first clean it up and apply it to all schedulers as well, so at least the changes are consistent across the board.

@ltalirz
Member

ltalirz commented Apr 18, 2020

If my suspicion is correct, you will see different jobids in the prepend.jobid and append.jobid files. One of those will correspond to the jobid that is stored in the attributes. Please give that a go first.

Ah ok, so you think it's really not due to parsing of the scheduler status but the return code when submitting the job.
Happy to give that a try as well.

I just did some simple tests on fidis, submitting many jobs in short succession [1,2]. The sbatch command frequently hangs for a couple of seconds - e.g. submitting 500 jobs took about 87s; submitting 1000 jobs took 125s.
However, my script checks the exit code of sbatch and I have not been able to get a single non-zero exit code from sbatch.

Is the network connection perhaps the issue instead? Note also @zhubonan's comment; he pointed out he ran into this type of issue when the internet connection dropped.
@sphuber: If there is a connection issue while AiiDA checks for the job status, is there a mechanism in place for it to recover?

The only other idea that comes to my mind would be that sbatch somehow has a problem if the same user has multiple ssh connections open and submits from them at the same time (?).
It's certainly something that regular (non-AiiDA) users would basically never do.
That's also something I can try rather easily.

In order to continue these tests, I guess the best would be to now move to testing from my machine, and setting up a machinery that is more and more similar to what happens inside AiiDA.
Happy for pointers on how to best accomplish this.

P.S. Perhaps unrelated, but there are python libraries that seem to be performance-oriented replacements for paramiko, see #3929; they may be worth investigating.

[1] Fidis has a limit of 5000 maximum submitted jobs in running/pending state per user (MaxSubmitPU in sacctmgr list qos).

[2] job script

#!/bin/bash -l
#SBATCH --job-name=aiida-submission-debug
#SBATCH --nodes=1             #max 2 for debug
#SBATCH --ntasks=1  	      #28 cpus per node on fidis
#SBATCH --ntasks-per-node=1  #28 cpus per node on fidis
#SBATCH --time=0:00:10        #max 1h for debug
##SBATCH --partition=debug    #debug partition
#SBATCH --partition=parallel

echo "SLURM ID $SLURM_JOB_ID; TEST_ID $TEST_ID"

test script

#!/bin/bash
set -e
#set -x

date +"%T.%3N"
for i in {1..500}; do
  export TEST_ID=$i
  sbatch test.slurm
  echo "TEST $TEST_ID : exit code $?"

done
date +"%T.%3N"

@sphuber
Contributor

sphuber commented Apr 19, 2020

Ah ok, so you think it's really not due to parsing of the scheduler status but the return code when submitting the job.

Well, this is the case we fully confirmed with the OP of this thread. To summarize, the following seems to be happening:

  1. AiiDA calls submit
  2. Scheduler actually receives it and launches the job (could of course be with some delay) with jobid = 1
  3. However, (Scheduler fails to send / AiiDA fails to receive) the response, and so the exponential backoff mechanism is triggered and AiiDA will try to resubmit. This can be seen from the logs

| SchedulerError: Error during submission, retval=1
| stdout=
| stderr=sbatch: error: slurm_receive_msg: Socket timed out on send/recv operation
| sbatch: error: Batch job submission failed: Socket timed out on send/recv operation

  4. AiiDA submits again, in the same working directory, and this time it receives the response from the scheduler with jobid = 2.
  5. Job 2 now starts but crashes immediately because the starting files have already been modified by job 1 (this behavior is of course code dependent), and job 1 continues to run.
  6. AiiDA, which polls for jobid = 2, now sees the job is done and starts to retrieve. Since job 2 crashed at the line of the main code in the submit script, the append text is never executed, so append.jobid will contain jobid = 1 once written by job 1. However, the prepend text will have been run by job 2, so that file will contain jobid = 2. The same goes for the jobid in the attributes, as jobid = 2 is the only value that AiiDA successfully received. This corresponds perfectly with the results posted by @Zeleznyj.

The problem clearly originates in point 3. The question is, whose fault is it? Is the failure to communicate because SLURM is overloaded and fails to respond in time? Or are there connection issues and the SSH connection is killed before AiiDA can receive the message in full? Here I am not fully sure; it could be both, but given that they say this happens under heavy load of the scheduler (which is a relatively small machine), it might rather be SLURM that is overloaded.

Is the network connection perhaps the issue instead? Note also @zhubonan's comment; he pointed out he ran into this type of issue when the internet connection dropped.
@sphuber: If there is a connection issue while AiiDA checks for the job status, is there a mechanism in place for it to recover?

Yes, like any other transport task, if this fails, it will hit the exponential backoff mechanism (EBM) and try again. Since this is a transient problem for checking the status, this should not be a problem.
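
For reference, a generic, synchronous sketch of an exponential-backoff retry loop like the mechanism referred to above; the real implementation in aiida/engine/utils.py is an asynchronous coroutine, so this is only an illustration:

import time

def exponential_backoff_retry(func, initial_interval=10.0, max_attempts=5):
    """Call ``func`` until it succeeds or ``max_attempts`` is reached,
    doubling the wait time after every failure."""
    interval = initial_interval
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # give up; in AiiDA this is when the process gets paused
            time.sleep(interval)
            interval *= 2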

In order to continue these tests, I guess the best would be to now move to testing from my machine, and setting up a machinery that is more and more similar to what happens inside AiiDA.
Happy for pointers on how to best accomplish this.

Honestly, the best way is to actually use AiiDA. You can even use a dummy script and class, such as the ArithmeticAddCalculation, add the prepend_text and append_text to the options as I described, and see if the results match my analysis above.

@ltalirz
Member

ltalirz commented Apr 20, 2020

Just as a brief update - I've submitted 1000 calculations in one go on fidis using the script below (using 3 daemon workers and 500 slots per worker) and all finished fine.

Note: Since the "calculation" I'm running would not crash even if the previous one had started running, I'm simply appending job IDs to prepend.joblist and append.joblist.
In no case was the length of these files different from 1.

from aiida import orm
from aiida.plugins import DataFactory, CalculationFactory
from aiida import engine
import os

diff_code = orm.Code.get_from_string('diff@fidis')
DiffParameters = DataFactory('diff')
parameters = DiffParameters({'ignore-case': True})

SinglefileData = DataFactory('singlefile')
file1 = SinglefileData(
    file=os.path.abspath('f1'))
file2 = SinglefileData(
    file=os.path.abspath('f2'))

SFC = CalculationFactory('diff')

for i in range(1000):
    builder = SFC.get_builder()
    builder.code = diff_code
    builder.parameters = parameters
    builder.file1 = file1
    builder.file2 = file2
    builder.metadata.description = "job {:03d}".format(i)
    builder.metadata.options.prepend_text =  'echo $SLURM_JOB_ID >> prepend.jobid'
    builder.metadata.options.append_text =  'echo $SLURM_JOB_ID >> append.jobid'
    builder.metadata.options.max_wallclock_seconds = 10
    builder.metadata.options.resources = {
            'num_machines': 1,
            'num_mpiprocs_per_machine': 1,
    }

    result = engine.submit(SFC, **builder)
    print(result)

@Zeleznyj
Author

The scenario described by @sphuber is indeed something I saw happening, and I think we clearly confirmed that, but it is not the only case where I see this issue. We also see that calculations are sometimes incorrectly retrieved even if there is no submission error. This happens sometimes after a long time, and sometimes even while the job is still in the queue, so it really doesn't seem like the first issue. For me, this second issue is much more common than the first. I saw the second issue mainly with a remote that uses PBSPRO, and this is where I tested it, whereas the first issue I saw mainly with a remote that uses SLURM.

In this case the problem seems to be that aiida receives incomplete output from qstat. I'm attaching the _parse_joblist_output from the log file: 206691.short.LOG. The jobs that should be present are: '9700431.isrv5' '9700454.isrv5' '9700388.isrv5' '9700387.isrv5' '9700408.isrv5' '9700463.isrv5'. I know all of these were in fact still present, since I have a script separate from aiida that checks for running jobs. You can see in the output that the last two are missing and that the output is suddenly cut off during the output for job '9700387.isrv5'. So the last two jobs are retrieved even though they are still running.

I think the problem is somehow related to this PBSPRO server being often overloaded and very slow to respond. I'm not running any calculations there right now, so I cannot test, but I should be able to test in a month or so.

@giovannipizzi
Member

Thanks @Zeleznyj for the report!

I see two possibilities:

  • the scheduler returns an incomplete list of jobs. This is a bit problematic, and I am not sure how to fix this... Maybe we can try to see if we need the extended full log, and use the short job list when possible, if this helps? It would be good to know if you see truncated output if you run the qstat command on the scheduler yourself very often (e.g. you could run it in a for loop with a small 'sleep', print the count of bytes with wc -c, and see if sometimes you get a very small number).
  • AiiDA has a problem getting the full output, and only gets a truncated one. Maybe this is related to "Avoid deadlocks when retrieving stdout/stderr via SSH" #3787? Even if I would be surprised, as the file is truncated only at ~32K (suspicious number!), much smaller than what happens in #3787.

EDIT: I think the problem is that the scheduler fails with the following error code:
retval: 141

I.e., the return value (error code) is not zero, but 141.

Is your scheduler PBSPRO, and which version? I couldn't find this error here, even when checking numbers where ERRNO%256==141, but I also don't know if this is the right place to look (it seems these are the errors raised while submitting, not while calling qstat).

Probably the cause is this:

# pylint: disable=too-many-locals,too-many-statements,too-many-branches
# I don't raise because if I pass a list of jobs, I get a non-zero status
# if one of the job is not in the list anymore
# retval should be zero
# if retval != 0:
# _LOGGER.warning("Error in _parse_joblist_output: retval={}; "
# "stdout={}; stderr={}".format(retval, stdout, stderr))

However, as mentioned in the comment, this was done because if AiiDA passes an explicit list of jobs, as it does, you would get a non-zero error code (and this is very common: it's exactly what happens when you ask for a finished job).
Or at least, this was the case some years ago when I wrote this.
So, one cannot just uncomment those lines.

I don't have access to PBSpro anymore.
If you want to help, with PBSPro you could check which error code you get if you run a qstat -f command specifying two jobs that exist (hopefully this is zero), one job that exists and one that doesn't, and two jobs that do not exist (you can get this with something like qstat -f JOBID1 JOBID2 ; echo $?, if I am not mistaken).

Also, you could try to run a "working" qstat many times in a loop, and check if you randomly get a 141 error (or some other error); see the sketch below.
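
A possible sketch of that loop test in Python; the job ids are placeholders for currently running jobs, and the repetition count and sleep are arbitrary:

import subprocess
import time
from collections import Counter

job_ids = ['1234567.isrv5', '1234568.isrv5']  # placeholders: use currently running jobs
return_codes = Counter()

for _ in range(100):
    # Discard the output; we only care about the exit status of qstat -f.
    result = subprocess.run(['qstat', '-f'] + job_ids,
                            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return_codes[result.returncode] += 1
    time.sleep(5)

# Anything other than 0 (or the codes for finished/unknown jobs discussed
# elsewhere in this thread, e.g. 35 and 153) would be suspicious.
print(return_codes)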

Finally, the best would be if we could find some documentation of what 141 exactly means.

I see a few options:

  • 141 is what is returned if at least one job is missing. Then my analysis above is possibly incorrect
  • we understand what error number is issued when asking for a non-existing (e.g. completed) job, and this is always the same, distinguishable from other errors, and independent of the scheduler version. Then, we can uncomment the check for the retval, accepting both zero and this specific non-zero error code
  • there is no easy way to distinguish the error code for a non-existing (e.g. completed) job and an actual error. Then, we need to find more clever ways to detect these problems.

A final option is that 141 does not come from qstat, but e.g. from bash being interrupted or something like this. We might need to double check, depending on the results of the tests above.

@giovannipizzi
Member

By the way, it might be comforting to see that there is a distinguishable error message and that this is either 15001 or 15001%256=153: see this source code line

@Zeleznyj
Author

The PBSPRO version is 19.2.4. I've tested it, and when a job is missing the error is indeed 153. One possibility is that the 141 is related to some time-out of the request. It seems that in this particular case the response took a very long time: from the log file it seems that the qstat command was issued at 12:09 and the response was received only at 12:25.

12/14/2019 12:09:04 AM <2489> aiida.transport.SshTransport: [DEBUG] Command to be executed: cd '/home/zeleznyj' && qstat -f '9700431.isrv5' '9700454.isrv5' '9700388.isrv5' '9700387.isrv5' '9700408.isrv5' '9700463.isrv5'
12/14/2019 12:25:05 AM <2489> aiida.schedulers.plugins.pbsbaseclasses: [DEBUG] Results of `_parse_joblist_output`:

This slow responsiveness happens quite often with this server.

I will now try to see if I can reproduce the 141 error. It may be difficult to reproduce, though, since I strongly suspect now that the overloading of the server is in fact caused by aiida itself, and I'm not running any calculations now. I've increased the minimum_job_poll_interval to 180, but I still see that when I run calculations with aiida on this server, the responsiveness drops dramatically. The problem might be that I was normally running 5 daemon workers, and I suppose each one is sending requests independently.

@ezpzbz
Member

ezpzbz commented Apr 24, 2020

Hi @Zeleznyj
You are using IT4Innovations, am I right?

@Zeleznyj
Author

Hi @pzarabadip, yes this was at IT4Innovations.

@sphuber
Contributor

sphuber commented Apr 24, 2020

The problem might be that I was normally running 5 daemon workers and I suppose each one is sending requests independently.

Yes, this is correct, the connection and polling limits are guaranteed per daemon worker. So if you have a minimum polling interval of 180 seconds with 5 active workers, you can expect 5 poll requests per 180 seconds. The same goes for the SSH transport minimum interval. What have you set that to? You can see it with verdi computer show; it should be the safe_interval property.

@ezpzbz
Member

ezpzbz commented Apr 24, 2020

Hi @pzarabadip, yes this was at IT4Innovations.

Great. I'd like to share a recent experience there which may help in debugging this issue, or at least rule out one of the possible sources.

I am running loads of CP2K calculations there using AiiDA and never had any issue till this week. I started getting slow or no responses at random times, to the point that it was even triggering the exponential backoff mechanism. Again randomly, sometimes it resolved itself, or it persisted for five consecutive attempts, resulting in all jobs being put on pause.

I've investigated this from my side (Masaryk University) and the remote side (IT4I) and found the source of the issue. The IT4I login addresses (for example, salomon.it4i.cz) are round-robin DNS records for the actual login nodes, which does the load balancing. In my case, after tracing back the issue, I realized that one of the actual login nodes was blocked from our side (a false-positive security alert); therefore, when I was using the general address to connect, it was hanging randomly whenever the blocked login node was being called.
You may ping the login nodes to see if they are all accessible.
Cheers,

@Zeleznyj
Author

I cannot see the safe_interval in verdi computer show output. I have only found the Connection cooldown time in configure ssh, which is set to 30s. The slow response of the server is probably not just due to aiida. It seems that qstat returns output fast, but qstat -f is slow. I have now tried to run qstat -f with 50 job ids twice at the same time and it was very slow: the first one took 10 minutes and the second 35. This obviously becomes worse when I have 5 aiida daemon workers running, but I would say that this is mainly an issue with the server. I have tried the same thing on a different PBSPRO server and there the response is very fast. Interestingly, it is usually much faster to run qstat -f without any job ids, which returns information for all the running jobs.

I have done some testing now with PBSPRO. The error code of qstat -f when an already finished job is included is 35 (it's this one: #define PBSE_HISTJOBID 15139 /* History job ID *). So far the only error codes I saw were 35 (when job has already finished) and 153 (when I use nonexistent job id), but I continue testing. qstat will also print out an error message, when a job has finished or does not exist (for example qstat: 9872770.isrv5 Job has finished, use -x or -H to obtain historical job information).

@pzarabadip I don't think I have this issue; for me the connection to IT4I has been very stable. It's interesting that you haven't seen the same issue as I have, though (with jobs being retrieved while still running).

@ltalirz
Member

ltalirz commented Apr 24, 2020

I cannot see the safe_interval in verdi computer show output.

Yeah, this is in verdi computer configure show ... (because it was set with verdi computer configure). I also find this a bit confusing sometimes

@sphuber
Contributor

sphuber commented Jul 1, 2020

I just ran into this issue on Piz Daint, which uses a SLURM scheduler. The parsing of the output file failed, and after some investigation it is clear that this is because two processes were writing to it at the same time. Sure enough, looking in the process report, I saw that the submit task failed once. The scheduler stdout also contained the output of two SLURM jobs. So this is a clear example where the first submission succeeds on the scheduler side but the communication of the result back to AiiDA times out, so AiiDA submits again, resulting in two processes running in the same folder.

I notice that this was at a particularly busy moment on the cluster. Yesterday was the end of the allocation period and the queue was extraordinarily busy. Calling squeue on the cluster directly also took a long time to respond. It therefore seems very likely that the scheduler just had trouble keeping up and responding to AiiDA in time, and the request timed out.

I don't think it will be easy, or even possible, to detect this and fully prevent the second submission, but at least we can adapt the submission scripts to write a lock file when executed and to abort if one is already present. This can then be picked up by the scheduler output parsing that I have implemented (still not merged, but in an open PR), which can then at least fail the calculation with a well-defined exit code, and we won't have to search as long.
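
A rough sketch of that lock-file guard, set per calculation through the options; the file name and exit code are arbitrary choices for illustration, not an agreed convention, and the ArithmeticAddCalculation is used only as a stand-in builder:

from aiida.plugins import CalculationFactory

builder = CalculationFactory('arithmetic.add').get_builder()
builder.metadata.options.prepend_text = (
    'if [ -f .aiida_job.lock ]; then\n'
    '    echo "working directory already claimed by another job" >&2\n'
    '    exit 1\n'
    'fi\n'
    'echo $SLURM_JOB_ID > .aiida_job.lock'
)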


@sphuber
Contributor

sphuber commented Aug 26, 2020

Thanks for the writeup @ltalirz. I agree with the analysis that the detailed_job_info containing RUNNING for the status means the job status in the update task was incorrect. It indeed seems that the error is in the SLURM plugin incorrectly parsing the output. @ltalirz would you be willing to split your analysis off into a separate issue, and we keep this one for the originally reported issue where the same job is submitted twice in the same working directory?

@sphuber changed the title from "Aiida incorrectly consider calculations to be finished" to "Calculation job can be submitted twice if first submission succeeds but communication of result from scheduler to AiiDA times out" on Aug 26, 2020
@ltalirz
Member

ltalirz commented Aug 26, 2020

Sure, opening #4326 and moving my comments there (I'll hide them here to not make this thread longer than it needs to be).
