verdi computer test does not query jobs by user #2977

Open
ltalirz opened this issue Jun 7, 2019 · 5 comments

Comments

@ltalirz
Member

ltalirz commented Jun 7, 2019

@pzarabadip ran into an issue where verdi computer test would hang.

The reason is that verdi computer test runs qstat -f on this computer, which simply produces enormous output.
qstat -f -u<username> works just fine.

We should make sure that the username is passed here:
https://github.com/aiidateam/aiida_core/blob/521b77824c0e066f5ba0f58045b98f7a0269b9ef/aiida/cmdline/commands/cmd_computer.py#L65

For comparison, see what is done here:
https://github.com/aiidateam/aiida_core/blob/521b77824c0e066f5ba0f58045b98f7a0269b9ef/aiida/engine/processes/calcjobs/manager.py#L86-L91
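
A minimal sketch of the proposed change, assuming the job list command is built as a list of arguments. The function name `build_joblist_command` and its `user` parameter are hypothetical and only illustrate the idea; this is not the code at either link.

```python
def build_joblist_command(user=None):
    """Return the qstat arguments, restricted to one user when `user` is given.

    Hypothetical helper for illustration only.
    """
    command = ['qstat', '-f']
    if user:
        # Same form as in the issue text: qstat -f -u<username>
        command.append('-u{}'.format(user))
    return command


print(build_joblist_command())
print(build_joblist_command('alice'))
```

Without a user the command stays as it is today, so the change would only affect callers that pass a username.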

@sphuber sphuber added the aiida-core 1.x, good first issue, topic/schedulers, topic/verdi and type/bug labels Jun 15, 2019
@sphuber sphuber added this to the v1.0.1 milestone Jun 17, 2019
@ezpzbz
Member

ezpzbz commented Jul 15, 2019

Hi @ltalirz,
I just found an issue with this solution. After submitting calculations, I found that the daemon pauses the process:

518  1h ago     NetworkCalculation   ⏸ Waiting        Pausing after failed transport task: update_calculation failed 5 times consecutively

I inspected the daemon log and found that it tries to get the job information but it fails:

 File "/storage/brno9-ceitec/home/pezhman/projects/git_repos/aiida-core-1.0.0b4/aiida/schedulers/plugins/pbsbaseclasses.py", line 404, in _parse_joblist_output
    raise SchedulerParsingError("I did not find the header for the first job")
aiida.schedulers.scheduler.SchedulerParsingError: I did not find the header for the first job
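
For illustration, a parser that splits full-format output on per-job header lines fails in this way when no such header is present. The sketch below is not the aiida-core parser: the function name, the `Job Id:` header text (as printed by typical `qstat -f` output) and both sample outputs are assumptions made up for the example.

```python
# Invented sample outputs, for illustration only.
FULL_SAMPLE = (
    "Job Id: 1.server\n"
    "    Job_Name = first\n"
    "Job Id: 2.server\n"
    "    Job_Name = second\n"
)
TABLE_SAMPLE = (
    "Job ID    Username  Queue\n"
    "--------  --------  -----\n"
    "1.server  alice     short\n"
)


def split_full_format_jobs(stdout):
    """Group the lines of full-format output per job, keyed on 'Job Id:' headers."""
    jobs = []
    current = None
    for line in stdout.splitlines():
        if line.startswith('Job Id:'):
            # A new job block starts at every header line.
            current = [line]
            jobs.append(current)
        elif current is not None and line.strip():
            current.append(line)
    if stdout.strip() and not jobs:
        # Non-empty output without any header cannot be split into jobs.
        raise ValueError('I did not find the header for the first job')
    return jobs


print(len(split_full_format_jobs(FULL_SAMPLE)))
```

Passing `TABLE_SAMPLE` to the same function raises the error, because a tabular listing contains no per-job header line.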

The reason is that qstat -f -u<username> does not produce the same detailed output as qstat -f does. It needs the extra -w flag to do so.
Another point concerns HPC centers with several servers, like one of ours. In that case, a job may be executed on a server other than the default one, and qstat then returns empty output. Therefore, my current command line in

command = ['qstat', '-f']

looks like:

command = ['qstat', '-f', '-w', '@<server1> @<server2> @<server3>']

I also timed qstat -f and qstat -f -w -u<username>, which in my case take ~10s and ~0.1s, respectively.
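
The pieces discussed above (the -w flag, the username filter and the extra servers) could be combined as in the following sketch. The helper name and its parameters are hypothetical, and the server names in the example call are placeholders. Unlike the single string used above, each `@<server>` is kept as its own list element here, which is an assumption about how the list is later turned into a command line.

```python
def build_multiserver_joblist_command(user=None, servers=(), wide=False):
    """Return qstat arguments with optional -w, -u<username> and @<server> targets.

    Hypothetical helper for illustration only.
    """
    command = ['qstat', '-f']
    if wide:
        # Reported in this thread as required for full output together with -u
        command.append('-w')
    if user:
        command.append('-u{}'.format(user))
    # One argument per server, e.g. '@server1'
    command.extend('@{}'.format(server) for server in servers)
    return command


print(build_multiserver_joblist_command('alice', ['server1', 'server2'], wide=True))
```

Keeping the servers as separate arguments makes no difference if the list is joined with spaces, but it avoids passing them as one argument if the list is ever executed without a shell.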

@ltalirz
Member Author

ltalirz commented Jul 15, 2019

@pzarabadip Thanks for the update!

> The reason is that qstat -f -u<username> does not produce the same detailed output as qstat -f does. It needs the extra -w flag to do so.

Interesting... which version of pbspro are you running?
It seems this lower-case -w flag is not documented for older versions of pbspro (?)
Or is this the same as -W?

> Another point concerns HPC centers with several servers, like one of ours. In that case, a job may be executed on a server other than the default one, and qstat then returns empty output.

Ok, I guess this setup is not very common and we've never encountered it before.
@giovannipizzi Do you think it makes sense to include some optional extra string (@<server1>) for the pbspro class?

@ezpzbz
Member

ezpzbz commented Jul 15, 2019

@ltalirz I am using pbs_version = 19.0.0, which does not have -W.
I just tried this solution on pbs_version = PBSPro_13.1.1.162303 and it did not work, but luckily qstat -f is fast enough there to do the job.

@ltalirz
Member Author

ltalirz commented Jul 15, 2019

@pzarabadip Ok, it looks like v14.1 already has the -w flag:
https://github.com/PBSPro/pbspro/blob/v14.1.0/doc/man1/qstat.1B#L63

Is it that the open source version has it while the closed-source version doesn't?
We could add a new scheduler plugin pbspro-open or something like this

@ezpzbz
Member

ezpzbz commented Jul 15, 2019

@ltalirz Indeed, both versions list the -w flag in the help, but PBSPro_13.1.1.162303 does not honor it the way the other version does and only prints the list of jobs in long format.
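
Since the flag is listed on both versions but only honored on one, one possible way to cope without a separate plugin is to check the output rather than the version. The sketch below tries candidate commands in order and keeps the first whose output is empty or contains a per-job header. All names are hypothetical, `run` is an injected callable standing in for whatever executes the command, and the `Job Id:` header text is an assumption about full-format output.

```python
def looks_like_full_format(stdout):
    """True for empty output or output containing at least one 'Job Id:' header."""
    if not stdout.strip():
        return True
    return any(line.startswith('Job Id:') for line in stdout.splitlines())


def first_usable_joblist(candidates, run):
    """Return (command, stdout) for the first candidate with parsable output.

    `run` is any callable that takes the argument list and returns stdout as a string.
    """
    for command in candidates:
        stdout = run(command)
        if looks_like_full_format(stdout):
            return command, stdout
    raise RuntimeError('no candidate qstat command produced full-format output')


# Fake outputs standing in for a scheduler that ignores -w (invented for the example).
fake_outputs = {
    ('qstat', '-f', '-w', '-ualice'): 'Job ID    Username\n1.server  alice\n',
    ('qstat', '-f'): 'Job Id: 1.server\n    Job_Name = first\n',
}
command, stdout = first_usable_joblist(
    [['qstat', '-f', '-w', '-ualice'], ['qstat', '-f']],
    lambda cmd: fake_outputs[tuple(cmd)],
)
print(command)
```

The cost of this approach is one extra qstat call on schedulers where the first candidate does not produce full-format output.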

@ltalirz ltalirz removed the good first issue label Feb 3, 2020
@sphuber sphuber removed this from the v1.1.1 milestone Feb 24, 2020