FIX: daemon hang when passing `None` as `job_id` #4967

ramirezfranciscof · 2021-06-01T13:27:21Z

Will fix #4752.

The behaviour fix

I implemented the catch in the get_detailed_job_info method of the abstract Scheduler class. If I'm not mistaken, this alone was enough to fix the problem. @atztogo can you please check that I understood correctly and this branch solves your issue?

I didn't add any raising mechanism on the specific implementation of _get_detailed_job_info_command inside the SgeScheduler class because it seemed too specific a fix (why are we not adding checks in the implementations of this method in other plugins? why are we not adding checks for type or explicit typing in the other methods of the plugin?), but I can open an issue to deal with this in a more holistic/complete way.

Unit testing an abstract class

I also added the test for both the new and the typical behaviours of get_detailed_job_info. I couldn't find any solid agreed upon way of easily unit testing abstract classes. In the end I decided to go with this way of fully overriding all __abstractmethods__ because it was the simpler one for this case, although it might be less ideal for more complex situations (it was already a bit annoying having to lamdba inline override the __init__ and mock the transport).

I didn't want to define a full manual "mock" class with an implementation for each abstract method of the Scheduler (even if it was just a pass), and I couldn't find any satisfying explanation for the mocking features/classes/methods of unittest (meaning nothing that allowed me to understand it deep enough to be comfortable using it, but without having to go through the full documentation, which could be worth it but would require more time). Also I wasn't sure if it was fair play to keep adding stuff from that module or if we should be favouring pytest only solutions for unit testing (such as subclassing TestCase).

I call upon thee higher beings of python if you have any feedback/advice on this topic ( @sphuber @dev-zero @greschd @chrisjsewell ).

dev-zero · 2021-06-01T13:44:13Z

Personally I'd rather handle the case of job_id being a None before calling scheduler.get_detailed_job_info in task_retrieve_job since there one still has access to the node object itself and can possibly determine whether or not node.get_job_id() returning a None is an error or not, while get_detailed_job_info does not know about the node object at all and should only assume that job_id is valid in the context of a schedulers job id, hence one should add an assert isinstance(job_id, str), "..." in get_detailed_job_info before the call to the schedulers implementation to guarantee that it is some form of valid job id.

sphuber · 2021-06-01T13:50:05Z

I tend to agree with @dev-zero . The do_retrieve inline function of task_retrieve_job should check the return value of node.get_job_id() and check for None. If this is the case, it should log a warning and say it cannot retrieve the detailed job info. It also raises the question how it was possible that the job id was None at all? What is the exact error case that you are addressing here? Where was this encountered and what were the details?

Edit: I see now in the discussion of the corresponding issue that this arises because this was not an actual run. It was actually an "immigrant" run so it is faking a job and just using the machinery to retrieve the results of an already completed job (maybe even run without AiiDA). In that case the job_id would indeed probably be None, but I think the point stands that the call to get_detailed_job_info should never even be made. Really this shows why we should have an official manner of immigrating completed jobs because this wouldn't even require a warning of the missing job id as that is to be expected in this use case.

giovannipizzi · 2021-06-02T20:42:09Z

If indeed we can avoid to call the get_detailed_job_info at all (and in this case it would be good anyway to raise if the method is called with None as a parameter), this would be better. However, I don't know if there is a simple way to do it that would also solve the original issue?

sphuber · 2021-06-03T06:19:50Z

However, I don't know if there is a simple way to do it that would also solve the original issue?

What I suggested would work just fine no? The job_id is None because the infrastructure is used to import an already completed job (this is what the VaspImmigrator is doing). The only change we need to apply is to update the retrieve task to check the return value of node.get_job_id() and if it is None it logs a warning, saying that it is not calling get_detailed_job_info because that would be pointless if there is no job id. As @dev-zero said, by handling this at the caller side, we can determine whether the job id being absent is a problem or not. In this case it is not because it is expected. It is important to note that this particular example is because the VaspImmigrator is "abusing" the CalcJob infrastructure to do immigrations of completed jobs. In my AEP, this is also exactly what I implemented, but then there is a flag on the engine such that it knows that it concerns an immigration and not a normal job. This additional information will allow us to switch between a info/warning of the detailed job info not being collected in the case of an immigration, or switch to an actual error in the case of an actual job. We cannot do this now do since this concept of immigration hasn't officially been implemented yet.

ramirezfranciscof · 2021-07-08T09:08:50Z

(1) Ok, from getting back to this, I understand that the solution proposed would be to do something like this, correct?

    async def do_retrieve():
        with transport_queue.request_transport(authinfo) as request:
            transport = await cancellable.with_interrupt(request)

            # Perform the job accounting and set it on the node if successful. If the scheduler does not implement this
            # still set the attribute but set it to `None`. This way we can distinguish calculation jobs for which the
            # accounting was called but could not be set.
            scheduler = node.computer.get_scheduler()  # type: ignore[union-attr]
            scheduler.set_transport(transport)

#---------- NEW START
            if node.get_job_id() is None:
                logger.warning(f'there is no job id for retrieving CalcJob<{node.pk}>: skipping get_detailed_job_info')
                return execmanager.retrieve_calculation(node, transport, retrieved_temporary_folder)
#---------- NEW END

            try:
                detailed_job_info = scheduler.get_detailed_job_info(node.get_job_id())
            except FeatureNotAvailable:
                logger.info(f'detailed job info not available for scheduler of CalcJob<{node.pk}>')
                node.set_detailed_job_info(None)
            else:
                node.set_detailed_job_info(detailed_job_info)

            return execmanager.retrieve_calculation(node, transport, retrieved_temporary_folder)

(2) I still feel like this is a problem, no?

In an instance of SGE scheduler, when get_detailed_job_info(self, job_id) is called with job_id is None, the aiida daemon gets stuck at Waiting for transport task: retrieve.

In the sense that there is something down the pipeline that does not know how to react to some unexpected input and gets the process stuck instead of, for example, raising some error with explanations. I'm not sure if the hang is inside get_detailed_job_info or something else that receives that output. I would anyways set a default behavior for receiving None as job_id in get_detailed_job_info, even if it is some error raise.

sphuber · 2021-07-08T10:03:07Z

In the sense that there is something down the pipeline that does not know how to react to some unexpected input and gets the process stuck instead of, for example, raising some error with explanations.

It does raise an error with explanation though. The process doesn't get "stuck" it is just that the exception that gets raised in the context of a transport task and so the EBM kicks in and tries to redo it. In this case, this is not a transient problem and after a few retries the process will simply be paused. If you run verdi process report it will show exactly what the exception was, so I don't think there is a problem here. The added conditional that you propose is therefore all that is really needed. Of course you can still add additional checks in the get_detailed_job_info as well, to check for None explicitly and raise an error, but I don't think that is necessary.

ramirezfranciscof · 2021-07-08T10:16:35Z

Ah, I had understood from the original issue that the process just got stuck in that state and didn't report anything (the EBM should give up after 5 attempts no? I didn't get this from the quoted text).

I'll change this PR then.

codecov · 2021-07-08T10:42:08Z

Codecov Report

Merging #4967 (37fecb8) into develop (da179dc) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@             Coverage Diff             @@
##           develop    #4967      +/-   ##
===========================================
+ Coverage    80.12%   80.12%   +0.01%     
===========================================
  Files          515      515              
  Lines        36742    36745       +3     
===========================================
+ Hits         29435    29438       +3     
  Misses        7307     7307

Flag	Coverage Δ
django	`74.60% <100.00%> (-<0.01%)`	⬇️
sqlalchemy	`73.53% <100.00%> (+0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
aiida/engine/processes/calcjobs/tasks.py	`65.78% <100.00%> (+0.35%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update da179dc...37fecb8. Read the comment docs.

In the retrieve task of `CalcJobs` the `get_detailed_job_info` was called always. This would lead to problems if the node did not have an associated job id. Normally this doesn't happen because without a job id the engine would not even have been able to confirm that the job was ready for retrieval, however, this can happen in artificial sitations where the whole calcjob process is mocked and for example an already completed job is passed through the system. When there is no job id, the `get_detailed_job_info` method should not be called because it requires the job id to get any information. Without it, the method would except and since it is called within the exponential backoff mechanism, the job would get stuck in the paused state.

sphuber

Thanks @ramirezfranciscof .

In the retrieve task of `CalcJobs` the `get_detailed_job_info` was called always. This would lead to problems if the node did not have an associated job id. Normally this doesn't happen because without a job id the engine would not even have been able to confirm that the job was ready for retrieval, however, this can happen in artificial situations where the whole calcjob process is mocked and for example an already completed job is passed through the system. When there is no job id, the `get_detailed_job_info` method should not be called because it requires the job id to get any information. Without it, the method would except and since it is called within the exponential backoff mechanism, the job would get stuck in the paused state. Cherry-pick: 2d51386

atztogo mentioned this pull request Jul 4, 2021

On behaviour of get_detailed_job_info #4752

Closed

2 tasks

ramirezfranciscof force-pushed the sgefix branch 2 times, most recently from f64b2dd to 4b09bff Compare July 8, 2021 10:21

ramirezfranciscof requested a review from sphuber July 8, 2021 12:50

sphuber force-pushed the sgefix branch from 4b09bff to 37fecb8 Compare July 21, 2021 07:46

sphuber approved these changes Jul 21, 2021

View reviewed changes

sphuber merged commit 2d51386 into aiidateam:develop Jul 21, 2021

sphuber deleted the sgefix branch July 21, 2021 08:07

chrisjsewell mentioned this pull request Mar 12, 2022

🔀 MERGE: Release v2.0.0b1 #5426

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FIX: daemon hang when passing `None` as `job_id` #4967

FIX: daemon hang when passing `None` as `job_id` #4967

ramirezfranciscof commented Jun 1, 2021

dev-zero commented Jun 1, 2021 •

edited

Loading

sphuber commented Jun 1, 2021 •

edited

Loading

giovannipizzi commented Jun 2, 2021

sphuber commented Jun 3, 2021 •

edited

Loading

ramirezfranciscof commented Jul 8, 2021

sphuber commented Jul 8, 2021

ramirezfranciscof commented Jul 8, 2021

codecov bot commented Jul 8, 2021 •

edited

Loading

sphuber left a comment

FIX: daemon hang when passing None as job_id #4967

FIX: daemon hang when passing None as job_id #4967

Conversation

ramirezfranciscof commented Jun 1, 2021

The behaviour fix

Unit testing an abstract class

dev-zero commented Jun 1, 2021 • edited Loading

sphuber commented Jun 1, 2021 • edited Loading

giovannipizzi commented Jun 2, 2021

sphuber commented Jun 3, 2021 • edited Loading

ramirezfranciscof commented Jul 8, 2021

sphuber commented Jul 8, 2021

ramirezfranciscof commented Jul 8, 2021

codecov bot commented Jul 8, 2021 • edited Loading

Codecov Report

sphuber left a comment

Choose a reason for hiding this comment

FIX: daemon hang when passing `None` as `job_id` #4967

FIX: daemon hang when passing `None` as `job_id` #4967

dev-zero commented Jun 1, 2021 •

edited

Loading

sphuber commented Jun 1, 2021 •

edited

Loading

sphuber commented Jun 3, 2021 •

edited

Loading

codecov bot commented Jul 8, 2021 •

edited

Loading