Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX: daemon hang when passing None as job_id #4967

Merged
merged 1 commit into from
Jul 21, 2021

Conversation

ramirezfranciscof
Copy link
Member

Will fix #4752.

The behaviour fix

I implemented the catch in the get_detailed_job_info method of the abstract Scheduler class. If I'm not mistaken, this alone was enough to fix the problem. @atztogo can you please check that I understood correctly and this branch solves your issue?

I didn't add any raising mechanism on the specific implementation of _get_detailed_job_info_command inside the SgeScheduler class because it seemed too specific a fix (why are we not adding checks in the implementations of this method in other plugins? why are we not adding checks for type or explicit typing in the other methods of the plugin?), but I can open an issue to deal with this in a more holistic/complete way.

Unit testing an abstract class

I also added the test for both the new and the typical behaviours of get_detailed_job_info. I couldn't find any solid agreed upon way of easily unit testing abstract classes. In the end I decided to go with this way of fully overriding all __abstractmethods__ because it was the simpler one for this case, although it might be less ideal for more complex situations (it was already a bit annoying having to lamdba inline override the __init__ and mock the transport).

I didn't want to define a full manual "mock" class with an implementation for each abstract method of the Scheduler (even if it was just a pass), and I couldn't find any satisfying explanation for the mocking features/classes/methods of unittest (meaning nothing that allowed me to understand it deep enough to be comfortable using it, but without having to go through the full documentation, which could be worth it but would require more time). Also I wasn't sure if it was fair play to keep adding stuff from that module or if we should be favouring pytest only solutions for unit testing (such as subclassing TestCase).

I call upon thee higher beings of python if you have any feedback/advice on this topic ( @sphuber @dev-zero @greschd @chrisjsewell ).

@dev-zero
Copy link
Contributor

dev-zero commented Jun 1, 2021

Personally I'd rather handle the case of job_id being a None before calling scheduler.get_detailed_job_info in task_retrieve_job since there one still has access to the node object itself and can possibly determine whether or not node.get_job_id() returning a None is an error or not, while get_detailed_job_info does not know about the node object at all and should only assume that job_id is valid in the context of a schedulers job id, hence one should add an assert isinstance(job_id, str), "..." in get_detailed_job_info before the call to the schedulers implementation to guarantee that it is some form of valid job id.

@sphuber
Copy link
Contributor

sphuber commented Jun 1, 2021

I tend to agree with @dev-zero . The do_retrieve inline function of task_retrieve_job should check the return value of node.get_job_id() and check for None. If this is the case, it should log a warning and say it cannot retrieve the detailed job info. It also raises the question how it was possible that the job id was None at all? What is the exact error case that you are addressing here? Where was this encountered and what were the details?

Edit: I see now in the discussion of the corresponding issue that this arises because this was not an actual run. It was actually an "immigrant" run so it is faking a job and just using the machinery to retrieve the results of an already completed job (maybe even run without AiiDA). In that case the job_id would indeed probably be None, but I think the point stands that the call to get_detailed_job_info should never even be made. Really this shows why we should have an official manner of immigrating completed jobs because this wouldn't even require a warning of the missing job id as that is to be expected in this use case.

@giovannipizzi
Copy link
Member

If indeed we can avoid to call the get_detailed_job_info at all (and in this case it would be good anyway to raise if the method is called with None as a parameter), this would be better. However, I don't know if there is a simple way to do it that would also solve the original issue?

@sphuber
Copy link
Contributor

sphuber commented Jun 3, 2021

However, I don't know if there is a simple way to do it that would also solve the original issue?

What I suggested would work just fine no? The job_id is None because the infrastructure is used to import an already completed job (this is what the VaspImmigrator is doing). The only change we need to apply is to update the retrieve task to check the return value of node.get_job_id() and if it is None it logs a warning, saying that it is not calling get_detailed_job_info because that would be pointless if there is no job id. As @dev-zero said, by handling this at the caller side, we can determine whether the job id being absent is a problem or not. In this case it is not because it is expected. It is important to note that this particular example is because the VaspImmigrator is "abusing" the CalcJob infrastructure to do immigrations of completed jobs. In my AEP, this is also exactly what I implemented, but then there is a flag on the engine such that it knows that it concerns an immigration and not a normal job. This additional information will allow us to switch between a info/warning of the detailed job info not being collected in the case of an immigration, or switch to an actual error in the case of an actual job. We cannot do this now do since this concept of immigration hasn't officially been implemented yet.

@ramirezfranciscof
Copy link
Member Author

(1) Ok, from getting back to this, I understand that the solution proposed would be to do something like this, correct?

    async def do_retrieve():
        with transport_queue.request_transport(authinfo) as request:
            transport = await cancellable.with_interrupt(request)

            # Perform the job accounting and set it on the node if successful. If the scheduler does not implement this
            # still set the attribute but set it to `None`. This way we can distinguish calculation jobs for which the
            # accounting was called but could not be set.
            scheduler = node.computer.get_scheduler()  # type: ignore[union-attr]
            scheduler.set_transport(transport)

#---------- NEW START
            if node.get_job_id() is None:
                logger.warning(f'there is no job id for retrieving CalcJob<{node.pk}>: skipping get_detailed_job_info')
                return execmanager.retrieve_calculation(node, transport, retrieved_temporary_folder)
#---------- NEW END

            try:
                detailed_job_info = scheduler.get_detailed_job_info(node.get_job_id())
            except FeatureNotAvailable:
                logger.info(f'detailed job info not available for scheduler of CalcJob<{node.pk}>')
                node.set_detailed_job_info(None)
            else:
                node.set_detailed_job_info(detailed_job_info)

            return execmanager.retrieve_calculation(node, transport, retrieved_temporary_folder)

(2) I still feel like this is a problem, no?

In an instance of SGE scheduler, when get_detailed_job_info(self, job_id) is called with job_id is None, the aiida daemon gets stuck at Waiting for transport task: retrieve.

In the sense that there is something down the pipeline that does not know how to react to some unexpected input and gets the process stuck instead of, for example, raising some error with explanations. I'm not sure if the hang is inside get_detailed_job_info or something else that receives that output. I would anyways set a default behavior for receiving None as job_id in get_detailed_job_info, even if it is some error raise.

@sphuber
Copy link
Contributor

sphuber commented Jul 8, 2021

In the sense that there is something down the pipeline that does not know how to react to some unexpected input and gets the process stuck instead of, for example, raising some error with explanations.

It does raise an error with explanation though. The process doesn't get "stuck" it is just that the exception that gets raised in the context of a transport task and so the EBM kicks in and tries to redo it. In this case, this is not a transient problem and after a few retries the process will simply be paused. If you run verdi process report it will show exactly what the exception was, so I don't think there is a problem here. The added conditional that you propose is therefore all that is really needed. Of course you can still add additional checks in the get_detailed_job_info as well, to check for None explicitly and raise an error, but I don't think that is necessary.

@ramirezfranciscof
Copy link
Member Author

Ah, I had understood from the original issue that the process just got stuck in that state and didn't report anything (the EBM should give up after 5 attempts no? I didn't get this from the quoted text).

I'll change this PR then.

@ramirezfranciscof ramirezfranciscof force-pushed the sgefix branch 2 times, most recently from f64b2dd to 4b09bff Compare July 8, 2021 10:21
@codecov
Copy link

codecov bot commented Jul 8, 2021

Codecov Report

Merging #4967 (37fecb8) into develop (da179dc) will increase coverage by 0.01%.
The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #4967      +/-   ##
===========================================
+ Coverage    80.12%   80.12%   +0.01%     
===========================================
  Files          515      515              
  Lines        36742    36745       +3     
===========================================
+ Hits         29435    29438       +3     
  Misses        7307     7307              
Flag Coverage Δ
django 74.60% <100.00%> (-<0.01%) ⬇️
sqlalchemy 73.53% <100.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
aiida/engine/processes/calcjobs/tasks.py 65.78% <100.00%> (+0.35%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update da179dc...37fecb8. Read the comment docs.

@ramirezfranciscof ramirezfranciscof requested a review from sphuber July 8, 2021 12:50
In the retrieve task of `CalcJobs` the `get_detailed_job_info` was
called always. This would lead to problems if the node did not have an
associated job id. Normally this doesn't happen because without a job id
the engine would not even have been able to confirm that the job was
ready for retrieval, however, this can happen in artificial sitations
where the whole calcjob process is mocked and for example an already
completed job is passed through the system.

When there is no job id, the `get_detailed_job_info` method should not
be called because it requires the job id to get any information. Without
it, the method would except and since it is called within the
exponential backoff mechanism, the job would get stuck in the paused
state.
Copy link
Contributor

@sphuber sphuber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sphuber sphuber merged commit 2d51386 into aiidateam:develop Jul 21, 2021
@sphuber sphuber deleted the sgefix branch July 21, 2021 08:07
sphuber pushed a commit that referenced this pull request Aug 8, 2021
In the retrieve task of `CalcJobs` the `get_detailed_job_info` was
called always. This would lead to problems if the node did not have an
associated job id. Normally this doesn't happen because without a job id
the engine would not even have been able to confirm that the job was
ready for retrieval, however, this can happen in artificial situations
where the whole calcjob process is mocked and for example an already
completed job is passed through the system.

When there is no job id, the `get_detailed_job_info` method should not
be called because it requires the job id to get any information. Without
it, the method would except and since it is called within the
exponential backoff mechanism, the job would get stuck in the paused
state.

Cherry-pick: 2d51386
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

On behaviour of get_detailed_job_info
4 participants