-
Notifications
You must be signed in to change notification settings - Fork 192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX: daemon hang when passing None
as job_id
#4967
Conversation
Personally I'd rather handle the case of |
I tend to agree with @dev-zero . The Edit: I see now in the discussion of the corresponding issue that this arises because this was not an actual run. It was actually an "immigrant" run so it is faking a job and just using the machinery to retrieve the results of an already completed job (maybe even run without AiiDA). In that case the |
If indeed we can avoid to call the get_detailed_job_info at all (and in this case it would be good anyway to raise if the method is called with |
What I suggested would work just fine no? The |
(1) Ok, from getting back to this, I understand that the solution proposed would be to do something like this, correct? async def do_retrieve():
with transport_queue.request_transport(authinfo) as request:
transport = await cancellable.with_interrupt(request)
# Perform the job accounting and set it on the node if successful. If the scheduler does not implement this
# still set the attribute but set it to `None`. This way we can distinguish calculation jobs for which the
# accounting was called but could not be set.
scheduler = node.computer.get_scheduler() # type: ignore[union-attr]
scheduler.set_transport(transport)
#---------- NEW START
if node.get_job_id() is None:
logger.warning(f'there is no job id for retrieving CalcJob<{node.pk}>: skipping get_detailed_job_info')
return execmanager.retrieve_calculation(node, transport, retrieved_temporary_folder)
#---------- NEW END
try:
detailed_job_info = scheduler.get_detailed_job_info(node.get_job_id())
except FeatureNotAvailable:
logger.info(f'detailed job info not available for scheduler of CalcJob<{node.pk}>')
node.set_detailed_job_info(None)
else:
node.set_detailed_job_info(detailed_job_info)
return execmanager.retrieve_calculation(node, transport, retrieved_temporary_folder) (2) I still feel like this is a problem, no?
In the sense that there is something down the pipeline that does not know how to react to some unexpected input and gets the process stuck instead of, for example, raising some error with explanations. I'm not sure if the hang is inside |
It does raise an error with explanation though. The process doesn't get "stuck" it is just that the exception that gets raised in the context of a transport task and so the EBM kicks in and tries to redo it. In this case, this is not a transient problem and after a few retries the process will simply be paused. If you run |
Ah, I had understood from the original issue that the process just got stuck in that state and didn't report anything (the EBM should give up after 5 attempts no? I didn't get this from the quoted text). I'll change this PR then. |
f64b2dd
to
4b09bff
Compare
Codecov Report
@@ Coverage Diff @@
## develop #4967 +/- ##
===========================================
+ Coverage 80.12% 80.12% +0.01%
===========================================
Files 515 515
Lines 36742 36745 +3
===========================================
+ Hits 29435 29438 +3
Misses 7307 7307
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
In the retrieve task of `CalcJobs` the `get_detailed_job_info` was called always. This would lead to problems if the node did not have an associated job id. Normally this doesn't happen because without a job id the engine would not even have been able to confirm that the job was ready for retrieval, however, this can happen in artificial sitations where the whole calcjob process is mocked and for example an already completed job is passed through the system. When there is no job id, the `get_detailed_job_info` method should not be called because it requires the job id to get any information. Without it, the method would except and since it is called within the exponential backoff mechanism, the job would get stuck in the paused state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ramirezfranciscof .
In the retrieve task of `CalcJobs` the `get_detailed_job_info` was called always. This would lead to problems if the node did not have an associated job id. Normally this doesn't happen because without a job id the engine would not even have been able to confirm that the job was ready for retrieval, however, this can happen in artificial situations where the whole calcjob process is mocked and for example an already completed job is passed through the system. When there is no job id, the `get_detailed_job_info` method should not be called because it requires the job id to get any information. Without it, the method would except and since it is called within the exponential backoff mechanism, the job would get stuck in the paused state. Cherry-pick: 2d51386
Will fix #4752.
The behaviour fix
I implemented the catch in the
get_detailed_job_info
method of the abstractScheduler
class. If I'm not mistaken, this alone was enough to fix the problem. @atztogo can you please check that I understood correctly and this branch solves your issue?I didn't add any raising mechanism on the specific implementation of
_get_detailed_job_info_command
inside theSgeScheduler
class because it seemed too specific a fix (why are we not adding checks in the implementations of this method in other plugins? why are we not adding checks for type or explicit typing in the other methods of the plugin?), but I can open an issue to deal with this in a more holistic/complete way.Unit testing an abstract class
I also added the test for both the new and the typical behaviours of
get_detailed_job_info
. I couldn't find any solid agreed upon way of easily unit testing abstract classes. In the end I decided to go with this way of fully overriding all__abstractmethods__
because it was the simpler one for this case, although it might be less ideal for more complex situations (it was already a bit annoying having tolamdba
inline override the__init__
and mock the transport).I didn't want to define a full manual "mock" class with an implementation for each abstract method of the
Scheduler
(even if it was just apass
), and I couldn't find any satisfying explanation for the mocking features/classes/methods ofunittest
(meaning nothing that allowed me to understand it deep enough to be comfortable using it, but without having to go through the full documentation, which could be worth it but would require more time). Also I wasn't sure if it was fair play to keep adding stuff from that module or if we should be favouringpytest
only solutions for unit testing (such as subclassingTestCase
).I call upon thee higher beings of python if you have any feedback/advice on this topic ( @sphuber @dev-zero @greschd @chrisjsewell ).