[24.1] Fix job paused on user defined object store #19578
Conversation
We don't have the object_store_id until we get to this point, and we need the object_store_id to check whether the target object store is subject to quota or not.
Since user-defined object stores are not subject to Galaxy quotas, jobs targeting them should not be paused by the over-quota check.
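In other words, the check only makes sense once the job's object_store_id is known. A minimal sketch of that idea (the names get_quota_source_info, quota_source_info.use, and quota_agent.is_over_quota are illustrative, not necessarily the real Galaxy API):

def should_pause_for_quota(job, object_store, quota_agent, user):
    """Decide whether a job should be paused because its user is over quota."""
    if job.object_store_id is None:
        # Too early: we do not yet know which store the outputs will land in.
        return False
    quota_source_info = object_store.get_quota_source_info(job.object_store_id)
    if not quota_source_info.use:
        # The target store is not tracked by Galaxy quotas
        # (e.g. a user-defined object store), so never pause for quota here.
        return False
    return quota_agent.is_over_quota(user, quota_source_info)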
@@ -1510,7 +1510,7 @@ def pause(self, job=None, message=None):
         job = self.get_job()
         if message is None:
             message = "Execution of this dataset's job is paused"
-        if job.state == job.states.NEW:
+        if job.state in (job.states.NEW, job.states.QUEUED):
Can a queued job that is already handed over to the job scheduler be paused?
Not really, but the "over-quota pause check" is now done during the enqueue process, and the state at that point is already queued.
I'll see if I can delay the state change until this check has been made, without other consequences 👍
I ended up refactoring a bit more than I would have liked for a fix, but it seems reasonable: otherwise the logging would be inconsistent and report that the job was dispatched when it wasn't.
Can a queued job that is already handed over to the job scheduler be paused?
Yes it can, at any point of a job's lifetime. I don't know if that means some of the refactoring can be undone?
OK, thanks for the clarification! I will drop the additional refactoring commits.
@mvdbeek but does that mean that the job scheduler needs to remove the job from the queue? If so, should we not avoid that? Just want to understand how this works :)
Paused jobs won't be picked up by the handler loop; they're excluded in the ready-to-run query.
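For context, a simplified sketch of what "excluded in the ready-to-run query" means (the real query in Galaxy's job handler is more involved; treat column choices and states here as approximations):

from sqlalchemy import select

from galaxy.model import Job


def ready_to_run_query(handler_id):
    # A job in the PAUSED state simply never matches this query again,
    # so nothing has to be removed from an in-memory queue.
    return (
        select(Job)
        .where(Job.handler == handler_id)
        .where(Job.state == Job.states.NEW)
    )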
lib/galaxy/jobs/__init__.py
@@ -1604,23 +1606,28 @@ def get_destination_configuration(self, key, default=None):

    def enqueue(self):
        job = self.get_job()
        # Change to queued state before handing to worker thread so the runner won't pick it up again
This needs to be the first thing to happen to avoid races between worker threads. It is a no-op state basically.
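Roughly, the ordering under discussion looks like this (a sketch only; _set_object_store_ids and is_over_quota are placeholders, not the actual enqueue() implementation):

def enqueue(job_wrapper, sa_session, is_over_quota):
    """Sketch of the ordering: QUEUED first, then quota check, then dispatch."""
    job = job_wrapper.get_job()
    # 1. Flip to QUEUED before anything else so a concurrent worker thread
    #    cannot pick the job up a second time; effectively a no-op state.
    job.state = job.states.QUEUED
    # 2. Resolve the object store now, so the quota check below knows whether
    #    the target store is subject to Galaxy quotas at all.
    job_wrapper._set_object_store_ids(job)  # placeholder for the real call
    sa_session.commit()
    # 3. The over-quota check runs only once object_store_id is known; jobs on
    #    user-defined (non-quota) stores should never be paused by it.
    if is_over_quota(job):
        job_wrapper.pause(job=job, message="Execution of this dataset's job is paused because you are over your disk quota")
        return False
    # 4. Finally hand the job over to the worker threads / runner.
    return True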
The state in c2ed250 looked good to me; I am not so sure about the remaining commits. It should be possible to write an integration test for this, which I think would be quite helpful to base any changes against.
I will try to work on an integration test. I was unsure about setting up a user-defined object store for the test, but I will investigate more.
I don't think it needs to be a user object store; it just needs to be a non-default object store. "scratch" should also do. This might be helpful as a starting point: https://github.com/galaxyproject/galaxy/blob/dev/test/integration/objectstore/test_selection_with_user_preferred_object_store.py#L1
Thanks! I found a way to use user-defined object stores by looking at some existing tests from John. I think it is important to check user-defined stores because this change affects those in particular, but I can add another test for non-default object stores too.
No need for a separate test IMO; whether you use a user-defined store or not, the important thing is to delay the over-quota check until we have the object store id for the job.
Then it should be ready for review :)
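A rough skeleton of what such an integration test could look like (heavily simplified; the config keys, populator helpers, and object_store_config_path attribute follow common Galaxy integration-test patterns but are assumptions, and the over-quota setup is only indicated in comments):

from galaxy_test.base.populators import DatasetPopulator
from galaxy_test.driver import integration_util


class TestJobNotPausedOnNonQuotaObjectStore(integration_util.IntegrationTestCase):
    @classmethod
    def handle_galaxy_config_kwds(cls, config):
        config["enable_quotas"] = True
        # Point Galaxy at an object store configuration that contains a
        # non-default store (e.g. "scratch") which is not subject to quotas.
        config["object_store_config_file"] = cls.object_store_config_path  # assumed attribute

    def test_job_completes_instead_of_pausing(self):
        dataset_populator = DatasetPopulator(self.galaxy_interactor)
        with dataset_populator.test_history() as history_id:
            # (Omitted: drive the user over quota and select the non-quota
            #  store as the preferred store for new jobs.)
            dataset_populator.new_dataset(history_id, content="1 2 3")
            # The job should reach "ok" rather than end up paused.
            dataset_populator.wait_for_history(history_id, assert_ok=True)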
Fixes #19577
This PR moves the quota check to the point where we know the object_store_id of the job, so we can properly check for quotas. Thanks @mvdbeek for hinting in that direction, it seems to work 👍
FixRunToolOnUserStorage.mp4
How to test the changes?
License