
[24.1] Fix job paused on user defined object store #19578

Conversation

davelopez (Contributor)

Fixes #19577

This PR delays the quota check until the point where we know the object_store_id of the job, so we can properly check for quotas.

Thanks @mvdbeek for hinting in that direction; it seems to work 👍

FixRunToolOnUserStorage.mp4

How to test the changes?

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

We don't have the object_store_id until we get to this point, and we need the object_store_id to check whether the target object store is subject to quota or not.
Since they are not subject to Galaxy quotas
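
The gist of the fix, as a minimal sketch (the helper names below are hypothetical, not the actual Galaxy job wrapper or quota agent API): resolve the job's object_store_id first, then decide whether a quota check applies at all.

```python
# Rough sketch only: `is_subject_to_quota` and `is_over_quota` are
# placeholder names, not the real Galaxy object store / quota agent API.

def check_quota_after_store_resolution(job, object_store, quota_agent, pause):
    """Run the over-quota check only once the job's object_store_id is known."""
    object_store_id = job.object_store_id  # resolved during enqueue, not before

    # User-defined (and other quota-exempt) stores are skipped entirely,
    # so jobs targeting them are never paused for quota reasons.
    if not object_store.is_subject_to_quota(object_store_id):
        return

    if quota_agent.is_over_quota(job.user, object_store_id):
        pause(job, message="Execution of this dataset's job is paused")
```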
@@ -1510,7 +1510,7 @@ def pause(self, job=None, message=None):
         job = self.get_job()
         if message is None:
             message = "Execution of this dataset's job is paused"
-        if job.state == job.states.NEW:
+        if job.state in (job.states.NEW, job.states.QUEUED):
Member

Can a queued job that is already handed over to the job scheduler be paused?

davelopez (Contributor, Author)

Not really, but the "over-quota pause check" is now done during the enqueue process, and at that point the state is already queued.
I'll see if I can delay the status change until this check has been made, without other consequences 👍

davelopez (Contributor, Author)

I ended up refactoring a bit more than I would have liked for a fix, but it seems reasonable, since otherwise the logging would be inconsistent and report that the job was dispatched when it wasn't.

mvdbeek (Member), Feb 11, 2025

> Can a queued job that is already handed over to the job scheduler be paused?

Yes it can, at any point in a job's lifetime. I don't know if that means some of the refactoring can be undone?

davelopez (Contributor, Author)

OK, thanks for the clarification! I will drop the additional refactoring commits.

Member

@mvdbeek but does that mean that the job scheduler needs to remove the job from the queue? If so, should we not avoid that? I just want to understand how this works :)

mvdbeek (Member), Feb 11, 2025

Paused jobs won't be picked up by the handler loop; they're excluded from the ready-to-run query.
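
A toy illustration of that exclusion (not the actual Galaxy handler query, which is considerably more involved), assuming a minimal SQLAlchemy `Job` model:

```python
from sqlalchemy import Column, Integer, String, create_engine, select
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Job(Base):
    __tablename__ = "job"
    id = Column(Integer, primary_key=True)
    state = Column(String)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all([Job(state="new"), Job(state="queued"), Job(state="paused")])
    # The handler only grabs jobs whose state is in its ready list;
    # "paused" is simply never in that list, so paused jobs stay untouched.
    ready_states = ("new", "queued")
    ready = session.scalars(select(Job).where(Job.state.in_(ready_states))).all()
    print(sorted(job.state for job in ready))  # ['new', 'queued'], no 'paused'
```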

@davelopez davelopez marked this pull request as draft February 11, 2025 08:13
@davelopez davelopez marked this pull request as ready for review February 11, 2025 09:40
@@ -1604,23 +1606,28 @@ def get_destination_configuration(self, key, default=None):

     def enqueue(self):
         job = self.get_job()
         # Change to queued state before handing to worker thread so the runner won't pick it up again
mvdbeek (Member), Feb 11, 2025

This needs to be the first thing to happen to avoid races between worker threads. It is a no-op state basically.
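
In other words, the ordering is what matters. A short illustrative sketch (names are placeholders, not the real Galaxy handler code): flip the state before doing anything slow, so another pass over ready jobs cannot dispatch the same job twice.

```python
# Illustrative only: the point is the ordering, not the specific API.
def enqueue_sketch(job, dispatch):
    # 1) Mark the job as queued first. Assuming the handler loop only picks
    #    up jobs in a "ready" state such as "new", this is essentially a
    #    no-op state change that stops a second dispatch of the same job.
    job.state = "queued"

    # 2) Only then do the slower work: resolve the destination/object store,
    #    run the (now-delayed) over-quota check, and hand off to a worker.
    dispatch(job)
```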

mvdbeek (Member) commented Feb 11, 2025

The state in c2ed250 looked good to me; I am not so sure about the remaining commits. It should be possible to write an integration test for this, and I think that would be quite helpful to base any changes against.

davelopez (Contributor, Author)

I will try to work on an integration test. I was unsure about setting up a user-defined object store for the test, but I will investigate more.

@davelopez davelopez marked this pull request as draft February 11, 2025 11:37
@davelopez davelopez force-pushed the 24.1_fix_job_paused_on_user_defined_object_store branch from ed26b64 to c2ed250 Compare February 11, 2025 11:38
mvdbeek (Member) commented Feb 11, 2025

I don't think it needs to be a user object store; it just needs to be a non-default object store. "scratch" should also do. This might be helpful as a starting point: https://github.com/galaxyproject/galaxy/blob/dev/test/integration/objectstore/test_selection_with_user_preferred_object_store.py#L1
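
A rough outline of what such a test could look like (all fixture and helper names below are hypothetical placeholders; the linked test_selection_with_user_preferred_object_store.py is the real pattern to follow):

```python
# Hypothetical sketch; `galaxy_interactor`, `set_user_quota`, and
# `run_tool_and_wait` are placeholders, not real Galaxy test helpers.
def test_job_not_paused_on_quota_exempt_store(galaxy_interactor):
    # Give the user a tiny quota so a job on the default store would be paused.
    galaxy_interactor.set_user_quota("1 byte")

    # Run a simple tool, but target a non-default (e.g. "scratch") or
    # user-defined object store that is not subject to Galaxy quotas.
    job = run_tool_and_wait(
        galaxy_interactor,
        tool_id="cat1",
        preferred_object_store_id="scratch",
    )

    # The fix under test: the over-quota check now uses the job's resolved
    # object_store_id, so this job should run rather than be paused.
    assert job["state"] != "paused"
```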

davelopez (Contributor, Author)

Thanks! I found a way to use user-defined object stores by looking at some existing tests from John. I think it is important to check user-defined because this change affects those in particular, but I can add another test for non-default object stores too.

mvdbeek (Member) commented Feb 11, 2025

No need for a separate test IMO. Whether you use a user-defined store or not, the important thing is to delay the over-quota check until we have the object store id for the job.

@davelopez davelopez marked this pull request as ready for review February 11, 2025 14:00
davelopez (Contributor, Author)

Then it should be ready for review :)

@mvdbeek mvdbeek merged commit c3b6cbf into galaxyproject:release_24.1 Feb 11, 2025
46 of 52 checks passed
@davelopez davelopez deleted the 24.1_fix_job_paused_on_user_defined_object_store branch February 11, 2025 16:11