[batch] Add job group id as a primary key to job group related tables #13985
Conversation
Force-pushed from e038f55 to f2b569c
@jigold https://dev.mysql.com/doc/refman/8.0/en/innodb-online-ddl-operations.html#online-ddl-primary-key-operations
So, replacing the primary key can be done in place while still allowing data manipulation on the table. It sounds like we should use the syntax:
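A rough sketch of the in-place primary key swap the linked docs allow, using a hypothetical table and key (not the actual migration in this PR). Dropping and re-adding the primary key in the same ALTER TABLE permits ALGORITHM=INPLACE, so concurrent DML keeps working while the table is rebuilt:

-- hypothetical table and key columns, for illustration only
ALTER TABLE job_group_attributes
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (batch_id, job_group_id, `key`),
  ALGORITHM=INPLACE, LOCK=NONE;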
Force-pushed from f2b569c to f706a04
Tests are passing now!!!! Ready for review. Remember this is stacked on another PR.
This was less intense than I thought it was going to be; this is looking great! I think I need a bit more context on the batch_updates change, but that was the only thing that threw me.
batch/sql/estimated-current.sql (outdated)
@@ -229,6 +229,8 @@ CREATE TABLE IF NOT EXISTS `batch_updates` (
   `token` VARCHAR(100) DEFAULT NULL,
   `start_job_id` INT NOT NULL,
   `n_jobs` INT NOT NULL,
+  `start_job_group_id` INT NOT NULL DEFAULT 0,
+  `n_job_groups` INT NOT NULL DEFAULT 1,
I might not yet have the full context, but I'm having a hard time seeing the use case for reserving chunks of job group ids during an update, so it's hard for me to make a strong argument for introducing this complexity right away. Is there a way to defer this change for now and leave job groups out of batch updates?
We can defer the change for now, but we're going to eventually need it for QoB. I was trying to get all of the needed database changes done here, even if they aren't used yet. Up to you, whichever is easiest.
Can you explain why it's necessary for QoB? That might inform whether I think it's good to go now or later.
You're right that for QoB each update will only have one job group at a time. However, we'll eventually want to support adding multiple job groups in a single update. If we don't add this now, then we'll need to do a migration later to add these fields when we're ready for multiple job groups.
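To make the reserved-chunk idea concrete, here is a rough sketch, with illustrative names, of how start_job_group_id / n_job_groups could mirror the existing start_job_id / n_jobs bookkeeping when an update is created (this is not the actual migration or front-end code):

-- illustrative only: each new update reserves the next contiguous block of
-- job group ids within the batch, just as start_job_id / n_jobs does for jobs
SELECT COALESCE(MAX(start_job_group_id + n_job_groups), 1) INTO next_start_job_group_id
FROM batch_updates
WHERE batch_id = in_batch_id;

INSERT INTO batch_updates (batch_id, update_id, start_job_id, n_jobs, start_job_group_id, n_job_groups)
VALUES (in_batch_id, in_update_id, in_start_job_id, in_n_jobs, next_start_job_group_id, in_n_job_groups);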
Note that this decision and the reasoning for simplifying it are going to change #14011 and the scope of that change. If that's the case, then I need to pivot ASAP, as that code already puts in all of the changes necessary. So maybe take a look at that PR first (still fixing some issues, but QoB tests are passing), because that is going to dictate what is in here.
I think the answer to what to put in #14011 and the MVP is going to dictate what to do in the other two places.
batch/sql/estimated-current.sql (outdated)
IF job_group_cancelled THEN
  SIGNAL SQLSTATE '45000' SET MESSAGE_TEXT = "job group has already been cancelled";
END IF;
END $$
This seems to be a different design than the one that we currently have for batches, where the front end checks if a batch is cancelled and errors the request prior to inserting jobs. Why can't we follow that same approach here?
The issue is a job group can be cancelled, but the batch isn't necessarily cancelled. Again, this could happen once we start adding new job groups. I can defer the change, but it's eventually needed. I was trying to get all of the remaining database changes done at once.
Why can't the front end just also check that the job group isn't cancelled?
Sure. We could do that, but without adding a cache, we'd be making the same query over and over again. I thought this was simpler, but we could also get rid of that complexity for now. However, thinking about this more, what happens if a job group is cancelled in the middle of inserting the bunch of jobs?
I think this is the exact same problem that we currently have with batches (we query for the cancellation status of a batch on every operation that could update it like inserting jobs). I'd rather have two of the same problem than two different approaches. We could add a cache for both later or change the strategy for both, but I'd rather not have two different approaches at the same time. Regarding the cancellation, that's why we select the batch for update before inserting jobs, so cancellation would only happen after the jobs are entered. I presume we could do the same for job groups.
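For reference, a rough sketch of the ordering being described, with illustrative table names (the real front-end queries may differ): the parent row is locked before the jobs are inserted, so a concurrent cancellation serializes either entirely before or entirely after the insert.

-- inside the insert-jobs transaction: lock the batch row first
SELECT 1 FROM batches WHERE id = in_batch_id FOR UPDATE;
-- ... insert the bunch of jobs, then commit ...

-- the hypothetical job-group analogue would lock the job group row instead
SELECT 1 FROM job_groups WHERE batch_id = in_batch_id AND job_group_id = in_job_group_id FOR UPDATE;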
I think we need this trigger here. The issue is if we check whether the job group is cancelled when creating each job, we can't take a lock on the job_groups_cancelled
table for that job group if it doesn't exist. I'm not sure why we don't have this same problem with batches. Maybe we do have the same issue and should fix it separately.
record = await db.select_and_fetchone(
    '''
SELECT 1
FROM job_groups_cancelled
WHERE id = %s AND job_group_id = %s
LOCK IN SHARE MODE;
''',
    # only takes a shared lock if the row exists; if the job group isn't
    # cancelled, there is no row in job_groups_cancelled to lock
    (batch_id, job_group_id),
)
batch/sql/estimated-current.sql (outdated)
INSERT INTO job_group_self_and_ancestors (batch_id, job_group_id, ancestor_id, level)
SELECT new_ancestors.batch_id, new_ancestors.job_group_id, new_ancestors.ancestor_id, new_ancestors.level
FROM (
  SELECT batch_id, job_group_id, MIN(ancestor_id) AS last_known_ancestor, MAX(level) AS last_known_level
it's unclear to me what "last" means. Is this closest to the leaf or the root?
This would be from walking up the tree starting at the leaf, asking what is the last known parent, grandparent, great-grandparent, etc. If we reach the root job group id in this query, there's nothing to fill in. Otherwise, we need to fill in the parents of the last known ancestor for that job group.
Unfortunately, this complexity is needed because we need to support adding job groups in parallel, just like we do with jobs, and there's no guarantee the job groups will be inserted in the same chunk as their parents or in the same order.
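For concreteness, an assumed example (ids and levels are illustrative; level 0 is taken to mean the group itself, increasing toward the root): for a chain of job groups 0 (root) -> 1 -> 2, the closure table ends up with one row per (group, ancestor) pair.

-- assumed contents of job_group_self_and_ancestors for the chain 0 -> 1 -> 2:
--   (job_group_id, ancestor_id, level)
--   (0, 0, 0)
--   (1, 1, 0), (1, 0, 1)
--   (2, 2, 0), (2, 1, 1), (2, 0, 2)
-- If group 2's bunch lands before group 1's rows are complete, only (2, 2, 0) and
-- (2, 1, 1) exist at first; MIN(ancestor_id) / MAX(level) pick out the last known
-- ancestor (1, at level 1), and the insert above copies that ancestor's own
-- ancestors to finish group 2's rows.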
“we need to support adding job groups in parallel just like we do with jobs”
How come? We control the client here; how many job groups would someone reasonably be making in a single request? It doesn't seem unreasonable to me, if someone has many little groups (which seems like an anti-pattern given the discussions we were having), to just make those up front and then submit jobs (which presumably are many more than the groups) in parallel.
Siwei's use case would have benefitted from making 30K job groups, one for each transcript or gene in the genome.
Ah ok, and those were arbitrarily nested? I think what's odd to me is not the creation of job groups in parallel but the fact that a parent job group might not exist prior to the addition of a child job group. I'm having a hard time imagining such a deep tree all created up front that it would be a performance requirement to be able to add a child before its parent.
Think about how jobs are created. We do that in 6-way parallelism. There's no guarantee on the order of the API calls.
await bounded_gather(
    *[
        functools.partial(self._submit_jobs, update_id, bunch, size, progress_task)
        for bunch, size in zip(byte_job_specs_bunches, bunch_sizes)
    ],
    parallelism=6,
    cancel_on_error=True,
)
Force-pushed from 0d8ba69 to 904a045
This is awesome! Thank you for taking the time to make those changes. I just have one last question. I might just be misunderstanding, or if the answer is “that functionality is coming later”, that's totally fine! I don't believe it's a matter of correctness for this change.
batch/sql/finalize-job-groups.sql (outdated)
  job_group_self_and_ancestors.ancestor_id = in_job_group_id AND
  batch_updates.committed;

INSERT INTO job_groups_cancelled (id, job_group_id) VALUES (in_batch_id, in_job_group_id);
Should we also be inserting any child groups of this job group?
Excellent catch! Had this in a different branch...
INSERT INTO batches_cancelled
SELECT batch_id, job_group_id
FROM job_group_parents
WHERE batch_id = in_batch_id AND parent_id = in_job_group_id
ON DUPLICATE KEY UPDATE job_group_id = batches_cancelled.job_group_id;
For reference in case it's helpful, this is the full combined working prototype branch with all changes including nested job groups: https://github.com/jigold/hail/tree/job-groups-v3
Force-pushed from 01e7fae to 9d2970d
Closing in favor of #14170
Stacked on #13810