[batch] Add job group id as a primary key to job group related tables #13985

Closed
wants to merge 14 commits from the add-job-group-keys branch

Conversation

@jigold jigold commented Nov 7, 2023

Stacked on #13810

jigold commented Nov 9, 2023

cc: @daniel-goldstein

@jigold jigold force-pushed the add-job-group-keys branch 2 times, most recently from e038f55 to f2b569c on November 9, 2023 at 17:27

danking commented Nov 9, 2023

@jigold https://dev.mysql.com/doc/refman/8.0/en/innodb-online-ddl-operations.html#online-ddl-primary-key-operations

| Operation | Instant | In Place | Rebuilds Table | Permits Concurrent DML | Only Modifies Metadata |
| --- | --- | --- | --- | --- | --- |
| Adding a primary key | No | Yes* | Yes* | Yes | No |
| Dropping a primary key | No | No | Yes | No | No |
| Dropping a primary key and adding another | No | Yes | Yes | Yes | No |

So, replacing the primary key can be done in place while allowing data manipulation on the table. It sounds like we should use the syntax:

ALTER TABLE tbl_name DROP PRIMARY KEY, ADD PRIMARY KEY (column), ALGORITHM=INPLACE, LOCK=NONE;
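
Applied to one of the job group tables here, that might look like the following (a hypothetical sketch; the actual table and key columns in this PR may differ):

    -- Hypothetical: swap the primary key in place without blocking concurrent DML.
    ALTER TABLE job_groups
      DROP PRIMARY KEY,
      ADD PRIMARY KEY (batch_id, job_group_id),
      ALGORITHM=INPLACE, LOCK=NONE;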

@jigold jigold force-pushed the add-job-group-keys branch from f2b569c to f706a04 on November 9, 2023 at 19:51
@jigold jigold marked this pull request as ready for review November 13, 2023 19:50

jigold commented Nov 13, 2023

Tests are passing now!!!! Ready for review. Remember this is stacked on another PR.

@daniel-goldstein daniel-goldstein left a comment

This was less intense than I thought it was going to be; this is looking great! I think I need a bit more context on the batch_updates changes, but that was the only thing that threw me.

@@ -229,6 +229,8 @@ CREATE TABLE IF NOT EXISTS `batch_updates` (
`token` VARCHAR(100) DEFAULT NULL,
`start_job_id` INT NOT NULL,
`n_jobs` INT NOT NULL,
`start_job_group_id` INT NOT NULL DEFAULT 0,
`n_job_groups` INT NOT NULL DEFAULT 1,

@daniel-goldstein daniel-goldstein Nov 15, 2023

I might not yet have the full context, but I'm having a hard time thinking of the use case here of reserving chunks of job group ids during an update, so it's hard for me to think of a strong argument for introducing this complexity right away. Is there a way to defer this change for now and leave job groups out of batch updates?

Contributor Author

We can defer the change for now, but we're going to eventually need it for QoB. I was trying to get all of the needed database changes done here, even if they're not used yet. Up to you what's easiest.

Contributor

Can you explain why it's necessary for QoB? That might inform whether I think it's good to go now or later.

Contributor Author

You're right that for QoB each update will only have one job group at a time. However, we'll eventually want to support adding multiple job groups in a single update. If we don't add this now, then we'll need to do a migration later to add these fields when we're ready to support multiple job groups.
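
For illustration only (a hedged sketch, not code from this PR; the variable names are made up): the idea is that each update reserves a contiguous block of job group IDs, just as start_job_id / n_jobs reserve a block of job IDs, so the client can assign in-update IDs locally.

    -- Hypothetical: pick where the next update's job group ID range starts,
    -- mirroring the existing start_job_id / n_jobs bookkeeping.
    SELECT COALESCE(MAX(start_job_group_id + n_job_groups), 1)
    INTO @next_start_job_group_id
    FROM batch_updates
    WHERE batch_id = @in_batch_id
    FOR UPDATE;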

Note that this decision and the reasoning for simplifying it are going to change #14011 and the scope of that change. If that's the case, then I need to pivot ASAP, as that code already puts in all of the necessary changes. So maybe take a look at that PR first (I'm still fixing some issues, but the QoB tests are passing), because it is going to dictate what goes in here.

Contributor Author

I think the answer to what to put in #14011 and the MVP is going to dictate what to do in the other two places.

IF job_group_cancelled THEN
SIGNAL SQLSTATE '45000' SET MESSAGE_TEXT = "job group has already been cancelled";
END IF;
END $$

Contributor

This seems to be a different design than the one that we currently have for batches, where the front end checks if a batch is cancelled and errors the request prior to inserting jobs. Why can't we follow that same approach here?

Contributor Author

The issue is a job group can be cancelled, but the batch isn't necessarily cancelled. Again, this could happen once we start adding new job groups. I can defer the change, but it's eventually needed. I was trying to get all of the remaining database changes done at once.

Contributor

Why can't the front end just also check that the job group isn't cancelled?

@jigold jigold Nov 15, 2023

Sure. We could do that, but without adding a cache, we'd be making the same query over and over again. I thought this was simpler, but we could also get rid of that complexity for now. However, thinking about this more, what happens if a job group is cancelled in the middle of inserting a bunch of jobs?

@daniel-goldstein daniel-goldstein Nov 15, 2023

I think this is the exact same problem that we currently have with batches (we query for the cancellation status of a batch on every operation that could update it, like inserting jobs). I'd rather have two instances of the same problem than two different approaches. We could add a cache for both later or change the strategy for both, but I'd rather not have two different approaches at the same time. Regarding cancellation, that's why we select the batch FOR UPDATE before inserting jobs, so cancellation can only happen after the jobs are entered. I presume we could do the same for job groups.
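
(A hedged sketch of the locking pattern described above; the table and variable names are illustrative rather than the exact schema:)

    -- Take a row lock on the batch for the duration of the insert-jobs transaction;
    -- a concurrent cancellation then has to wait until the new jobs are committed.
    SELECT 1 FROM batches WHERE id = @in_batch_id FOR UPDATE;
    -- ... check the cancellation status and insert the bunch of jobs here ...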

Contributor Author

I think we need this trigger here. The issue is if we check whether the job group is cancelled when creating each job, we can't take a lock on the job_groups_cancelled table for that job group if it doesn't exist. I'm not sure why we don't have this same problem with batches. Maybe we do have the same issue and should fix it separately.

            record = await db.select_and_fetchone(
                '''
SELECT 1
FROM job_groups_cancelled
WHERE id = %s AND job_group_id = %s
LOCK IN SHARE MODE;
''',
                (batch_id, job_group_id),
            )

INSERT INTO job_group_self_and_ancestors (batch_id, job_group_id, ancestor_id, level)
SELECT new_ancestors.batch_id, new_ancestors.job_group_id, new_ancestors.ancestor_id, new_ancestors.level
FROM (
SELECT batch_id, job_group_id, MIN(ancestor_id) AS last_known_ancestor, MAX(level) AS last_known_level

Contributor

it's unclear to me what "last" means. Is this closest to the leaf or the root?

@jigold jigold Nov 15, 2023

This comes from walking up the tree starting at the leaf, asking what the last known parent, grandparent, great-grandparent, etc. is. If we reach the root job group id in this query, there's nothing to fill in. Otherwise, we need to fill in the parents of the last known ancestor for that job group.

Unfortunately, this complexity is needed because we need to support adding job groups in parallel, just like we do with jobs, and there's no guarantee that job groups will be inserted in the same chunk as their parents or in the same order.
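
(To make "last known ancestor" concrete, a made-up example of what the job_group_self_and_ancestors closure table holds, assuming level counts distance from the group itself:)

    -- Hypothetical rows for batch 7 with root group 0, child group 1, grandchild group 2.
    -- The closure table stores one row per (group, ancestor) pair; the grandchild's rows are:
    INSERT INTO job_group_self_and_ancestors (batch_id, job_group_id, ancestor_id, level)
    VALUES (7, 2, 2, 0),  -- itself
           (7, 2, 1, 1),  -- its parent
           (7, 2, 0, 2);  -- the root
    -- If group 2 is inserted before group 1's ancestry is known, only the first two rows
    -- exist at first; MIN(ancestor_id) / MAX(level) identify the last known ancestor so the
    -- remaining rows can be filled in once the parent's ancestors are available.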

Contributor

> we need to support adding job groups in parallel just like we do with jobs

How come? We control the client here, how many job groups would someone reasonably be making in a single request? It doesn't seem unreasonable to me if someone has many little groups (which seems like an anti-pattern given the discussions we were having) to just make those up front and then submit jobs (which presumably are many more than the groups they have) in parallel.

Contributor Author

Siwei's use case would have benefitted from making 30K job groups. One for each transcript or gene in the genome.

Contributor

Ah ok, and those were arbitrarily nested? I think what's odd to me is not the creation of job groups in parallel but the fact that a parent job group might not exist prior to the addition of a child job group. I'm having a hard time imagining such a deep tree all created up front that it would be a performance requirement to be able to add a child before its parent.

Contributor Author

Think about how jobs are created. We do that in 6-way parallelism. There's no guarantee on the order of the API calls.

        # Submit the bunches of job specs with 6-way parallelism; there is no
        # guarantee on the order in which the bunches are processed.
        await bounded_gather(
            *[functools.partial(self._submit_jobs, update_id, bunch, size, progress_task)
              for bunch, size in zip(byte_job_specs_bunches, bunch_sizes)
              ],
            parallelism=6,
            cancel_on_error=True,
        )

@daniel-goldstein daniel-goldstein left a comment

This is awesome! Thank you for taking the time to make those changes. I just have one last question. I might just be misunderstanding, and if the answer is "that functionality is coming later" that's totally fine! I don't believe it's a matter of correctness for this change.

job_group_self_and_ancestors.ancestor_id = in_job_group_id AND
batch_updates.committed;

INSERT INTO job_groups_cancelled (id, job_group_id) VALUES (in_batch_id, in_job_group_id);

Contributor

Should we also be inserting any child groups of this job group?

Contributor Author

Excellent catch! Had this in a different branch...

  INSERT INTO batches_cancelled
  SELECT batch_id, job_group_id
  FROM job_group_parents
  WHERE batch_id = in_batch_id AND parent_id = in_job_group_id
  ON DUPLICATE KEY UPDATE job_group_id = batches_cancelled.job_group_id;

jigold commented Nov 30, 2023

For reference in case it's helpful, this is the full combined working prototype branch with all changes including nested job groups: https://github.com/jigold/hail/tree/job-groups-v3

@jigold jigold force-pushed the add-job-group-keys branch from 01e7fae to 9d2970d on January 31, 2024 at 19:31

jigold commented Feb 1, 2024

Closing in favor of #14170

@jigold jigold closed this Feb 1, 2024