-
Notifications
You must be signed in to change notification settings - Fork 248
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[batch] Rename batches tables to job groups #13810
Conversation
07681e3
to
cdeefbf
Compare
Closing for now as there's too many merge conflicts with existing PRs. |
cdeefbf
to
6d1f131
Compare
Disregard. I think the conflicting code already went in... hard to keep track of these things. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything checks out to me, feels like we're so close!
I looked at this again and I think we can do this with |
Doesn't the batch driver need to schedule the |
I think we need it to be offline unless we're willing to tolerate up to 5-10 mins of not being able to cancel a batch and some alerts. The only parts that would be referencing the wrong tables are in the We can shut batch down though for the migration. Seems safest although more of a pain. |
I want as much as possible for alerts to mean something unexpected is going wrong. If we can I think this should just be done offline. |
This PR adds the job groups functionality as described in this [RFC](hail-is/hail-rfcs#5) to the Batch backend and `hailtop.batch_client`. This includes supporting nested job groups up to a maximum depth of 5. Note, that none of these changes are user-facing yet (hence no change log here). The PRs that came before this one: - #13475 - #13487 - #13810 (note that this database migration required a shutdown) Subsequent PRs will need to implement the following: - Querying job groups with the flexible query language (v2) - Implementing job groups in the Scala Client for QoB - Using job groups in QoB with `cancel_after_n_failures=1` for all new stages of worker jobs - UI functionality to page and sort through job groups - A new `hailtop.batch` interface for users to define and work with Job Groups A couple of nuances in the implementation came up that I also tried to articulate in the RFC: 1. A root job group with ID = 0 does not belong to an update ("update_id" IS NULL). This means that any checks that look for "committed" job groups need to do `(batch_updates.committed OR job_groups.job_group_id = %s)` where "%s" is the ROOT_JOB_GROUP_ID. 2. When job groups are cancelled, only the specific job group that was cancelled is inserted into `job_groups_cancelled`. This table does **NOT** contain all transitive job groups that were also cancelled indirectly. The reason for this is we cannot guarantee that a user wouldn't have millions of job groups and we can't insert millions of records inside a single SQL stored procedure. Now, any query on the driver / front_end must look up the tree and see if any parent has been cancelled. This code looks similar to the code below [1]. 3. There used to be `DELETE FROM` statements in `commit_batch_update` and `commit_batch` that cleaned up old records that were no longer used in `job_group_inst_coll_cancellable_resources` and `job_groups_inst_coll_staging`. This cleanup now occurs in a periodic loop on the driver. 4. The `job_group_inst_coll_cancellable_resources` and `job_groups_inst_coll_staging` tables have values which represent the sum of all child job groups. For example, if a job group has 1 job and it's child job group has 2 jobs, then the staging table would have n_jobs = 3 for the parent job group and n_jobs = 2 for the child job group. Likewise, all of the billing triggers and MJC have to use the `job_group_self_and_ancestors` table to modify the job group the job belongs to as well its parent job groups. [1] Code to check whether a job group has been cancelled. ```mysql SELECT job_groups.*, cancelled_t.cancelled IS NOT NULL AS cancelled FROM job_groups LEFT JOIN LATERAL ( SELECT 1 AS cancelled FROM job_group_self_and_ancestors INNER JOIN job_groups_cancelled ON job_group_self_and_ancestors.batch_id = job_groups_cancelled.id AND job_group_self_and_ancestors.ancestor_id = job_groups_cancelled.job_group_id WHERE job_groups.batch_id = job_group_self_and_ancestors.batch_id AND job_groups.job_group_id = job_group_self_and_ancestors.job_group_id ) AS cancelled_t ON TRUE WHERE ... ```
Stacked on #13475. This PR renames the following tables to have job groups in the name instead of batches. Note that this PR needs to shutdown the batch deployment (offline migration). I'm not 100% sure this is necessary, but I want to avoid a case where MJC of the database migration job cannot happen thus deadlocking the system.