Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[batch] Rename batches tables to job groups #13810

Merged
merged 3 commits into from
Nov 30, 2023

Conversation

jigold
Copy link
Contributor

@jigold jigold commented Oct 13, 2023

Stacked on #13475. This PR renames the following tables to have job groups in the name instead of batches. Note that this PR needs to shutdown the batch deployment (offline migration). I'm not 100% sure this is necessary, but I want to avoid a case where MJC of the database migration job cannot happen thus deadlocking the system.

RENAME TABLE batch_attributes TO job_group_attributes,
             batches_cancelled TO job_groups_cancelled,
             batches_inst_coll_staging TO job_groups_inst_coll_staging, 
             batch_inst_coll_cancellable_resources TO job_group_inst_coll_cancellable_resources, 
             aggregated_batch_resources_v2 TO aggregated_job_group_resources_v2,
             aggregated_batch_resources_v3 TO aggregated_job_group_resources_v3, 
             batches_n_jobs_in_complete_states TO job_groups_n_jobs_in_complete_states;

@jigold
Copy link
Contributor Author

jigold commented Oct 30, 2023

Closing for now as there's too many merge conflicts with existing PRs.

@jigold jigold closed this Oct 30, 2023
@jigold jigold reopened this Nov 7, 2023
@jigold jigold marked this pull request as ready for review November 7, 2023 18:33
@jigold
Copy link
Contributor Author

jigold commented Nov 9, 2023

Disregard. I think the conflicting code already went in... hard to keep track of these things.

Copy link
Contributor

@daniel-goldstein daniel-goldstein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything checks out to me, feels like we're so close!

@jigold
Copy link
Contributor Author

jigold commented Nov 14, 2023

I looked at this again and I think we can do this with online: true. It's a quick enough migration where it shouldn't impact the driver for too long that it's trying to query the long tables. If there's a problem and the driver can't make forward progress once the database migration is done, we can just shut down the deployment to 0 replicas. Thoughts?

@daniel-goldstein
Copy link
Contributor

Doesn't the batch driver need to schedule the deploy_batch job to update itself? How can it eventually get to deploying the new / compatible code?

@jigold
Copy link
Contributor Author

jigold commented Nov 15, 2023

I think we need it to be offline unless we're willing to tolerate up to 5-10 mins of not being able to cancel a batch and some alerts. The only parts that would be referencing the wrong tables are in the Canceller and notify_batch_complete. I think scheduling and MJC would just work because we update those stored procedures and don't change the child code.

We can shut batch down though for the migration. Seems safest although more of a pain.

@daniel-goldstein
Copy link
Contributor

need it to be offline unless we're willing to tolerate up to 5-10 mins of not being able to cancel a batch and some alerts

I want as much as possible for alerts to mean something unexpected is going wrong. If we can I think this should just be done offline.

@jigold jigold removed the WIP label Nov 30, 2023
@danking danking merged commit 1dedf3c into hail-is:main Nov 30, 2023
danking pushed a commit that referenced this pull request Feb 26, 2024
This PR adds the job groups functionality as described in this
[RFC](hail-is/hail-rfcs#5) to the Batch backend
and `hailtop.batch_client`. This includes supporting nested job groups
up to a maximum depth of 5. Note, that none of these changes are
user-facing yet (hence no change log here).

The PRs that came before this one:
- #13475 
- #13487 
- #13810 (note that this database migration required a shutdown)

Subsequent PRs will need to implement the following:
- Querying job groups with the flexible query language (v2)
- Implementing job groups in the Scala Client for QoB
- Using job groups in QoB with `cancel_after_n_failures=1` for all new
stages of worker jobs
- UI functionality to page and sort through job groups
- A new `hailtop.batch` interface for users to define and work with Job
Groups

A couple of nuances in the implementation came up that I also tried to
articulate in the RFC:
1. A root job group with ID = 0 does not belong to an update
("update_id" IS NULL). This means that any checks that look for
"committed" job groups need to do `(batch_updates.committed OR
job_groups.job_group_id = %s)` where "%s" is the ROOT_JOB_GROUP_ID.
2. When job groups are cancelled, only the specific job group that was
cancelled is inserted into `job_groups_cancelled`. This table does
**NOT** contain all transitive job groups that were also cancelled
indirectly. The reason for this is we cannot guarantee that a user
wouldn't have millions of job groups and we can't insert millions of
records inside a single SQL stored procedure. Now, any query on the
driver / front_end must look up the tree and see if any parent has been
cancelled. This code looks similar to the code below [1].
3. There used to be `DELETE FROM` statements in `commit_batch_update`
and `commit_batch` that cleaned up old records that were no longer used
in `job_group_inst_coll_cancellable_resources` and
`job_groups_inst_coll_staging`. This cleanup now occurs in a periodic
loop on the driver.
4. The `job_group_inst_coll_cancellable_resources` and
`job_groups_inst_coll_staging` tables have values which represent the
sum of all child job groups. For example, if a job group has 1 job and
it's child job group has 2 jobs, then the staging table would have
n_jobs = 3 for the parent job group and n_jobs = 2 for the child job
group. Likewise, all of the billing triggers and MJC have to use the
`job_group_self_and_ancestors` table to modify the job group the job
belongs to as well its parent job groups.

[1] Code to check whether a job group has been cancelled.

```mysql
SELECT job_groups.*,
  cancelled_t.cancelled IS NOT NULL AS cancelled
FROM job_groups
LEFT JOIN LATERAL (
  SELECT 1 AS cancelled
  FROM job_group_self_and_ancestors
  INNER JOIN job_groups_cancelled
    ON job_group_self_and_ancestors.batch_id = job_groups_cancelled.id AND
      job_group_self_and_ancestors.ancestor_id = job_groups_cancelled.job_group_id
  WHERE job_groups.batch_id = job_group_self_and_ancestors.batch_id AND
    job_groups.job_group_id = job_group_self_and_ancestors.job_group_id
) AS cancelled_t ON TRUE
WHERE ...
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants