This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Forever running job caused by out of sync between api server and database #5065

Open
hzy46 opened this issue Nov 9, 2020 · 1 comment

hzy46 commented Nov 9, 2020

Issue Description:

Sometimes the etcd data can be corrupted or lost, e.g. a DRI deletes some frameworks manually, or all of the data is lost. Most running jobs have requestSynced=true, so the database controller assumes they are already synchronized with the API server and no longer checks them. If their records have actually been deleted from the API server, these jobs stay in Running status forever, and the API server and the database remain out of sync for them.

Workaround & suggestion:

In most cases, the admin/DRI should not touch framework data in the API server directly. Any add/update/delete should go through rest-server. If the admin leaves the etcd data untouched, this issue can be avoided.

However, if this issue does happen, the workaround is:

  1. Connect to the database manually:
apt update
apt install postgresql-client
# default user/password is root/rootpass
psql -h <PAI-master-ip> -U <user> -W openpai
  2. Set requestSynced=false for the affected jobs:
UPDATE frameworks SET "requestSynced"=false WHERE <please select the jobs>

If all the data in etcd is lost, use the following SQL statement:

UPDATE frameworks SET "requestSynced"=false WHERE "requestSynced"=true and "apiServerDeleted"=false and "subState" != 'Completed'
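
Before running the bulk UPDATE above, it may be worth previewing which rows it would touch. The following is only a sketch: it uses an in-memory SQLite table as a stand-in for the real PostgreSQL `frameworks` table, and the reduced column set, the sample rows, and the helper names (`build_demo_db`, `preview`, `mark_unsynced`) are all assumptions made for illustration.

```python
import sqlite3

# Same predicate as the UPDATE statement above.
WHERE_CLAUSE = (
    '"requestSynced"=true and "apiServerDeleted"=false '
    "and \"subState\" != 'Completed'"
)


def build_demo_db():
    """Create a toy frameworks table (assumed, simplified schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE frameworks ("
        '"name" TEXT PRIMARY KEY, "requestSynced" BOOLEAN, '
        '"apiServerDeleted" BOOLEAN, "subState" TEXT)'
    )
    conn.executemany(
        "INSERT INTO frameworks VALUES (?, ?, ?, ?)",
        [
            ("job-running", True, False, "AttemptRunning"),
            ("job-completed", True, False, "Completed"),
            ("job-deleted", True, True, "AttemptRunning"),
            ("job-pending-sync", False, False, "AttemptRunning"),
        ],
    )
    return conn


def preview(conn):
    """List the names of the rows the UPDATE would modify."""
    rows = conn.execute(
        f'SELECT "name" FROM frameworks WHERE {WHERE_CLAUSE} ORDER BY "name"'
    ).fetchall()
    return [row[0] for row in rows]


def mark_unsynced(conn):
    """Run the UPDATE and return the number of modified rows."""
    cur = conn.execute(
        f'UPDATE frameworks SET "requestSynced"=false WHERE {WHERE_CLAUSE}'
    )
    conn.commit()
    return cur.rowcount
```

Against the real database, the same SELECT with the same WHERE clause can be run in psql first to check the selection before issuing the UPDATE.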

Possible solutions for this problem:

  1. Provide a recover-from-database mode. If the admin loses all data, they can manually turn this mode on.
    In this mode, we run UPDATE frameworks SET "requestSynced"=false WHERE "requestSynced"=true and "apiServerDeleted"=false and "subState" != 'Completed' on the user's behalf.

  2. When the framework watcher starts, it lists all framework objects from the API server. We can compare them with the frameworks in the database. If a framework satisfies all of the following conditions, we can set its requestSynced=false: (a) apiServerDeleted=false; (b) requestSynced=true; (c) state!=Completed; (d) the records in the database and the API server differ, or the API server record is missing.

  3. Do 2 periodically in the database poller. Pro: the issue can be handled during normal operation. Con: it adds overhead.
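
The comparison described in solution 2 could look roughly like the sketch below. It is not the database controller's actual code: the `DbFramework` record, the `find_out_of_sync` function, and the idea of comparing specs as plain dictionaries are assumptions for illustration. The completion check uses a `sub_state` field, on the assumption that it corresponds to the "subState" column used in the SQL statements above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DbFramework:
    """Hypothetical, simplified view of one row in the frameworks table."""
    name: str
    request_synced: bool
    api_server_deleted: bool
    sub_state: str
    spec: dict = field(default_factory=dict)


def find_out_of_sync(
    db_frameworks: List[DbFramework],
    api_server_specs: Dict[str, dict],
) -> List[str]:
    """Return names of frameworks whose requestSynced should be reset to false.

    api_server_specs maps framework name -> spec as listed from the API server.
    """
    stale = []
    for fw in db_frameworks:
        # Conditions 1 and 2: only rows believed to be live and in sync.
        if fw.api_server_deleted or not fw.request_synced:
            continue
        # Condition 3: completed frameworks are left alone.
        if fw.sub_state == "Completed":
            continue
        # Condition 4: the API server record is missing or differs.
        live_spec = api_server_specs.get(fw.name)
        if live_spec is None or live_spec != fw.spec:
            stale.append(fw.name)
    return stale
```

Running the same function from the poller would correspond to solution 3; the cost is one full listing and comparison per polling round.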


hzy46 commented Nov 9, 2020

Another problem related to out of sync:

  1. If someone updates the spec of a job that has already completed, the job is re-created in the API server.

  2. This is caused by the short-cut in the merge writer.

  3. Currently this problem is minor, because rest-server can only update one field in the job spec: set spec.executionType = 'Stop'. This only causes the job to be stopped and deleted in the API server.

We can either: 1. reject any request that modifies the job spec after the job is completed, or 2. accept the request but not sync it to the API server.
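
The two options can be contrasted with a small sketch. Everything here is hypothetical: the record is a plain dictionary, the names `CompletedJobPolicy`, `SpecUpdateRejected`, and `apply_spec_update` do not come from the code base, and it assumes that setting requestSynced to false is what triggers a sync to the API server.

```python
from enum import Enum


class CompletedJobPolicy(Enum):
    REJECT = "reject"
    ACCEPT_WITHOUT_SYNC = "accept_without_sync"


class SpecUpdateRejected(Exception):
    """Raised when a spec update targets a completed job under REJECT."""


def apply_spec_update(record: dict, new_spec: dict, policy: CompletedJobPolicy) -> dict:
    """Return an updated copy of a (hypothetical) database record."""
    updated = dict(record)
    if record["subState"] == "Completed":
        if policy is CompletedJobPolicy.REJECT:
            # Option 1: refuse to modify a completed job.
            raise SpecUpdateRejected(record["name"])
        # Option 2: store the new spec but leave requestSynced untouched,
        # so nothing is pushed to the API server and the job is not re-created.
        updated["spec"] = new_spec
        return updated
    # Job still active: store the spec and request a sync as usual.
    updated["spec"] = new_spec
    updated["requestSynced"] = False
    return updated
```

Option 1 surfaces the problem to the caller, while option 2 keeps the request path unchanged at the cost of the database holding a spec that was never applied.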

@suiguoxin suiguoxin mentioned this issue Nov 16, 2020