-
Notifications
You must be signed in to change notification settings - Fork 549
New RestServer Architecture: RestServer -> DB -> ApiServer; P0 Items #4761
Conversation
* fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * save * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * save * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix
value: "{{ cluster_cfg['database-controller']['k8s-connection-timeout-second'] }}" | ||
- name: WRITE_MERGER_CONNECTION_TIMEOUT_SECOND | ||
value: "{{ cluster_cfg['database-controller']['write-merger-connection-timeout-second'] }}" | ||
- name: RECOVERY_MODE_ENABLED |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add doc for all your flags? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. #Closed
- name: RECOVERY_MODE_ENABLED | ||
{% if cluster_cfg['database-controller']['recovery-mode'] %} | ||
value: "true" | ||
{% else %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If recovery-mode=false, but you found table is empty, will you still recover from ApiServer? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. The recovery mode is now initializer. #Closed
- name: K8S_CONNECTION_TIMEOUT_SECOND | ||
value: "{{ cluster_cfg['database-controller']['k8s-connection-timeout-second'] }}" | ||
- name: WRITE_MERGER_CONNECTION_TIMEOUT_SECOND | ||
value: "{{ cluster_cfg['database-controller']['write-merger-connection-timeout-second'] }}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And do you have a test case for the [Normal Mode], i.e. you will delete jobs which in apiserver but not in DB
In reply to: 466121708 [](ancestors = 466121708)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. #Closed
Disable FC GC: frameworkCompletedRetainSec: 2147483600 #Closed |
{}, | ||
snapshot.getAllUpdate(), | ||
addOns.getUpdate(), | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explicitly init requestSynced=false, Deleted=false? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
const record = _.assign(
{requestSynced: false, apiServerDeleted: false},
snapshot.getAllUpdate(),
addOns.getUpdate(),
); #Closed
'metadata.annotations', | ||
'spec', | ||
]); | ||
frameworkRequest.spec.executionType = `${executionType.charAt(0)}${executionType.slice(1).toLowerCase()}`; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No lock during this RMW, so update executionType may cause other's update during this period lost.
Can you move this RMW to write merge or let write merger support things like k8s patch? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently we only modify one field executionType
. So the lost-update problem won't happen.
I suggest to add a patch api in write merger in future to handle multiple-field updates. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to solve it in current PR, since we will also update tasknumber, completionpolicy etc.
If it is too hard, pls explain the reason, comment it and track in issue #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not hard. I will implement a patch interface using json merge patch
. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great. Given we only have a few update scenarios, your implementation does not need too complex or too general #Closed
attributes: ['snapshot'], | ||
where: {frameworkName: encodedFrameworkName}, | ||
order: [['attemptIndex', 'ASC']], | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to dedup here?
Such as use the latest snapshot for the same framework and attemptIndex #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use hash(frameworkName, attemptIndex, historyType) as primary key to solve this problem. #Closed
const historyFramework = await databaseModel.FrameworkHistory.findOne({ | ||
attributes: ['snapshot'], | ||
where: {frameworkName: encodedFrameworkName, attemptIndex: jobAttemptIndex}, | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use the latest snapshot for the same framework and attemptIndex?
To judge if it is the latest snapshot, could you also store the snapshot last write time in (history) DB #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is insertedAt
and updatedAt
field in history db.
What do you mean by use the latest snapshot for the same framework and attemptIndex
? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the history DB, for the same framework name and attemptIndex, there may be many snapshots (some in running state, some in completed state), when show the retry history, we should give user the latest snapshot which is generally the completed state snapshot. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems use updatedAt or snapshot last write time is not right. We should use the snapshot generation time instead of time write to db. If so, seems we should use framework.status.transitionTime
#Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the same framework name and attemptIndex, there is only one record in the history db.
The history is recorded when fc log outputs something like framework xxx is retried.
#Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will the history table store more snapshots in future? Such as rescale snapshot, or it is just retry snapshot?
BTW, even if it is just retry snapshot table, framework xxx is retried.
may be produced by FC more than 1 time in some coner cases, will you override previous one, to make sure there is always only one?
#Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently only retry snapshots are available. For your corner case, there is no overwrite. Two records will be all recorded. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you need to get the last one #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use hash(frameworkName, attemptIndex, historyType) as primary key to solve this problem.
#Closed
But they are in the same table. The way to calculate primary key should be the same? #Closed |
I think the logic could be different as long as we have a clear definition. Here the uid only indicates an identical descriptor for snapshots in OpenPAI. #Closed |
thread[:conn].put_copy_data "#{time}\x01#{frameworkName}\x01#{attemptIndex}\x01#{historyType}\x01#{snapshot}\n" | ||
# use frameworkName + attemptIndex + historyType to generate a uid | ||
uid = Digest::MD5.hexdigest "#{frameworkName}+#{attemptIndex}+#{historyType}" | ||
thread[:conn].put_copy_data "#{time}\x01#{time}\x01#{uid}\x01#{frameworkName}\x01#{attemptIndex}\x01#{historyType}\x01#{snapshot}\n" | ||
elsif kind == "Pod" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add comment that if duplicate, what is the behaiour here?
Crash, retry, ignore or log #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will raise an error and log the error. #Closed
OK #Closed |
requestSynced: false, | ||
}), | ||
{ where: { name: snapshot.getName() } }, | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If databaseModel.Framework.update failed, will silentSynchronizeRequest still be called?
Make sure silentSynchronizeRequest is called only when DB success #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is guaranteed. await
will throw any error that happened. #Closed
synchronizeRequest(snapshot, addOns).catch(logError); | ||
} catch (err) { | ||
logError(err); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why need 2 catch? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One is for normal error. The other is for promise-rejected error. They are different. #Closed
if (config.retainModeEnabled) { | ||
// If database doesn't have the corresponding framework, | ||
// and retain mode is enabled | ||
// tolerate the error and create framework in database. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tolerate the error and create framework in database. [](start = 13, length = 52)
"tolerate the error and create framework in database" ->
"retain the framework, i.e. do not delete it" #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sign off, pls test log flush
Review
Database Controller Test Cases
End-to-end Test
Test Jobs
A job with simple output
Case: job command
echo success
Expect: Succeed, output
success
.Stop a job
Case: job command
sleep 1h
; then stop this jobExpect: The job can be stopped.
Job with retries
Case: job command:
exit 1
; set max retry to be 10Expect: The job retry history can be viewed.
Job with docker secret
Case: Submit a job with docker auth info
Expect: The job can run successfully.
Job with priority class
Case: Submit a job with priority class
Expect: The job priority class is correct.
Job with secret
Case: Submit a job with secret info
Expect: The secret info is successfully written into job.
Bulk job
Case: Submit a job with 1000+ instances.
Expect: The job can run successfully.
Bulk job with retries
Case: Submit a job with 1000+ instances, and 50+ retries.
Expect: The job can run successfully. The job retry history can be viewed.
Test in Certain Case
Database failure
Case: submit a job; shutdown the database; try to submit another job; start the database
Expect: the first job is not affected; the second job cannot be submitted.
Database controller failure
Case: submit a job; shutdown the database controller; try to submit another job; start the database
Expect: the first job is not affected; the second job cannot be submitted.
Framework controller failure
Case: shutdown the framework controller; try to submit a job; start the framework controller
Expect: the job can be submitted, but in
WAITING
status. After the framework controller is started, its state will turn toRUNNING
.Path Test
Rest-server checks conflicts
Case: Submit a job twice
Expect: The second job submission is not allowed.
Case: Stop a non-existed job
Expect: The operation is not allowed.
(Write merger) Add/Update FR (If not equal, override and mark !requestSynced, else no-op)
Environment: Write-merger doesn't forward request; Shutdown database poller;
Case: Fake two same framework requests to write-merger.
Expect: The second request will be ignored.
(Write merger) If retain mode is off, delete frameworks which are not submitted through db.
Environment: Turn on retain mode
Case: Create a framework directly in API server;
Expect: The framework is not deleted.
Environment: Turn off retain mode
Case: Create a framework directly in API server;
Expect: The framework is deleted.
(Write merger) If not equal, override and mark as !requestSynced
Environment: Shutdown database poller
Case: Submit a job; Then change its requestGeneration in API server.
Expect: The framework is marked as
requestSynced=false
.(Database poller) If API server 404 error, mock a delete Framework
Case: Submit A job; Then shutdown watcher after the job starts; Wait until the job succeeded; Stop poller; Start watcher; Stop watcher; Delete this framework manually; Start Poller
Expect:
After watcher is restarted, the job is marked as
state=Completed
andrequestSynced=true
;After poller is restarted, it can mock a delete framework event to write merger.
Finally, the job will be marked as
apiServerDeleted=true
.Upgrade Test
Upgrade from
v1.0.y
Drop the database; Deploy a
v1.0.y
bed, submit some jobs (must include: one running job, one completed job with retry history, and one completed job without retry history). Then upgrade it. Make sure the job information is correct, and running jobs are not affected.Upgrade from
v1.1.y
Drop the database; Deploy a
v1.1.y
bed, submit some jobs (must include: one running job, one completed job with retry history, and one completed job without retry history). Then upgrade it. Make sure the job information is correct, and running jobs are not affected.Stress Test