This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

New RestServer Architecture: RestServer -> DB -> ApiServer; P0 Items #4761

Merged: 34 commits, Aug 12, 2020
@@ -48,12 +48,6 @@ rest-server:
webportal:
server-port: 9286

internal-storage:
enable: true

postgresql:
enable: true

# If you want to customize the scheduling config, such as adding more virtual clusters or more GPU types, check:
#https://github.com/microsoft/pai/blob/master/docs/hivedscheduler/devops.md
hivedscheduler:
28 changes: 28 additions & 0 deletions docs/database_controller.md
@@ -0,0 +1,28 @@
# Database Controller

<center>
<img src="./images/dbc_structure.png" width="100%">
</center>

Database Controller manages job status in the database and the API server. In brief, records in the database are treated as the ground truth and are synchronized to the API server.

Database Controller contains 3 main components: write merger, poller, and watcher. Here is an example of a job's lifetime as controlled by these 3 components:

1. The user submits a framework request to rest-server.
2. Rest-server forwards the request to write merger.
3. Write merger saves it to the database, marks `synced=false`, and returns.
4. The user is notified that the framework request was created successfully.
5. Poller finds the `synced=false` request and synchronizes it to the API server.
6. Watcher finds that the framework has been created in the API server and sends the event to write merger.
7. Write merger receives the watched event, marks `synced=true`, and updates the job status according to the event.
8. The job finishes. Watcher sends this event to write merger.
9. Write merger receives the watched event, marks `completed=true`, and updates the job status according to the event.
10. Poller finds the `completed=true` request and deletes it from the API server.
11. Watcher sends the delete event to write merger.
12. Write merger receives the watched event and marks `deleted=true`.
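
The twelve steps above can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions, not the actual implementation (the real components are Node.js services): `FrameworkRecord`, `WriteMerger`, `poll`, the event names, and the use of a `set` as a stand-in for the API server are all hypothetical. Only the three flags `synced`, `completed`, and `deleted` come from the text.

```python
from dataclasses import dataclass


@dataclass
class FrameworkRecord:
    """Hypothetical database row; only the three flags come from the doc."""
    name: str
    synced: bool = False
    completed: bool = False
    deleted: bool = False


class WriteMerger:
    """Hypothetical stand-in: the only component that writes to the database."""

    def __init__(self):
        self.db = {}

    def submit(self, name):
        # Steps 2-3: save the request with synced=false and return.
        self.db[name] = FrameworkRecord(name)

    def on_event(self, name, event):
        # Steps 7, 9, 12: update flags according to the watched event.
        record = self.db[name]
        if event == "created":
            record.synced = True
        elif event == "completed":
            record.completed = True
        elif event == "deleted":
            record.deleted = True


def poll(db, api_server):
    """One poller pass. Returns the events a watcher would then observe."""
    events = []
    for record in db.values():
        if not record.synced:
            # Step 5: push the unsynced request to the API server.
            api_server.add(record.name)
            events.append((record.name, "created"))
        elif record.completed and not record.deleted and record.name in api_server:
            # Step 10: remove the completed framework from the API server.
            api_server.discard(record.name)
            events.append((record.name, "deleted"))
    return events


merger = WriteMerger()
api_server = set()  # assumed stand-in for the Kubernetes API server

merger.submit("job-1")                          # steps 1-4
for name, event in poll(merger.db, api_server):  # steps 5-7
    merger.on_event(name, event)
merger.on_event("job-1", "completed")           # steps 8-9
for name, event in poll(merger.db, api_server):  # steps 10-12
    merger.on_event(name, event)

record = merger.db["job-1"]
print(record)
```

In this sketch the poller never writes to the database itself; every flag change goes through the write merger, which mirrors the division of labor described in the list.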

## Development

**Environment:** Node.js 8.17.0. Run `yarn install` under `src/` to install all dependencies. To set environment variables, create a `.env` file under `src/`.

**Lint:** Run `npm run lint -- --fix` under `src/` or `sdk/`.
Binary file added docs/images/dbc_structure.png
63 changes: 63 additions & 0 deletions src/database-controller/.dockerignore
@@ -0,0 +1,63 @@
.git

# Directory for submitted jobs' json file and scripts
frameworklauncher/

# Logs
logs
*.log
npm-debug.log*
yarn-debug.log*
yarn-error.log*

# Runtime data
pids
*.pid
*.seed
*.pid.lock

# Directory for instrumented libs generated by jscoverage/JSCover
lib-cov

# Coverage directory used by tools like istanbul
coverage

# nyc test coverage
.nyc_output

# Grunt intermediate storage (http://gruntjs.com/creating-plugins#storing-task-files)
.grunt

# Bower dependency directory (https://bower.io/)
bower_components

# node-waf configuration
.lock-wscript

# Compiled binary addons (https://nodejs.org/api/addons.html)
build/Release

# Dependency directories
node_modules/
jspm_packages/

# Typescript v1 declaration files
typings/

# Optional npm cache directory
.npm

# Optional eslint cache
.eslintcache

# Optional REPL history
.node_repl_history

# Output of 'npm pack'
*.tgz

# Yarn Integrity file
.yarn-integrity

# dotenv environment variables file
.env
9 changes: 9 additions & 0 deletions src/database-controller/.editorconfig
@@ -0,0 +1,9 @@
root = true

[*]
charset = utf-8
end_of_line = lf
indent_size = 2
indent_style = space
insert_final_newline = true
trim_trailing_whitespace = true
106 changes: 106 additions & 0 deletions src/database-controller/.gitignore
@@ -0,0 +1,106 @@
# Logs
logs
*.log
npm-debug.log*
yarn-debug.log*
yarn-error.log*
lerna-debug.log*

# Diagnostic reports (https://nodejs.org/api/report.html)
report.[0-9]*.[0-9]*.[0-9]*.[0-9]*.json

# Runtime data
pids
*.pid
*.seed
*.pid.lock

# Directory for instrumented libs generated by jscoverage/JSCover
lib-cov

# Coverage directory used by tools like istanbul
coverage
*.lcov

# nyc test coverage
.nyc_output

# Grunt intermediate storage (https://gruntjs.com/creating-plugins#storing-task-files)
.grunt

# Bower dependency directory (https://bower.io/)
bower_components

# node-waf configuration
.lock-wscript

# Compiled binary addons (https://nodejs.org/api/addons.html)
build/Release

# Dependency directories
node_modules/
jspm_packages/

# TypeScript v1 declaration files
typings/

# TypeScript cache
*.tsbuildinfo

# Optional npm cache directory
.npm

# Optional eslint cache
.eslintcache

# Microbundle cache
.rpt2_cache/
.rts2_cache_cjs/
.rts2_cache_es/
.rts2_cache_umd/

# Optional REPL history
.node_repl_history

# Output of 'npm pack'
*.tgz

# Yarn Integrity file
.yarn-integrity

# dotenv environment variables file
.env
.env.test

# parcel-bundler cache (https://parceljs.org/)
.cache

# Next.js build output
.next

# Nuxt.js build / generate output
.nuxt
dist

# Gatsby files
.cache/
# Comment in the public line in if your project uses Gatsby and *not* Next.js
# https://nextjs.org/blog/next-9-1#public-directory-support
# public

# vuepress build output
.vuepress/dist

# Serverless directories
.serverless/

# FuseBox cache
.fusebox/

# DynamoDB Local files
.dynamodb/

# TernJS port file
.tern-port

version/
3 changes: 3 additions & 0 deletions src/database-controller/README.md
@@ -0,0 +1,3 @@
## Database Controller

See [here](../../docs/database_controller.md).
12 changes: 12 additions & 0 deletions src/database-controller/build/build-pre.sh
@@ -0,0 +1,12 @@
#!/bin/bash

# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.

pushd $(dirname "$0") > /dev/null

mkdir -m 777 -p "../version"
cp -arf "../../../version/PAI.VERSION" "../version/"
echo `git rev-parse HEAD` > "../version/COMMIT.VERSION"

popd > /dev/null
@@ -0,0 +1,21 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.

FROM node:carbon

WORKDIR /database-controller

COPY ./src ./src
COPY ./sdk ./sdk
COPY ./version ./version

WORKDIR src

RUN yarn install

RUN npm install json -g
RUN json -I -f package.json -e "this.paiVersion=\"`cat ../version/PAI.VERSION`\""
RUN json -I -f package.json -e "this.paiCommitVersion=\"`cat ../version/COMMIT.VERSION`\""


CMD ["sleep", "infinity"]
29 changes: 29 additions & 0 deletions src/database-controller/config/database-controller.yaml
@@ -0,0 +1,29 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.

service_type: "common"

# general settings
# Log level of all logs. You can choose from error, warn, info, http, verbose, debug, and silly.
log-level: info
# Whether to enable retain mode.
# If someone submits a framework directly, bypassing the database, write merger can still find it.
# If retain mode is on, such frameworks are ignored.
# If retain mode is off (the default), they are deleted to keep the database as the ground truth.
retain-mode: false
# The global timeout for all calls to Kubernetes API server.
k8s-connection-timeout-second: 120
Review comment by @yqwang-ms (Member), Aug 6, 2020:
Is a manual recovery-mode still needed if the init container already recovers automatically? #Closed

Review comment by @yqwang-ms (Member), Aug 6, 2020, in reply to 466150363:
Also note that during recovery the service should not serve any HTTP requests; it should only recover.

# The timeout for calls to write merger.
write-merger-connection-timeout-second: 120

# write merger
# The serving port for write merger.
write-merger-port: 9748
# Max connection number to database in write merger.
write-merger-max-db-connection: 50

# db poller
# Polling interval of database poller. Default value is 120.
db-poller-interval-second: 120
# Max connection number to database in database poller.
db-poller-max-db-connection: 10
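
The `retain-mode` comments in this config describe a decision about frameworks that exist in the API server but have no record in the database. A minimal sketch of that decision, assuming framework names are compared as plain strings (the function name `frameworks_to_delete` and its arguments are hypothetical, not part of the PR):

```python
def frameworks_to_delete(api_server_frameworks, db_frameworks, retain_mode=False):
    """Return the API server frameworks that should be deleted.

    Frameworks unknown to the database are ignored when retain mode is on,
    and deleted when it is off (the default), so that the database stays
    the ground truth.
    """
    unknown = set(api_server_frameworks) - set(db_frameworks)
    if retain_mode:
        return []
    return sorted(unknown)


in_api_server = ["job-a", "job-b", "manual-job"]
in_database = ["job-a", "job-b"]

print(frameworks_to_delete(in_api_server, in_database))
print(frameworks_to_delete(in_api_server, in_database, retain_mode=True))
```

With retain mode off, only `manual-job` is selected for deletion; with retain mode on, nothing is.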
39 changes: 39 additions & 0 deletions src/database-controller/config/database_controller.py
@@ -0,0 +1,39 @@
#!/usr/bin/env python

# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.

import copy
import logging


class DatabaseController(object):
    def __init__(self, cluster_conf, service_conf, default_service_conf):
        self.cluster_conf = cluster_conf
        self.service_conf = self.merge_service_configuration(service_conf, default_service_conf)
        self.logger = logging.getLogger(__name__)

    @staticmethod
    def merge_service_configuration(overwrite_srv_cfg, default_srv_cfg):
        if overwrite_srv_cfg is None:
            return default_srv_cfg
        srv_cfg = default_srv_cfg.copy()
        for k in overwrite_srv_cfg:
            srv_cfg[k] = overwrite_srv_cfg[k]
        return srv_cfg

    def get_master_ip(self):
        for host_conf in self.cluster_conf["machine-list"]:
            if "pai-master" in host_conf and host_conf["pai-master"] == "true":
                return host_conf["hostip"]

    def validation_pre(self):
        return True, None

    def run(self):
        result = copy.deepcopy(self.service_conf)
        result['write-merger-url'] = 'http://{}:{}'.format(self.get_master_ip(), result['write-merger-port'])
        return result

    def validation_post(self, conf):
        return True, None
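
To illustrate what `run()` produces, here is a standalone sketch that mirrors the merge and URL-building logic of the class above as plain functions. The cluster layout, IP addresses, and the override value are made-up sample data; only the keys `machine-list`, `pai-master`, `hostip`, `write-merger-port`, and `write-merger-url` come from the diff.

```python
import copy


def merge_service_configuration(overwrite_srv_cfg, default_srv_cfg):
    # Shallow merge: user-provided keys replace the defaults.
    if overwrite_srv_cfg is None:
        return default_srv_cfg
    srv_cfg = default_srv_cfg.copy()
    srv_cfg.update(overwrite_srv_cfg)
    return srv_cfg


def get_master_ip(cluster_conf):
    # Returns the first host flagged as pai-master, or None if there is none.
    for host_conf in cluster_conf["machine-list"]:
        if host_conf.get("pai-master") == "true":
            return host_conf["hostip"]
    return None


# Sample data (assumed for illustration only).
cluster_conf = {
    "machine-list": [
        {"hostip": "10.0.0.2"},
        {"hostip": "10.0.0.1", "pai-master": "true"},
    ]
}
default_conf = {"write-merger-port": 9748, "log-level": "info", "retain-mode": False}
user_conf = {"log-level": "debug"}

service_conf = merge_service_configuration(user_conf, default_conf)
result = copy.deepcopy(service_conf)
result["write-merger-url"] = "http://{}:{}".format(
    get_master_ip(cluster_conf), result["write-merger-port"]
)
print(result["write-merger-url"])  # http://10.0.0.1:9748
```

Note that if no host is marked as `pai-master`, `get_master_ip` returns `None` and the resulting URL would contain the literal string `None`; the class above behaves the same way.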