Candidate ideas about PAI experience improvements #3338

qfyin opened this issue Aug 8, 2019 · 0 comments

qfyin commented Aug 8, 2019

1. Major features in near-term iterations

1.1. A complete deployment tool (script)

  • including storage / database / marketplace ... (the current deployment is known as the minimal deployment)
  • For users with (virtual) machines only (no other infrastructure / components)
  • users could specify that some machines are computing-only, while others are for system / storage
  • For users with an Azure subscription only

1.2. Marketplace

  • Job sharing
    • job / docker image / data ...
  • A simple mechanism for content delivery
    • general files such as lecture notes (pdf / ppt)
    • code / libraries
    • examples / tutorials
    • ...
  • Implementation
    • storage to store files
    • database
      • name / display name / structure / type / description / keywords ...
      • file paths
      • openpai plugin meta info (used for openpai plugins)
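
A minimal sketch, in Python, of what one metadata record could look like; every field name below is an assumption inferred from the list above, not a settled schema:

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class MarketplaceItem:
    # Hypothetical metadata record for a shared item (job / docker image / data ...)
    name: str
    display_name: str
    type: str                      # e.g. "job", "image", "data", "file"
    structure: str                 # how the content is organized
    description: str = ""
    keywords: List[str] = field(default_factory=list)
    file_paths: List[str] = field(default_factory=list)        # paths in the backing storage
    plugin_meta: Dict[str, str] = field(default_factory=dict)  # openpai plugin meta info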

1.3. Favorite job list (star a job)

  • users could star a job
  • use the etcd database to store the favorite job list (refer to the group list implementation)
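
A minimal sketch of the read-modify-write with the python-etcd3 client; the key layout and endpoint are assumptions, not the actual group-list schema:

import json
import etcd3  # python-etcd3 client

# Assumed key layout: /users/<username>/favorite-jobs -> JSON list of job names
etcd = etcd3.client(host="etcd.example.com", port=2379)

def star_job(username, job_name):
    # Read the user's favorite list, add the job, and write it back
    key = "/users/{}/favorite-jobs".format(username)
    value, _ = etcd.get(key)
    favorites = json.loads(value) if value else []
    if job_name not in favorites:
        favorites.append(job_name)
    etcd.put(key, json.dumps(favorites))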

1.4. User Expression

  • users could store some key-value pairs in the cluster
  • users could reference them with <% expression.<key> %> in the protocol YAML, and the REST server would replace each reference with the stored value during submission
    • the expression is replaced with the current user's value (an error is reported if the key is invalid)
  • users could view / add / delete / update their own expressions through an authenticated API (with token)
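
A minimal sketch of the substitution step, assuming the user's expressions arrive as a plain dict; the regex and error handling are illustrative only:

import re

def substitute_expressions(yaml_text, user_expressions):
    # Replace <% expression.<key> %> placeholders with the current user's stored values
    pattern = re.compile(r"<%\s*expression\.([A-Za-z0-9_]+)\s*%>")

    def lookup(match):
        key = match.group(1)
        if key not in user_expressions:
            raise KeyError("invalid expression key: " + key)  # report error on invalid key
        return user_expressions[key]

    return pattern.sub(lookup, yaml_text)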

1.5. Set up a local environment for OpenPAI

  • In the web portal, users could get help like
pip install -U "git+https://github.com/Microsoft/pai@master#egg=openpaisdk&subdirectory=contrib/python-sdk"
pai add-cluster --alias <cluster-alias> --pai-uri <uri> --user <username> [--token <AAD/token>]

users could copy and execute these commands in a command prompt

  • The above pai add-cluster command will try to connect to the cluster and query the necessary information (e.g. team-wise storage, virtual clusters)

  • Users could access the storages from a unified interface like

pai listdir pai://<cluster-alias>/<storage-name>/path/to/dest
  • users get the list of accessible storages from the REST API

  • will support the necessary file-level operations in pyfilesystem, such as listdir, makedir(s), copy, delete

  • may require users to manually enable an NFS client (mount command on Windows / Linux / macOS)
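
A minimal sketch of what the unified interface could sit on, using pyfilesystem2; the mount point is an assumption, and a real implementation would presumably register a custom opener for the pai:// scheme instead:

import fs  # pyfilesystem2

# Assumption: pai://<cluster-alias>/<storage-name> has already been resolved
# to a locally mounted share; a custom pai:// opener would do this resolution.
storage = fs.open_fs("/mnt/pai/my-storage")

print(storage.listdir("/path/to/dest"))          # listdir
storage.makedirs("/path/to/new", recreate=True)  # makedir(s)
storage.remove("/path/to/dest/old.txt")          # delete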

2. Minor features in near-term iterations

2.1. Behavior of the REST API for accessing job config

  • return secrets only when the requesting user is the job owner
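
A minimal sketch of that ownership check, with all names assumed for illustration:

def render_job_config(config, requester, owner):
    # Hypothetical: strip the secrets section unless the requester owns the job
    if requester == owner:
        return config
    redacted = dict(config)
    redacted.pop("secrets", None)
    return redacted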

2.2. Job submission accepts YAML content

2.3. Job editing experience improvement (during submission)

  • Customized tips based on each user's jobs
    e.g. if a user uses HDFS frequently, recommend the storage plugin

  • Intelligent syntax and semantic checking in the commands parser (see the sketch after this list)

    • in the commands form in job submission
      e.g. spelling mistakes
      e.g. underline un-created / un-mounted environment variables, paths, and file names, etc.
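
One piece of such a checker, sketched in Python; the $VAR convention and the notion of a "defined" set are assumptions about what the parser would track:

import re

def find_undefined_env_vars(commands, defined):
    # Flag $VAR / ${VAR} references that no earlier step has defined
    referenced = set()
    for command in commands:
        referenced.update(re.findall(r"\$\{?([A-Za-z_][A-Za-z0-9_]*)\}?", command))
    return referenced - set(defined)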

2.4. Connect to the job container with one click

  • in job details, add a button named ssh; clicking it opens an xterm page that logs in over SSH
  • on the command line, opai job connect would support SSH (it currently only supports connecting to a Jupyter server)
  • a login-once experience that helps users handle the private key
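
A minimal sketch of what opai job connect could do under the hood with paramiko, assuming the REST API exposes the container's SSH endpoint and a per-job private key; the root username is also an assumption:

import paramiko

def connect_to_job(host, port, key_path):
    # Open an SSH session to the job container using the job's private key
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, port=port, username="root", key_filename=key_path)
    return client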

3. At some point

3.1. Documentation

  • Editor-selected recommendations based on job profiling
    e.g. (monthly) newsletters

  • a one-page cheat sheet similar to the k8s cheat sheet

3.2. Job profiling

  • job history backup
    capture jobs daily (or weekly) from existing clusters and save them as CSV files

  • user behavior analysis
    e.g. git / wget / curl / pip install ... docker image / hdfs / ...

3.3. Debugging

  • algorithm code is debugged locally; environment-related code is debugged in a CPU container (remote debug plugin)

4. Depends on others or in a far future

4.1. Diagnostics support

  • stdout/stderr analysis
    the runtime would collect lots of explicit error info
    extract known error patterns and provide friendly messages to users (see the sketch after this list)

  • data pipeline
    e.g. wget with the target directory created first
    check the readiness of multiple data sources at one time
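
A minimal sketch of the pattern-extraction idea; the two patterns and messages below are invented examples, not a real catalog:

import re

# Hypothetical table of known error patterns -> friendly messages
KNOWN_PATTERNS = [
    (re.compile(r"CUDA out of memory"),
     "The job ran out of GPU memory; try a smaller batch size."),
    (re.compile(r"No such file or directory: '([^']+)'"),
     "A required file is missing: {0}."),
]

def diagnose(stderr):
    # Scan stderr for known patterns and collect friendly messages
    messages = []
    for pattern, template in KNOWN_PATTERNS:
        match = pattern.search(stderr)
        if match:
            messages.append(template.format(*match.groups()))
    return messages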

4.2. Billing

4.3. Scalable deployment on Azure (AKS?)

5. Archive

5.1. Unified storage API

In the pure K8s version, storage is expected to be diverse and managed by the admin (e.g. a user may not have access to the authentication info of a team-wise storage; the runtime will mount it in the background inside the job container).

  • Question - how to access the storage from local machine?
  • Solution 1 - mount nfs / samba in windows
    • cons: unclear how it could be used by 3rd-party tools like NNI
  • Solution 2 - SDK provides api wrapping for every storage type
    • cons: fragmentation; may need to leverage the file system library pyfilesystem
    • cons: some storages cannot be supported because of missing authentication info
  • Solution 3 - jump box job
    • a consistent, long-running, low-resource-usage job
    • a data transfer service based on SSH or a REST API
    • cons: every data access operation requires checking or relaunching the jump box job