Candidate ideas about PAI experience improvements #3338

qfyin opened this issue Aug 8, 2019 · 0 comments

qfyin commented Aug 8, 2019

1. Major features in near-term iterations

1.1. A complete deployment tool (script)

  • including storage / database / marketplace ... (the current deployment is known as the minimal deployment)
  • For users with (virtual) machines only (no other infrastructure / components)
  • users could specify that some machines are computing-only, while others are for system / storage
  • For users with an Azure subscription only

1.2. Marketplace

  • Job sharing
    • job / docker image / data ...
  • A simple mechanism for content delivery
    • general files such as lecture notes (pdf / ppt)
    • code / libraries
    • examples / tutorials
    • ...
  • Implementation
    • storage to store files
    • database
      • name / display name / structure / type / description / keywords ...
      • file paths
      • openpai plugin meta info (used for openpai plugins)
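
A minimal sketch, in Python, of what one metadata record could look like; every field name below is an assumption inferred from the list above, not a settled schema:

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class MarketplaceItem:
    # Hypothetical metadata record for a shared item (job / docker image / data ...)
    name: str
    display_name: str
    type: str                      # e.g. "job", "image", "data", "file"
    structure: str                 # how the content is organized
    description: str = ""
    keywords: List[str] = field(default_factory=list)
    file_paths: List[str] = field(default_factory=list)        # paths in the backing storage
    plugin_meta: Dict[str, str] = field(default_factory=dict)  # openpai plugin meta info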

1.3. Favorite job list (star a job)

  • users could star a job
  • use the etcd database to store the favorite job list (refer to the group list implementation)
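
A minimal sketch of the read-modify-write with the python-etcd3 client; the key layout and endpoint are assumptions, not the actual group-list schema:

import json
import etcd3  # python-etcd3 client

# Assumed key layout: /users/<username>/favorite-jobs -> JSON list of job names
etcd = etcd3.client(host="etcd.example.com", port=2379)

def star_job(username, job_name):
    # Read the user's favorite list, add the job, and write it back
    key = "/users/{}/favorite-jobs".format(username)
    value, _ = etcd.get(key)
    favorites = json.loads(value) if value else []
    if job_name not in favorites:
        favorites.append(job_name)
    etcd.put(key, json.dumps(favorites))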

1.4. User Expression

  • users could store some key-value pairs in the cluster
  • users could reference them with <% expression.<key> %> in the protocol YAML, and the REST server would replace each reference with the stored value during submission
    • the expression is replaced with the current user's value (an error is reported if the key is invalid)
  • users could view / add / delete / update their own expressions through an authenticated API (with token)
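
A minimal sketch of the substitution step, assuming the user's expressions arrive as a plain dict; the regex and error handling are illustrative only:

import re

def substitute_expressions(yaml_text, user_expressions):
    # Replace <% expression.<key> %> placeholders with the current user's stored values
    pattern = re.compile(r"<%\s*expression\.([A-Za-z0-9_]+)\s*%>")

    def lookup(match):
        key = match.group(1)
        if key not in user_expressions:
            raise KeyError("invalid expression key: " + key)  # report error on invalid key
        return user_expressions[key]

    return pattern.sub(lookup, yaml_text)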

1.5. Set up a local environment for OpenPAI

  • In the web portal, users could get help like
pip install -U "git+https://github.com/Microsoft/pai@master#egg=openpaisdk&subdirectory=contrib/python-sdk"
pai add-cluster --alias <cluster-alias> --pai-uri <uri> --user <username> [--token <AAD/token>]

users could copy and execute these commands in a command prompt

  • The above pai add-cluster command will try to connect to the cluster and query the necessary information (e.g. team-wise storage, virtual clusters)

  • Users could access the storages from a unified interface like

pai listdir pai://<cluster-alias>/<storage-name>/path/to/dest
  • users get the list of accessible storages from the REST API

  • will support the necessary file-level operations in pyfilesystem, such as listdir, makedir(s), copy, delete

  • may require users to manually enable an NFS client (mount command on Windows / Linux / macOS)
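
A minimal sketch of what the unified interface could sit on, using pyfilesystem2; the mount point is an assumption, and a real implementation would presumably register a custom opener for the pai:// scheme instead:

import fs  # pyfilesystem2

# Assumption: pai://<cluster-alias>/<storage-name> has already been resolved
# to a locally mounted share; a custom pai:// opener would do this resolution.
storage = fs.open_fs("/mnt/pai/my-storage")

print(storage.listdir("/path/to/dest"))          # listdir
storage.makedirs("/path/to/new", recreate=True)  # makedir(s)
storage.remove("/path/to/dest/old.txt")          # delete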

2. Minor features in near-term iterations

2.1. Behavior of the REST API for accessing job config

  • return secrets only when the requesting user is the job owner
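
A minimal sketch of that ownership check, with all names assumed for illustration:

def render_job_config(config, requester, owner):
    # Hypothetical: strip the secrets section unless the requester owns the job
    if requester == owner:
        return config
    redacted = dict(config)
    redacted.pop("secrets", None)
    return redacted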

2.2. Job submission accepts YAML content

2.3. Job editing experience improvement (during submission)

  • Customized tips based on each user's jobs
    e.g. if a user uses HDFS frequently, recommend the storage plugin

  • Intelligent syntax and semantic checking in the commands parser (see the sketch after this list)

    • in the commands form in job submission
      e.g. spelling mistakes
      e.g. underline un-created / un-mounted environment variables, paths, and file names, etc.
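
One piece of such a checker, sketched in Python; the $VAR convention and the notion of a "defined" set are assumptions about what the parser would track:

import re

def find_undefined_env_vars(commands, defined):
    # Flag $VAR / ${VAR} references that no earlier step has defined
    referenced = set()
    for command in commands:
        referenced.update(re.findall(r"\$\{?([A-Za-z_][A-Za-z0-9_]*)\}?", command))
    return referenced - set(defined)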

2.4. Connect to the job container with one click

  • in job details, add a button named ssh; clicking it opens an xterm page that logs in over SSH
  • on the command line, opai job connect would support SSH (it currently only supports connecting to a Jupyter server)
  • a login-once experience that helps users handle the private key
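
A minimal sketch of what opai job connect could do under the hood with paramiko, assuming the REST API exposes the container's SSH endpoint and a per-job private key; the root username is also an assumption:

import paramiko

def connect_to_job(host, port, key_path):
    # Open an SSH session to the job container using the job's private key
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, port=port, username="root", key_filename=key_path)
    return client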

3. At some point

3.1. Documentation

  • Editor-selected recommendations based on job profiling
    e.g. (monthly) newsletters

  • a one-page cheat sheet similar to the k8s cheat sheet

3.2. Job profiling

  • job history backup
    capture jobs daily (or weekly) from existing clusters and save them as CSV files

  • user behavior analysis
    e.g. git / wget / curl / pip install ... docker image / hdfs / ...

3.3. Debugging

  • algorithm code is debugged locally; environment-related code is debugged in a CPU container (remote debug plugin)

4. Depends on others or in a far future

4.1. Diagnostics support

  • stdout/stderr analysis
    the runtime would collect lots of explicit error info
    extract known error patterns and provide friendly messages to users (see the sketch after this list)

  • data pipeline
    e.g. wget with the target directory created first
    check the readiness of multiple data sources at one time
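
A minimal sketch of the pattern-extraction idea; the two patterns and messages below are invented examples, not a real catalog:

import re

# Hypothetical table of known error patterns -> friendly messages
KNOWN_PATTERNS = [
    (re.compile(r"CUDA out of memory"),
     "The job ran out of GPU memory; try a smaller batch size."),
    (re.compile(r"No such file or directory: '([^']+)'"),
     "A required file is missing: {0}."),
]

def diagnose(stderr):
    # Scan stderr for known patterns and collect friendly messages
    messages = []
    for pattern, template in KNOWN_PATTERNS:
        match = pattern.search(stderr)
        if match:
            messages.append(template.format(*match.groups()))
    return messages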

4.2. Billing

4.3. Scalable deployment on Azure (AKS?)

5. Archive

5.1. Unified storage API

In the pure K8s version, storage is expected to be diverse and managed by the admin (e.g. a user may not have access to the authentication info of a team-wise storage; the runtime will mount it in the background inside the job container).

  • Question - how to access the storage from local machine?
  • Solution 1 - mount nfs / samba in windows
    • cons: unclear how it could be used by 3rd-party tools like NNI
  • Solution 2 - SDK provides api wrapping for every storage type
    • cons: fragmentation; may need to leverage the file system library pyfilesystem
    • cons: some storages cannot be supported because of missing authentication info
  • Solution 3 - jump box job
    • a consistent, long-running, low-resource-usage job
    • a data transfer service based on SSH or a REST API
    • cons: every data access operation requires checking or relaunching the jump box job