This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

2020 July ~ Aug Release #4642

Closed
24 of 39 tasks
scarlett2018 opened this issue Jun 24, 2020 · 1 comment

Comments

@scarlett2018
Member

scarlett2018 commented Jun 24, 2020

This plan captures the work in July 2020. This is a 6-week iteration with a major focus on system features and improvements, and we will ship it at the beginning of August.

Release Manager

@yiyione

Endgame

Code freeze: 8.24
Scrum demo date: 8.25
Endgame & retrospective date: 8.31

Plan Items

Below is a summary of the top level plan items.

  1. DB migration (New RestServer Architecture: RestServer -> DB -> ApiServer #4651) TestOwner: @debuggy
  • New RestServer Arch: RestServer -> DB -> ApiServer
  2. User and Group Management TestOwner: @fanyangCS
  • P0 - @yiyione - VC && Group experience for Admin/User
  3. Error message for SKU (i.e. CPU only job errors) TestOwner: @debuggy
  4. Robustness improvement features (User job write too much logs will cause disk pressure #4694) TestOwner: @abuccts
  • P0 - @Binyang2014 - PAI service POD priority; disk pressure option
  5. HiveD improvement (aware bad nodes), bug fixes, and demo - @abuccts TestOwner: @yiyione
  • P0 bug fixes code merged, need test
  • P0 feature/hackathon demo
  6. Marketplace @debuggy / @TobeyQin TestOwner: @TobeyQin
  7. AKS engine - @abuccts
  • P0 shorten and simplify the Azure OpenPAI deployment
  • P0 Quick start for Azure, and deploy CNI issue @abuccts
  8. PAI DRI & Wiki
  9. Webportal api/v1 code update TestOwner: @debuggy
  10. Runtime image TestOwner: @hzy46

Stretch Goals

  1. Hybrid or multi-cluster management (SHAIIC)
  2. P1 Elastic DL and job-level scale up and scale down - @yqwang-ms
  • check with Ming/NNI and other integration scenarios
  • TODO: summarize the customer-facing scenarios supported
  3. REST server body size limit
  4. P2 REST server history and detail page code refactor @debuggy (add new eslint rules and fix errors #4823) TestOwner: @yiyione
  5. P1 Utilization support new features @hzy46 @Binyang2014
  6. P1 Cloud autoscale - @abuccts - scale up and scale down
  • TODO: investigate whether the official autoscale is a fit for OpenPAI
  7. P1 (depends on DB migration and refactor item) Surfacing more backend errors to users in Job Details Page (@debuggy, @yqwang-ms) (Enrich job debugging info #4649)
  8. rotate-log issue @Binyang2014
  • size-based rotation
  • webportal fetches and merges the last two log files
  9. P1 Release process speed up @yiyione
  10. The defaulting should only be done after submission phase (#4576)
  11. P1 an image with isomorphic support for both CPU and GPU debug/deploy (SHAIIC)
  12. P1 DShuttle (Dshuttle integration Plan #4599)
  13. Follow-up items for DB
  • P1 archive (delete) jobs (SHAIIC) (Support delete/archive jobs #4453)
  • P2 Job tags and favorite jobs (SHAIIC)
  • [comments] DB replication; Master/Slave; SQL Azure; Recover method; P2 //(pending) job GC
TODO: all - review owned issues and put P0 and P1 into current plan

@scarlett2018
Member Author

Postponed items have been moved to #4512.

@scarlett2018 scarlett2018 unpinned this issue Sep 10, 2020