This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Use PAI DRI notebook, DRI tickets, Wiki to update document #4760

Closed
28 tasks done
hzy46 opened this issue Jul 23, 2020 · 0 comments

hzy46 commented Jul 23, 2020

To Do:

  • Backup & remove previous Wiki
  • PAI DRI notebook review
  • Github DRI-related issues review
  • DRI tickets review
  • Github Raised by Customer issues review

Doc Refine TO DO:

  • User Manual Update
    • Add an FAQ document
      • Why a job keeps waiting
      • Why a job is retried or preempted
    • Add document about distributed training
  • Admin Manual Update
    • Add a "Recommended Practices" document to introduce our practices for admins
      • DRI practice - duties, severity definition, triage and prioritization, SLA, how to track issues; DevOps (i.e. issue tracking flow, etc.)
      • Team shared practices - VC and group management practices, storage for group practices, etc.
      • Onboarding practice - onboarding guidance
    • Refine Installation Guide
      • change openpai requirement/kubespray requirement -> hardware requirement/software requirement
      • Answer questions about HA and single-node support before installation
    • Refine Installation FAQs and Troubleshooting
      • Add solution for apt-get issue
    • Refine document about VC
      • pinned cell
    • Refine How to set up storage
      • Add document about storage-manager service (need help)
    • Change Troubleshooting to Alerting and Troubleshooting
      • Introduce alert manager and its settings
      • Handle PAI alerts (need help)
      • V100lp issue: use a service to handle de-allocation and auto re-join
      • Cannot see utilization

Pending (features or document to be added in the future):

  • User onboarding
    • How to apply for access:
      • AAD
      • Basic Auth: approval flow
    • Cluster Info on Home Page (i.e. work on it for the multi-cluster feature);
  • Guidance to backup / restore data (Pending: new feature to discuss)
    • Job status in DB, logs on disk
    • Cfg/ user … in etcd
    • Job metrics in Prom (?)
  • Add / remove master nodes for HA

hzy46 self-assigned this Jul 23, 2020
hzy46 changed the title from "Use PAI DRI notebook and DRI tickets to update document" to "Use PAI DRI notebook, DRI tickets, Wiki to update document" Jul 23, 2020
This was referenced Sep 10, 2020
hzy46 closed this as completed Sep 24, 2020

2 participants