This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Dshuttle integration Plan #4599

Open
8 of 15 tasks
Binyang2014 opened this issue Jun 3, 2020 · 4 comments

@Binyang2014
Contributor

Binyang2014 commented Jun 3, 2020

P0: Integrate with PAI

Code freeze: 9.31 Endgame: 10.12

Deploy

  • Make Dshuttle available as a k8s PV/PVC, leveraging the PAI storage solution. @Binyang2014
    1. PAI service config
    2. Add alluxio.fuse into /etc/updatedb.conf
    3. Expose Dshuttle API to frontend
    4. Add dshuttle type in rest-server
    5. Refine UI display, need @yiyione's help
    6. A tool to let customers preload data to Dshuttle
  • Limit/bound Dshuttle resource usage. @Binyang2014
    High memory usage is caused by a gRPC flow-control issue and can be mitigated by changing the default config. 6GB~8GB seems to be enough for CSI.
  • Doc for Dshuttle configuration, and how to change hived config. @Binyang2014 Qianxi
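
For step 2 of the PV/PVC item, a minimal sketch of how the updatedb.conf change could be scripted. It assumes the entry belongs in the `PRUNEFS` list (the list of filesystem types that `updatedb` skips); the plan above does not say which variable is meant, and the helper name `add_prunefs_entry` is hypothetical. The function only transforms text, so writing the result back to /etc/updatedb.conf is left to the deployment script.

```python
import re


def add_prunefs_entry(conf_text: str, fs_type: str = "alluxio.fuse") -> str:
    """Return conf_text with fs_type present in the PRUNEFS list.

    Assumption: the plan's "add alluxio.fuse" refers to PRUNEFS.
    """
    pattern = re.compile(r'^([ \t]*PRUNEFS[ \t]*=[ \t]*")([^"]*)(")', re.MULTILINE)
    match = pattern.search(conf_text)
    if match is None:
        # No PRUNEFS line yet: append one at the end of the file.
        sep = "" if conf_text.endswith("\n") or not conf_text else "\n"
        return f'{conf_text}{sep}PRUNEFS="{fs_type}"\n'
    entries = match.group(2).split()
    if fs_type in entries:
        return conf_text  # already present, keep the file unchanged
    entries.append(fs_type)
    new_line = f'{match.group(1)}{" ".join(entries)}{match.group(3)}'
    return conf_text[:match.start()] + new_line + conf_text[match.end():]
```

The function is idempotent, so it can be run on every deployment without duplicating the entry.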

Robust

  • Stable interface to upload data to Dshuttle Qianxi
  • Figure out Dshuttle failure patterns (the master should not fail) binxuan
    1. Worker down/rejoin while jobs are running P0
    Cases: the job reads all data from Dshuttle, part of it from Dshuttle, or all of it from UFS.
    Expected behavior:
    - The user job continues running without any failure.
    - The rejoined worker node can serve requests.
    - Missing files are read from UFS.
    2. Client daemon failure while jobs are running P1
    - One failed fuse daemon does not affect other jobs running on the same node.
    3. Worker down/rejoin while preloading data P1
    Expected behavior:
    - A failed worker does not block the uploading process.
    - The rejoined worker continues to serve the task.
  • Test running jobs that:
    1. Consume all data from UFS
    2. Consume all data from Dshuttle
    3. Consume data that is partially in Dshuttle
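
The expected read behavior under worker failure can be illustrated with a toy model. This is only a sketch of the fallback logic described above, not Dshuttle's actual read path: workers and the UFS are plain dictionaries, and the name `read_with_fallback` is hypothetical.

```python
from typing import Dict, Set, Tuple


def read_with_fallback(
    path: str,
    workers: Dict[str, Dict[str, bytes]],
    live_workers: Set[str],
    ufs: Dict[str, bytes],
) -> Tuple[bytes, str]:
    """Read path from a live cache worker, falling back to the UFS.

    workers maps a worker name to the files it caches; live_workers holds
    the names of the workers that are currently reachable.
    """
    for name in sorted(live_workers):
        cached = workers.get(name, {})
        if path in cached:
            return cached[path], f"worker:{name}"
    # No live worker holds the file: read it from the under file system.
    if path in ufs:
        return ufs[path], "ufs"
    raise FileNotFoundError(path)
```

In this model, removing a worker from `live_workers` corresponds to "worker down" and adding it back corresponds to "rejoin"; the three test cases map to a file being cached on a live worker, cached only on a failed worker, or not cached at all.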

User experience

  • Show the under file system (UFS) to end users
  • Provide an end-to-end benchmark binxuan

P1

  • Integrate with the scheduler to preload data to Dshuttle
  • API to show the folder load percentage in Dshuttle
  • Provide a suitable way to preload data, and let the user know when the data is available in Dshuttle
  • Provide a CLI to let the user force a metadata sync with UFS
  • Cache policy improvement
  • Cross-job dataLoader optimizer
  • Display the Dshuttle write type (write through/write back) to end users
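
One possible definition of the folder load percentage, as a sketch: the share of a folder's bytes that is currently cached. The input format (a mapping from file path to cached and total byte counts) and the function name are assumptions made for illustration; how these numbers would be obtained from Dshuttle is not specified here.

```python
from typing import Dict, Tuple


def folder_load_percentage(folder: str, files: Dict[str, Tuple[int, int]]) -> float:
    """Return the percentage of bytes under folder that are cached.

    files maps a file path to (cached_bytes, total_bytes).
    An empty or unknown folder is reported as 0.0 (an arbitrary choice).
    """
    prefix = folder.rstrip("/") + "/"  # avoid matching /data against /database
    cached = 0
    total = 0
    for path, (cached_bytes, total_bytes) in files.items():
        if path.startswith(prefix):
            cached += cached_bytes
            total += total_bytes
    if total == 0:
        return 0.0
    return 100.0 * cached / total
```

Weighting by bytes rather than by file count keeps one large uncached file from being hidden behind many small cached ones.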
@Binyang2014
Contributor Author

Binyang2014 commented Jun 8, 2020

image

image

Submission page for Dshuttle

@Binyang2014
Contributor Author

image
User profile page for Dshuttle

@Binyang2014
Contributor Author

Binyang2014 commented Aug 10, 2020

Dshuttle storage is configured for the group. It is used as a fast data cache to speed up I/O-intensive workloads. It is read-only storage. For more details, please refer to the Dshuttle doc or contact the cluster admin.

How to upload data

Dshuttle

To upload data, please make sure your data is immutable. Then upload your data to the Dshuttle UFS (such as Azure Blob). You can find the Dshuttle UFS in the path preview.

How to use data

When you select team storage, the server path is automatically mounted to the path while the job is running. Please treat it as a local folder.

@DOGEwbx

DOGEwbx commented Aug 14, 2020

The Dshuttle failure pattern is:

  1. Worker down/rejoin while preloading data P1

Expected behavior:

  • A failed worker does not block the uploading process
  • The rejoined worker continues to serve the task

Actual behavior:

  • A failed worker does not block the uploading process
  • The rejoined worker continues to serve the task
  • Some of the data processed by the failed worker is not preloaded to Dshuttle, and the failed command line is printed for it. You need to re-run the preload to cache all of the data in Dshuttle
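
The re-preload step could be automated with a small retry loop. The sketch below assumes a hypothetical `preload` callable that takes a list of paths and returns the paths that failed; the real preload tool and the format of its failure output are not described in this thread.

```python
from typing import Callable, Iterable, List


def preload_until_cached(
    paths: Iterable[str],
    preload: Callable[[List[str]], List[str]],
    max_rounds: int = 3,
) -> List[str]:
    """Re-run preload on failed paths; return the paths that still failed.

    preload is a hypothetical callable: it receives the paths to load and
    returns the subset that was not cached (for example, because the
    worker handling them went down).
    """
    pending = list(paths)
    for _ in range(max_rounds):
        if not pending:
            break
        pending = preload(pending)
    return pending
```

An empty return value means every path was cached within `max_rounds` attempts; anything left over needs manual attention.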
