Add a replay toolkit #76

Yancey1989 · 2022-01-25T02:56:35Z

To make profiling a single cluster easier, we should implement a toolkit that replays a cluster. My preliminary idea is a toolkit with two phases:

First, dump the cluster args and the compiler input IR in protobuf format from disc_launch_op. Users specify the iteration to dump with an environment variable and then find the dump message in the logs, as in the following example:

Launch the training jobs with some environment variables:
```
export BLADEDISC_REPLAY_ITERATION=1000
export BLADEDISC_REPLAY_CLUSTER=cluster_24
python train.py > train.log
...
```
After the job has run for a while, users can find the replay log message with grep:
```
grep "BladeDISC replay toolkit" train.log
BladeDISC replay toolkit  dumps the disc compiler input file : `/tmp/tempfile-xxxx.input`, record args file: 
`/tmp/record_args.xxx.pb`
```
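The grep step above can also be scripted. The following is a minimal sketch, not part of the toolkit: the helper name `find_replay_files` is hypothetical, and it assumes the log message wraps both paths in backticks exactly as in the example output, with the second path possibly on the next line.

```python
import re

# Marker text taken from the example log message above.
MARKER = "BladeDISC replay toolkit"


def find_replay_files(log_text):
    """Return (input_path, args_path) from the first replay message, or None."""
    start = log_text.find(MARKER)
    if start == -1:
        return None
    # Assumption: both paths are backtick-quoted, as in the example output.
    paths = re.findall(r"`([^`]+)`", log_text[start:])
    if len(paths) < 2:
        return None
    return paths[0], paths[1]


# Sample text copied from the example output, including its line break.
sample = (
    "step 999 ...\n"
    "BladeDISC replay toolkit  dumps the disc compiler input file : "
    "`/tmp/tempfile-xxxx.input`, record args file: \n"
    "`/tmp/record_args.xxx.pb`\n"
)
print(find_replay_files(sample))
```

If the final log format differs from the example, only the regular expression needs to change.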
Second, replay the dumped cluster by running the disc_replay_main executable under the nvprof profiler:
```
nvprof disc_replay_main /tmp/tempfile-xxxx.input /tmp/record_args.xxx.pb
```
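For scripted profiling runs, the command above could be assembled programmatically. This is only a sketch under assumptions: `build_replay_command` is a hypothetical helper, and the argument order (input file first, record args file second) simply mirrors the example command, since disc_replay_main is still a TODO in this proposal.

```python
import shlex


def build_replay_command(input_path, args_path, profiler="nvprof"):
    """Build the argv list for replaying a dumped cluster, optionally under a profiler."""
    # Passing profiler=None yields a plain replay without profiling.
    cmd = [profiler] if profiler else []
    cmd += ["disc_replay_main", input_path, args_path]
    return cmd


cmd = build_replay_command("/tmp/tempfile-xxxx.input", "/tmp/record_args.xxx.pb")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Returning a list rather than a string keeps the command safe to pass to `subprocess.run` without shell quoting issues.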

TODOs:

- Implement the disc_replay_main executable.
- Dump the record args on the TensorFlow bridge side.

Yancey1989 · 2022-02-10T08:57:45Z

#84 implements this feature, so I will close this issue.

Yancey1989 added the feature label Jan 25, 2022

Yancey1989 self-assigned this Jan 25, 2022

Yancey1989 mentioned this issue Jan 27, 2022

Add DISC replay toolkit #84

Merged

Yancey1989 closed this as completed Feb 10, 2022
