Benchmark/Integrate benchmark scripts #10707
Conversation
… integrate_benchmark_scripts
… integrate_benchmark_scripts
benchmark/fluid/fluid_benchmark.py
Outdated
    default=0.001,
    help='The minibatch size.')
# TODO(wuyi): add this option back.
# parser.add_argument(
remove these
def append_nccl2_prepare():
    if os.getenv("PADDLE_TRAINER_ID", None) != None:
        # append gen_nccl_id at the end of startup program
        trainer_id = int(os.getenv("PADDLE_TRAINER_ID"))
What if the env does not exist?
If the user passes --update_method nccl2 but does not provide PADDLE_TRAINER_ID, the script raises an error; if the user does not pass --update_method at all, it defaults to local training.
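For illustration, a minimal sketch of that check; the helper name resolve_update_method is hypothetical and not part of the PR's code.

```python
# Sketch only: fail fast when nccl2 is requested but the required environment
# variable is missing, otherwise fall back to local training.
import os


def resolve_update_method(update_method):
    if update_method == "nccl2":
        trainer_id = os.getenv("PADDLE_TRAINER_ID")
        if trainer_id is None:
            raise RuntimeError(
                "--update_method nccl2 requires PADDLE_TRAINER_ID to be set.")
        return "nccl2", int(trainer_id)
    # No --update_method given: run as plain local training.
    return "local", None
```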
benchmark/fluid/README.md
Outdated
@@ -0,0 +1,60 @@
# Fluid Benchmark

This directory contains several models and tools that used to run
several models => several model configurations.
    exit(0)
return loss, inference_program, adam, train_reader, test_reader, batch_acc

# iters, num_samples, start_time = 0, 0, time.time()
Please delete this unused code.
Done.
benchmark/fluid/fluid_benchmark.py
Outdated
    '--with_test',
    action='store_true',
    help='If set, test the testset during training.')
parser.add_argument(
Maybe this should be enabled by default?
benchmark/fluid/fluid_benchmark.py
Outdated
if args.parallel == 0:
    # NOTE: parallel executor uses profiler internally
    if args.use_nvprof and args.device == 'GPU':
        with profiler.cuda_profiler("cuda_profiler.txt", 'csv') as nvprof:
Actually, the cuda_profiler is not used anymore; I will file a PR to delete it.
benchmark/fluid/fluid_benchmark.py
Outdated
        raise Exception(
            "Must configure correct environments to run dist train.")
    train_args.extend([train_prog, startup_prog])
    if args.parallel == 1 and os.getenv(
parallel == 1 makes the meaning of the thread count confusing; how about changing it to another name?
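One possible way to act on this suggestion, sketched with argparse; the flag name --use_parallel_executor is only an illustration, not necessarily what the PR adopts.

```python
# Illustrative only: replace the integer --parallel switch with an explicit
# boolean flag so it cannot be mistaken for a thread count.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    '--use_parallel_executor',
    action='store_true',
    help='If set, run with ParallelExecutor instead of the plain Executor.')
args = parser.parse_args()
print(args.use_parallel_executor)
```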
benchmark/fluid/kube_gen_job.py
Outdated
# to let container set rlimit
"securityContext": {
    "privileged": True
    # "capabilities": {
Commented-out code; please remove it.
benchmark/fluid/kube_gen_job.py
Outdated
import random
import os

pserver = {
The distributed jobs share a lot of the same configuration, but right now these templates and environment variables are tangled up inside the job-generation script. We could put the shared settings in a template YAML configuration file and override the default values with arguments in the submit scripts.
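A rough sketch of that suggestion; the template file name, the flag names, and the PyYAML dependency are all assumptions made for illustration.

```python
# Sketch: keep the shared Kubernetes spec in a template file and apply only the
# per-job overrides from command-line arguments in the submit script.
import argparse

import yaml  # PyYAML; assumed available


def load_job_spec(template_path, name, replicas):
    with open(template_path) as f:
        spec = yaml.safe_load(f)       # shared defaults live in the template
    spec["metadata"]["name"] = name    # per-job values come from the arguments
    spec["spec"]["replicas"] = replicas
    return spec


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--jobname", default="fluid-benchmark")
    parser.add_argument("--pservers", type=int, default=2)
    args = parser.parse_args()
    pserver = load_job_spec("pserver_template.yaml",
                            args.jobname + "-pserver", args.pservers)
    print(yaml.safe_dump(pserver))
```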
benchmark/fluid/kube_gen_job.py
Outdated
import os

pserver = {
    "apiVersion": "extensions/v1beta1",
Also, in my view, the YAML format is more concise than JSON. Which one do you prefer?
Comments are all addressed. I kept using JSON so that we can use it directly as in-memory data.
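To illustrate the in-memory-data point, a minimal example; the apiVersion value is taken from the quoted spec, while the other fields and names are placeholders.

```python
# The JSON-style dict is already Python data, so the generator can deep-copy it,
# patch a few fields, and dump a manifest without any template-parsing step.
import copy
import json

base_spec = {
    "apiVersion": "extensions/v1beta1",   # value shown in the quoted diff
    "metadata": {"name": ""},
    "spec": {"replicas": 1},
}

pserver = copy.deepcopy(base_spec)
pserver["metadata"]["name"] = "fluid-benchmark-pserver"   # placeholder name
pserver["spec"]["replicas"] = 4

print(json.dumps(pserver, indent=2))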
… integrate_benchmark_scripts
Great job. We can launch the distributed benchmark scripts on CE now.
    batch_acc)
print(", Test Accuracy: %f" % pass_test_acc)
print("\n")
# TODO(wuyi): add warmup passes to get better perf data.
The skip_batch_num argument does the trick. In my experiments on a local machine, skipping 5-10 batches is enough.
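For context, a sketch of how such a warm-up skip typically works; the loop below is illustrative and is not the code in fluid_benchmark.py.

```python
# Skip the first few batches before starting the clock so start-up cost does
# not skew the reported throughput.
import time


def run_benchmark(train_reader, run_step, skip_batch_num=5):
    num_samples, start_time = 0, time.time()
    for iters, data in enumerate(train_reader()):
        if iters == skip_batch_num:
            # Reset counters once the warm-up batches are done.
            num_samples, start_time = 0, time.time()
        run_step(data)
        num_samples += len(data)
    elapsed = time.time() - start_time
    print("throughput: %.2f samples/sec" % (num_samples / elapsed))
```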
Integrate all the benchmark Python programs into one, so that a single command can start either local CPU/GPU benchmarking or distributed multi-GPU benchmarking.
In distributed mode, the corresponding environment variables must be set so that each worker knows which role its node plays.
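As a rough illustration of the role-from-environment idea: PADDLE_TRAINER_ID appears in the diffs above, while PADDLE_TRAINING_ROLE is an assumed name that may not match the script's actual variables.

```python
# Decide how to run based on environment variables set by the launcher.
import os


def detect_mode():
    role = os.getenv("PADDLE_TRAINING_ROLE")     # e.g. "PSERVER"/"TRAINER" (assumed name)
    trainer_id = os.getenv("PADDLE_TRAINER_ID")  # per-trainer id, as in the diffs above
    if role is None and trainer_id is None:
        return "local"                           # nothing set: local CPU/GPU benchmark
    return "distributed:%s" % (role or "TRAINER")


print(detect_mode())
```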