
[AutoParallel] Visualize flow parallel timing diagram in static graph mode #58313

Merged
60 commits merged on Nov 21, 2023
Changes from 13 commits (60 commits total)
c514fbd
merge from openvino master
AndSonder Oct 18, 2023
0147f70
add InterpreterRunTime() to record interpreter's run time
AndSonder Oct 20, 2023
6d1dc3d
add profiler helper static to produce json file
AndSonder Oct 20, 2023
6f4f67c
add color map and support perfetto format
AndSonder Oct 23, 2023
14fd116
Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…
AndSonder Oct 23, 2023
4d51610
recover codes
AndSonder Oct 23, 2023
c70d9f9
control include env for gpu_timer.h
AndSonder Oct 23, 2023
ad0f17a
fix logic for profiler_helper_static.py
AndSonder Oct 23, 2023
e0442c6
fix build error
AndSonder Oct 23, 2023
a8a37bb
fix build error
AndSonder Oct 23, 2023
a20e6ce
recover thirdparty
AndSonder Oct 23, 2023
3e10a6d
add flag control: not support new ir now
AndSonder Oct 24, 2023
59b425e
set auto_parallel_profiler flag to false
AndSonder Oct 25, 2023
ddc5038
fix
AndSonder Oct 26, 2023
14f6228
Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…
AndSonder Oct 26, 2023
1dfc816
add auto_parallel_profiler as command parameter
AndSonder Oct 26, 2023
9f271ef
fix value name
AndSonder Oct 26, 2023
dabf964
support gettimeofday for win env
AndSonder Oct 27, 2023
6ad6f36
fix win build error
AndSonder Oct 27, 2023
d58cc94
fix win build error
AndSonder Oct 27, 2023
e9886ae
use job_type_to_id
AndSonder Oct 27, 2023
282285b
Fixed repeatedly timing the same stream
AndSonder Oct 27, 2023
3b0db0c
Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…
AndSonder Oct 31, 2023
fdc3f6d
add step line for timeline
AndSonder Nov 1, 2023
1ceadc5
add step timeline and fix logic when job overlap
AndSonder Nov 2, 2023
679cc39
update time record logic
AndSonder Nov 6, 2023
8953ae9
Merge branch 'develop' into add_profiler
AndSonder Nov 6, 2023
1a04fea
fix bug when start profile start from none zero step
AndSonder Nov 7, 2023
e1c619d
fix note
AndSonder Nov 7, 2023
58c9f65
Merge branch 'add_profiler' of https://github.com/AndSonder/Paddle in…
AndSonder Nov 7, 2023
9c8b740
remove FLAGS_auto_parallel_profiler
AndSonder Nov 7, 2023
24b7e79
Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…
AndSonder Nov 7, 2023
63de31b
use run config instead FLAGS_auto_parallelxx
AndSonder Nov 7, 2023
8218ecb
fix color map logic
AndSonder Nov 7, 2023
4b318fc
fix color map logic
AndSonder Nov 7, 2023
9f949f2
fix bug when log step does not start from 0
AndSonder Nov 8, 2023
ffc7b39
fix
AndSonder Nov 9, 2023
1925dd7
fix
AndSonder Nov 9, 2023
d299723
don't use set_enable_auto_parallel_profiler
AndSonder Nov 9, 2023
5297b7a
fix bug
AndSonder Nov 9, 2023
8bfb6c0
disable auto_parallel_profiler when not open flag by command line
AndSonder Nov 9, 2023
13b14d1
fix bug
AndSonder Nov 9, 2023
5bb55e1
remove resettime
AndSonder Nov 10, 2023
f422b33
fix build bug
AndSonder Nov 13, 2023
ed5f7fc
fix
AndSonder Nov 13, 2023
718cf17
remove set enable
AndSonder Nov 14, 2023
f36b57b
fix build error
AndSonder Nov 15, 2023
444b7a7
fix build error
AndSonder Nov 15, 2023
f494916
fix build error
AndSonder Nov 15, 2023
28f089f
Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…
AndSonder Nov 15, 2023
a2b5988
fix ci error
AndSonder Nov 15, 2023
fb748d9
fix
AndSonder Nov 15, 2023
aa5570d
fix run error
AndSonder Nov 15, 2023
6b18e10
fix
AndSonder Nov 15, 2023
f096253
fix
AndSonder Nov 16, 2023
560fb61
fix calculate_stream_timer logic
AndSonder Nov 16, 2023
bbb3071
remove fluid head
AndSonder Nov 17, 2023
e15c19e
fix build error
AndSonder Nov 17, 2023
989348c
set default value for enable_job_schedule_profiler
AndSonder Nov 17, 2023
10b84d8
Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…
AndSonder Nov 18, 2023
6 changes: 6 additions & 0 deletions paddle/fluid/framework/new_executor/interpreter_base_impl.h
@@ -38,6 +38,10 @@
#include "paddle/fluid/platform/device_event.h"
#include "paddle/phi/backends/device_manager.h"

#if defined(PADDLE_WITH_CUDA)
#include "paddle/phi/kernels/autotune/gpu_timer.h"
Reviewer (Contributor): gpu_timer relates only to how the class interface is implemented, not to the class definition. It should be included only in the .cc files that actually use it, not in the base-class header.

#endif

PD_DECLARE_bool(new_executor_serial_run);
PD_DECLARE_bool(new_executor_static_build);
PD_DECLARE_bool(new_executor_use_inplace);
@@ -103,6 +107,8 @@ class InterpreterBaseImpl {
std::vector<paddle::framework::OpFuncNode>* op_func_nodes) = 0;

virtual bool IsStaticBuild() const = 0;

virtual std::tuple<double, double> InterpreterRunTime() = 0;
};

inline void SetDeviceId(const platform::Place& place) {
8 changes: 8 additions & 0 deletions paddle/fluid/framework/new_executor/interpretercore.cc
@@ -34,6 +34,10 @@ PADDLE_DEFINE_EXPORTED_bool(new_executor_use_local_scope,
true,
"Use local_scope in new executor(especially used "
"in UT), can turn off for better performance");
PADDLE_DEFINE_EXPORTED_bool(auto_parallel_profiler,
Reviewer (Contributor): Why is this FLAG still needed?

Author: Removed.

false,
"Enable auto parallel profiler, collecting the "
"runtime of jobs in different devices");

namespace paddle {
namespace framework {
@@ -129,5 +133,9 @@ void InterpreterCore::Build(

bool InterpreterCore::IsStaticBuild() const { return impl_->IsStaticBuild(); }

std::tuple<double, double> InterpreterCore::InterpreterRunTime() {
return impl_->InterpreterRunTime();
}

} // namespace framework
} // namespace paddle
2 changes: 2 additions & 0 deletions paddle/fluid/framework/new_executor/interpretercore.h
@@ -79,6 +79,8 @@ class InterpreterCore {

bool IsStaticBuild() const;

std::tuple<double, double> InterpreterRunTime();

private:
DISABLE_COPY_AND_ASSIGN(InterpreterCore);

5 changes: 5 additions & 0 deletions paddle/fluid/framework/new_executor/new_ir_interpreter.cc
@@ -276,6 +276,11 @@ void NewIRInterpreter::ShareBuildResultsFrom(const InterpreterBaseImpl& src) {
<< ") to InterpreterCore(" << this << ")";
}

std::tuple<double, double> NewIRInterpreter::InterpreterRunTime() {
PADDLE_THROW(platform::errors::Unimplemented(
"NewIRInterpreter::InterpreterRunTime is not implemented."));
}

const interpreter::NewIrDependencyBuilder&
NewIRInterpreter::GetNewIrDependencyBuilder() const {
return ir_dependency_builder_;
2 changes: 2 additions & 0 deletions paddle/fluid/framework/new_executor/new_ir_interpreter.h
@@ -60,6 +60,8 @@ class NewIRInterpreter : public InterpreterBaseImpl {

void ShareBuildResultsFrom(const InterpreterBaseImpl& src) override;

std::tuple<double, double> InterpreterRunTime() override;

std::shared_ptr<std::vector<size_t>> GetDependencyCount() const override;

bool IsSharedResultsBuild() const override;
80 changes: 80 additions & 0 deletions paddle/fluid/framework/new_executor/program_interpreter.cc
@@ -39,6 +39,7 @@
#include "paddle/phi/core/flags.h"
PHI_DECLARE_bool(dynamic_static_unified_comm);
#endif
PHI_DECLARE_bool(auto_parallel_profiler);

namespace paddle {
namespace framework {
@@ -103,6 +104,16 @@ ProgramInterpreter::~ProgramInterpreter() {
}

void ProgramInterpreter::RunImpl() {
#if defined(PADDLE_WITH_CUDA)
if (FLAGS_auto_parallel_profiler) {
// Note(sonder): Record the start time of each stream.
Reviewer (Contributor): A NOTE is normally used to explain complex, hard-to-read code, or to convey information that cannot be expressed in the code itself. These few lines are simple and direct, and this NOTE only restates them, so it can be dropped.

for (size_t i = 0; i < stream_timers_.size(); ++i) {
auto& stream_timer = stream_timers_[i];
stream_timer.Start();
}
}
#endif

// lazy initialization of gc, do not create gc is the program only run once
if (!gc_) {
gc_ = CreateInterpreterCoreGarbageCollector(place_, vec_instruction_);
@@ -127,6 +138,15 @@
platform::DeviceContextPool::Instance().Get(place_)->Wait();
}
#endif

#if defined(PADDLE_WITH_CUDA)
if (FLAGS_auto_parallel_profiler) {
for (size_t i = 0; i < stream_timers_.size(); ++i) {
auto& stream_timer = stream_timers_[i];
stream_timer.Stop();
}
}
#endif
}

FetchList ProgramInterpreter::Run(const std::vector<std::string>& feed_names,
@@ -622,6 +642,62 @@ void ProgramInterpreter::ClearLoDTensorArrayInLocalScope() {
}
}

void ProgramInterpreter::AddGpuStreamEvents() {
#if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP)
stream_timers_.clear();
std::vector<gpuStream_t> streams;
bool has_default_stream = false;
Reviewer (Contributor): The Paddle framework never uses the null (default) stream, so this case does not need to be handled.

for (size_t i = 0; i < vec_instruction_.size(); ++i) {
auto& instr = vec_instruction_[i];
if ((instr.KernelType() != OpFuncType::kGpuAsync) ||
(instr.DeviceContext().GetPlace().GetType() ==
phi::AllocationType::CUSTOM)) {
continue;
}

gpuStream_t stream =
reinterpret_cast<const phi::GPUContext&>(instr.DeviceContext())
.stream();

if (stream == nullptr) {
has_default_stream = true;
} else if (std::find(streams.begin(), streams.end(), stream) ==
streams.end()) {
streams.push_back(stream);
}
}
size_t timers_size = has_default_stream ? streams.size() + 1 : streams.size();
stream_timers_.resize(timers_size);
for (size_t i = 0; i < streams.size(); ++i) {
stream_timers_[i].SetStream(streams[i]);
}
if (has_default_stream) {
stream_timers_.back().SetStream(nullptr);
}

#endif
}

std::tuple<double, double> ProgramInterpreter::InterpreterRunTime() {
double min_start_time = std::numeric_limits<double>::max(),
max_end_time = std::numeric_limits<double>::lowest();
#if defined(PADDLE_WITH_CUDA)
for (size_t i = 0; i < stream_timers_.size(); ++i) {
auto& stream_timer = stream_timers_[i];
double start_time = stream_timer.StartTime();
double end_time = stream_timer.EndTime();

min_start_time = std::min(min_start_time, start_time);
max_end_time = std::max(max_end_time, end_time);

VLOG(3) << "ProgramInterpreter::InterpreterRunTime:"
<< "start_time: " << std::to_string(start_time)
<< ", end_time: " << std::to_string(end_time) << ", min_start_time"
<< std::to_string(min_start_time)
<< ", max_end_time: " << std::to_string(max_end_time);
}
#endif
return std::make_tuple(min_start_time, max_end_time);
}

void ProgramInterpreter::Convert(
std::vector<paddle::framework::OpFuncNode>* op_func_nodes) {
auto& vec_meta_info = var_scope_.MutableVecMetaInfo();
@@ -658,6 +734,10 @@
vec_instruction_.emplace_back(op_idx, std::move(op_func_node), *dev_ctx_);
}

if (FLAGS_auto_parallel_profiler) {
AddGpuStreamEvents();
}

BuildOperatorDependences();

// NOTE(Ruibiao): For cross-step stream synchronization, an event may be
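The reduction performed by `InterpreterRunTime` above can be isolated as a small host-side function: each stream timer contributes a (start, end) window, and the interpreter's overall run spans from the earliest start to the latest end. A minimal standalone sketch follows — the `StreamWindow` struct and `RunWindow` name are hypothetical stand-ins, not Paddle's API:

```cpp
#include <algorithm>
#include <limits>
#include <tuple>
#include <vector>

// Hypothetical stand-in for one phi::GpuTimer's recorded window (ms).
struct StreamWindow {
  double start;  // wall-clock start time
  double end;    // start + elapsed
};

// Mirrors the aggregation in ProgramInterpreter::InterpreterRunTime:
// the run window is [earliest stream start, latest stream end].
std::tuple<double, double> RunWindow(const std::vector<StreamWindow>& timers) {
  double min_start = std::numeric_limits<double>::max();
  double max_end = std::numeric_limits<double>::lowest();
  for (const auto& t : timers) {
    min_start = std::min(min_start, t.start);
    max_end = std::max(max_end, t.end);
  }
  return std::make_tuple(min_start, max_end);
}
```

Note that seeding `max_end` with `lowest()` rather than `min()` matters: `std::numeric_limits<double>::min()` is the smallest positive value, which would silently break the comparison for negative timestamps.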
8 changes: 8 additions & 0 deletions paddle/fluid/framework/new_executor/program_interpreter.h
@@ -98,6 +98,8 @@ class ProgramInterpreter : public InterpreterBaseImpl {

bool IsStaticBuild() const override { return static_build_; }

std::tuple<double, double> InterpreterRunTime() override;

private:
// build graph
void Convert(std::vector<paddle::framework::OpFuncNode>* op_func_nodes);
@@ -149,6 +151,8 @@
// For log and debug
std::string GetDepsString() const;

void AddGpuStreamEvents();

bool is_build_{false};
bool static_build_{false};
// Note(sonder): share the op dependency and event analysis procedure.
@@ -210,6 +214,10 @@
InstructionSchedulingPriorityLess instruction_scheduling_priority_less;

std::vector<HookFunc> hookfuncs_;

#if defined(PADDLE_WITH_CUDA)
std::vector<phi::GpuTimer> stream_timers_;
#endif
};

} // namespace framework
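`AddGpuStreamEvents` allocates one timer per distinct GPU stream, plus one extra slot if the default (null) stream appears — the commit "Fixed repeatedly timing the same stream" suggests the deduplication is the point. The counting step can be sketched in isolation; `StreamId` and `CountStreamTimers` are hypothetical stand-ins for `gpuStream_t` and the in-tree logic:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using StreamId = const void*;  // stand-in for gpuStream_t

// Returns the number of timers needed: one per unique non-null stream,
// plus one shared slot for the default (null) stream if it was seen.
std::size_t CountStreamTimers(const std::vector<StreamId>& seen) {
  std::vector<StreamId> unique_streams;
  bool has_default = false;
  for (StreamId s : seen) {
    if (s == nullptr) {
      has_default = true;  // default stream gets a single shared timer
    } else if (std::find(unique_streams.begin(), unique_streams.end(), s) ==
               unique_streams.end()) {
      unique_streams.push_back(s);  // dedup: time each stream only once
    }
  }
  return unique_streams.size() + (has_default ? 1 : 0);
}
```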
18 changes: 18 additions & 0 deletions paddle/fluid/framework/new_executor/standalone_executor.cc
@@ -30,6 +30,7 @@
PHI_DECLARE_bool(enable_new_ir_in_executor);
PHI_DECLARE_bool(enable_pir_api);
PHI_DECLARE_bool(new_ir_apply_inplace_pass);
PHI_DECLARE_bool(auto_parallel_profiler);

namespace paddle {
namespace framework {
@@ -205,6 +206,23 @@ paddle::framework::FetchList StandaloneExecutor::Run(
}
}

// record each job's run time
#if defined(PADDLE_WITH_CUDA)
if (FLAGS_auto_parallel_profiler && !FLAGS_enable_new_ir_in_executor) {
for (size_t job_idx = 0; job_idx < jobs.size(); ++job_idx) {
const auto& job = jobs[job_idx];
const std::string& job_type = job->Type();
double start_time, end_time;
std::tie(start_time, end_time) =
interpretercores_[job_idx]->InterpreterRunTime();
VLOG(0) << "Profiler Info: Job (" << job_idx << "), type = " << job_type
Reviewer (Contributor): Add a comment here explaining what this log is for; otherwise someone unfamiliar with it may change it by mistake.

<< ", micro_batch_id = " << job->MicroBatchId()
<< ", job_start_time = " << std::to_string(start_time)
<< ", job_end_time = " << std::to_string(end_time);
}
}
#endif

// return Fetch Tensors
if (FLAGS_enable_new_ir_in_executor) {
framework::FetchList fetch_res;
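Per the commit log, a helper script (profiler_helper_static.py) turns these "Profiler Info" log lines into a JSON timeline in Chrome-trace/Perfetto format. As a rough illustration of that mapping — the field layout below follows the Chrome trace-event format, not the actual script, and `JobTraceEvent` is a hypothetical name — each job's (start, end) pair becomes one complete ("X") event with microsecond `ts`/`dur`:

```cpp
#include <sstream>
#include <string>

// Build one Chrome-trace "complete" event for a job's time window.
// start_ms/end_ms are the values logged by StandaloneExecutor::Run;
// ts and dur must be expressed in microseconds.
std::string JobTraceEvent(const std::string& job_type, int micro_batch_id,
                          double start_ms, double end_ms) {
  std::ostringstream os;
  os << "{\"name\": \"" << job_type << " (mb " << micro_batch_id << ")\", "
     << "\"ph\": \"X\", "
     << "\"ts\": " << start_ms * 1000.0 << ", "
     << "\"dur\": " << (end_ms - start_ms) * 1000.0 << ", "
     << "\"pid\": 0, \"tid\": 0}";
  return os.str();
}
```

A file of such events (wrapped in a `traceEvents` array) can be loaded directly into chrome://tracing or Perfetto to visualize the pipeline-parallel schedule.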
21 changes: 20 additions & 1 deletion paddle/phi/kernels/autotune/gpu_timer.h
@@ -14,6 +14,8 @@

#pragma once

#include <sys/time.h>

#include "paddle/phi/backends/gpu/gpu_decls.h"
#include "paddle/phi/core/enforce.h"
#include "paddle/phi/core/errors.h"
@@ -68,7 +70,18 @@ class GpuTimer {
#endif
}

float ElapsedTime() {
void Start() {
struct timeval time_now {};
gettimeofday(&time_now, nullptr);
start_time_ = (time_now.tv_sec * 1000) + (time_now.tv_usec / 1000.0);
Reviewer (Contributor): A comment could be added here explaining why CPU time is needed as start_time.

Start(stream_);
}

void Stop() { Stop(stream_); }

void SetStream(gpuStream_t stream) { stream_ = stream; }
Reviewer (Contributor): Setting the stream via SetStream and then calling the parameterless Start/Stop seems equivalent to calling the Start/Stop overloads that take a stream argument. In that case, adding a second, equivalent interface is not recommended.

Author (Contributor): The code here had not been fully updated; the latest version is now pushed, please take another look. The parameterless Start/Stop used after SetStream also contains the cudaStreamAddCallback logic.


double ElapsedTime() {
float milliseconds = 0;
#ifdef PADDLE_WITH_HIP
hipEventSynchronize(stop_);
@@ -80,9 +93,15 @@
return milliseconds;
}

double StartTime() { return start_time_; }

double EndTime() { return ElapsedTime() + start_time_; }

private:
gpuEvent_t start_;
gpuEvent_t stop_;
gpuStream_t stream_;
double start_time_;
};

} // namespace phi
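The timestamp scheme these GpuTimer changes rely on: `Start()` records a CPU wall-clock anchor via `gettimeofday` (CUDA events only measure elapsed time, not absolute time), and `EndTime()` is that anchor plus the GPU-measured elapsed milliseconds. The arithmetic can be checked host-only — `NowMs`/`EndTimeMs` below are illustrative names, with no CUDA involved:

```cpp
#include <sys/time.h>

// CPU wall-clock in milliseconds, as GpuTimer::Start records it.
double NowMs() {
  struct timeval tv {};
  gettimeofday(&tv, nullptr);
  return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

// EndTime() = CPU anchor taken at Start() + GPU-measured elapsed ms.
// The GPU events give only a duration; the CPU anchor places that
// duration on an absolute timeline shared by all streams and jobs.
double EndTimeMs(double start_ms, double gpu_elapsed_ms) {
  return start_ms + gpu_elapsed_ms;
}
```

Because every timer's window is anchored to the same host clock, windows from different streams (and different jobs) can be compared and drawn on one timeline, which is what the visualization depends on.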