inference engine related design #10198
Conversation
Have we verified the performance of using TensorRT as a sub-graph?
We will get a benchmark next week. @panyx0718
The inference phase need to support some special hardware for acceleration,
such as GPU, FPGA, and ARM.
Special softwares power some of these hardwares and the inner states are hidden, for example, the TensorRT is released by NVidia to improve the inference performance on GPUs, it takes a computation graph as input,
optimize and execute it, but the users can't directly modify its internal logics.
Special softwares power some of these hardwares and the inner states are hidden. For example, TensorRT is released by NVIDIA to improve the inference performance on GPUs. It takes a computation graph as input, optimizes and executes it, while users can't directly modify its internal logic.
## Use Engines to Execute Sub-blocks
Compared to Paddle Fluid, the engines cover a limited number of operators and can only power several kinds of models. In other words, the engines can only support a part of Fluid.
Motivation of sub-blocks method
line 13 + some information from tensorflow/models#4028, in order to tell people why we use the sub-blocks method instead of using TensorRT directly.
Use Engines to Execute Sub-blocks
line 14
...
It is easy to parallelize the computation by scheduling several engines on different devices, for example, the CPU and GPU engines can be dispatched in the meantime
Add a `.` after "meantime".
We use a `with-statement` to mark the sub-block as follows.
```python
with infer.power_by_engine('tensorrt'):
    o = some_op()
    o = some_op()
```
What's the type of `infer`, ProgramDesc? The following is the current transpiler interface, whose parameter is a ProgramDesc:

```python
t = fluid.InferenceTranspiler()
t.transpile(inference_transpiler_program, place)
```

In my mind, the interface for the automatic detection mode is:

```python
t = fluid.InferenceTranspiler()
t.transpile(inference_transpiler_program, place, engine='tensorrt')

def transpile(inference_transpiler_program, place, engine):
    if engine == "tensorrt":
        power_by_tensorrt_engine(inference_transpiler_program)
    else:
        ...
```
`infer` is a module: `import paddle.inference as infer`
```python
with infer.power_by_engine('tensorrt'):
    o = some_op()
    o = some_op()
```
What's the meaning of `o = some_op()`?
No practical meaning; it just shows that there are several operators there.
### EngineOp
`EngineOp` is just a normal Fluid operator, which has an attribute called `subblock` to get the Fluid description about a sub-block.
`subblock` -> `sub_block`
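To make the discussion concrete, here is a minimal standalone sketch of how an `EngineOp` could carry a `sub_block` attribute and hand it to an engine at run time. This is not Fluid's real operator interface; `BlockDesc`, `Engine`, and `TensorRTEngine` below are simplified stand-ins for illustration only.

```c++
#include <iostream>
#include <memory>
#include <string>
#include <utility>

// Simplified stand-ins for Fluid's real sub-block description and engine API.
struct BlockDesc {
  std::string debug_string;  // pretend serialized form of the sub-block
};

struct Engine {
  virtual ~Engine() = default;
  virtual void Build(const BlockDesc& block) = 0;  // compile the sub-block
  virtual void Execute() = 0;                      // run the compiled graph
};

struct TensorRTEngine : Engine {
  void Build(const BlockDesc& block) override {
    std::cout << "building network from: " << block.debug_string << "\n";
  }
  void Execute() override { std::cout << "executing engine\n"; }
};

// A "normal" operator whose only special part is the sub_block attribute.
class EngineOp {
 public:
  EngineOp(BlockDesc sub_block, std::unique_ptr<Engine> engine)
      : sub_block_(std::move(sub_block)), engine_(std::move(engine)) {}

  // Build the engine from the sub-block on first run, then just execute.
  void Run() {
    if (!built_) {
      engine_->Build(sub_block_);
      built_ = true;
    }
    engine_->Execute();
  }

 private:
  BlockDesc sub_block_;  // the attribute discussed above
  std::unique_ptr<Engine> engine_;
  bool built_ = false;
};

int main() {
  EngineOp op(BlockDesc{"fc -> relu -> fc"}, std::make_unique<TensorRTEngine>());
  op.Run();  // builds, then executes
  op.Run();  // executes only
  return 0;
}
```

The point is only that the operator itself stays a normal op; everything engine-specific hides behind the `Engine` interface it owns.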
```c++
enum class DeviceType {
  CPU = 0,
  GPU
};
```
GPU=1?
The enum syntax only needs to set the first element; the following elements will increase automatically.
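For readers unfamiliar with that rule, a tiny standalone snippet (not Fluid code) that checks the implicit value:

```c++
#include <iostream>

enum class DeviceType {
  CPU = 0,
  GPU  // implicitly CPU + 1, i.e. 1
};

static_assert(static_cast<int>(DeviceType::GPU) == 1, "GPU is implicitly 1");

int main() {
  // Scoped enums need an explicit cast to print the underlying value.
  std::cout << static_cast<int>(DeviceType::CPU) << "\n";  // 0
  std::cout << static_cast<int>(DeviceType::GPU) << "\n";  // 1
  return 0;
}
```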
The `EngineOutputConvertOp` is similar.
### Optimizer for sub-block
`Optimizer` -> `Transpiler`
An optimizer is not a Transpiler. It corresponds to the optimization in a compiler.
### Optimizer for sub-block
```c++
// The InferenceOptimizers input a program desc and output a block desc.
```
Input a program desc, but the output may be a series of sub-block descs.
Input a program desc, output a program desc with several newly inserted EngineOps, with their attributes set to the sub-blocks.
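As I understand that reply, a rough standalone sketch of such a pass is below. The type names (`ProgramDesc`, `OpDesc`, `InferenceOptimizer`, `TensorRTSubgraphOptimizer`) are simplified guesses for illustration, not the actual Fluid classes:

```c++
#include <string>
#include <vector>

// Simplified stand-ins for the real Fluid descriptors.
struct OpDesc { std::string type; };
struct ProgramDesc { std::vector<OpDesc> ops; };

// Base pass: takes a program desc and rewrites it in place.
class InferenceOptimizer {
 public:
  virtual ~InferenceOptimizer() = default;
  virtual void Optimize(ProgramDesc* prog) const = 0;
};

// One concrete pass: replace the engine-compatible ops with a single
// "tensorrt_engine" op; the replaced ops would become its sub_block attribute.
class TensorRTSubgraphOptimizer : public InferenceOptimizer {
 public:
  void Optimize(ProgramDesc* prog) const override {
    ProgramDesc rewritten;
    bool engine_op_inserted = false;
    for (const auto& op : prog->ops) {
      if (IsSupportedByEngine(op)) {
        if (!engine_op_inserted) {
          rewritten.ops.push_back({"tensorrt_engine"});
          engine_op_inserted = true;
        }
        // Supported ops are absorbed into the engine op's sub-block.
      } else {
        rewritten.ops.push_back(op);
      }
    }
    *prog = rewritten;
  }

 private:
  static bool IsSupportedByEngine(const OpDesc& op) {
    return op.type == "conv2d" || op.type == "relu" || op.type == "fc";
  }
};

int main() {
  ProgramDesc prog{{{"feed"}, {"conv2d"}, {"relu"}, {"fc"}, {"fetch"}}};
  TensorRTSubgraphOptimizer().Optimize(&prog);
  // prog.ops is now: feed, tensorrt_engine, fetch
  return 0;
}
```

In the real design the absorbed ops would of course be preserved as the EngineOp's sub-block rather than discarded as they are in this toy version.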
```c++
// Different implementations will rewrite the original program desc by different logics.
// There might be many different optimizers, such as
// - CleanUselessOptimizer
// - PruneOpOptimizer
```
What are CleanUselessOptimizer and PruneOpOptimizer? We already have a prune method for inference; see paddle/fluid/framework/prune.cc.
Yes, I think a factory pattern of Operators is a better interface; maybe we'd better refactor those codes.
Are all of the above implemented and run on the C++ end? Thus, how to use them?
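If it helps the factory-pattern discussion, here is a small standalone sketch of a name-to-optimizer registry. The class names are hypothetical, and the `InferenceOptimizer`/`ProgramDesc` stand-ins are simplified, not the real Fluid types:

```c++
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>

struct ProgramDesc {};  // simplified stand-in

class InferenceOptimizer {
 public:
  virtual ~InferenceOptimizer() = default;
  virtual void Optimize(ProgramDesc* prog) const = 0;
};

// Passes register themselves by name, so callers can look them up instead of
// hard-coding each optimizer.
class OptimizerFactory {
 public:
  using Creator = std::function<std::unique_ptr<InferenceOptimizer>()>;

  static OptimizerFactory& Instance() {
    static OptimizerFactory factory;
    return factory;
  }

  void Register(const std::string& name, Creator creator) {
    creators_[name] = std::move(creator);
  }

  std::unique_ptr<InferenceOptimizer> Create(const std::string& name) const {
    auto it = creators_.find(name);
    if (it == creators_.end()) {
      throw std::runtime_error("unknown optimizer: " + name);
    }
    return it->second();
  }

 private:
  std::map<std::string, Creator> creators_;
};

class PruneOpOptimizer : public InferenceOptimizer {
 public:
  void Optimize(ProgramDesc*) const override { /* pruning logic would go here */ }
};

int main() {
  OptimizerFactory::Instance().Register(
      "prune", [] { return std::make_unique<PruneOpOptimizer>(); });
  ProgramDesc prog;
  OptimizerFactory::Instance().Create("prune")->Optimize(&prog);
  return 0;
}
```

A registry like this keeps the transpiler front end decoupled from which passes exist, which seems to be the point of the refactor suggested above.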
The inference might have its own SDK. The Anakin and MDL teams will join together to design the inference SDK, and there might be some further designs about these issues. @luotao1
Sorry, from this design PR and the other code PR, I still haven't been able to grasp the intent of this design. Let's have a video meeting.
# Utilize Engines to Accelerate Inference |
What do the "engines" here refer to?
It looks like it means TensorRT? I saw a base class proposed later on, and I also saw this base class in the other code PR. Is this so that classes for other "engines" besides TensorRT can be derived in the future?
TensorRT, Anakin, or other similar libraries that come with their own complete optimization.
The inference phase need to support some special hardware for acceleration, |
The inference phase need to support some special hardware
=>
We want to utilize DL chips to accelerate the inference of Fluid models.
fixes: #10028