Add skeleton of refactor notes #3705
This is not intended to be a global design doc for the refactoring; it only lists the problems we should be concerned with.
## Problems to solve

* [Why introduce Op](notes/why_use_op.md)?
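Here are a few concrete questions I can think of; I feel we need an FAQ for developers.
- What does an Op's infershape mainly do, and why does it need to run twice?
- Why is compute stateless, and why shouldn't a Var be created during compute?
- What data types does a Var mainly hold? Must it be created in a scope, and why?
- For a computation graph, how do we decide which part to run: by explicitly splitting the net into segments, or by deriving it automatically from dependencies and targets?
- If there are targets, what should a target be, an op or a var, and why?
- What is in-place, and in which scenarios does it occur (for example, parameter update)?
- Does Paddle distinguish between a compile-time phase and a run-time phase when building a network? How are they distinguished, and where is the boundary? (Based on yesterday's discussion there are two possibilities: VarDesc ==> scope, or tensors with memory versus tensors without memory.) Why split it this way, and what are the benefits?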
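
To make the "derive it automatically from dependencies and targets" option concrete, here is a minimal sketch. It is not Paddle code: the `Op` record, the `prune` helper, and the example net are all hypothetical, and the sketch assumes targets are variable names and that the ops are already listed in execution order.

```python
from collections import namedtuple

# Hypothetical op record: a name plus the variable names it reads and writes.
Op = namedtuple("Op", ["name", "inputs", "outputs"])


def prune(ops, targets):
    """Return the ops (in original order) needed to compute the target vars.

    Assumes `ops` is already in a valid execution (topological) order.
    """
    needed = set(targets)
    kept = []
    # Walk the net backwards: an op is kept if it produces a needed variable,
    # and its inputs then become needed as well.
    for op in reversed(ops):
        if needed.intersection(op.outputs):
            kept.append(op)
            needed.update(op.inputs)
    kept.reverse()
    return kept


# Hypothetical net: a classifier with both a loss branch and an accuracy branch.
net = [
    Op("fc", ["x", "w"], ["h"]),
    Op("softmax", ["h"], ["prob"]),
    Op("cross_entropy", ["prob", "label"], ["loss"]),
    Op("accuracy", ["prob", "label"], ["acc"]),
]

print([op.name for op in prune(net, ["loss"])])
```

With `loss` as the only target, the `accuracy` op is dropped because nothing the target needs depends on its output. If targets were ops instead of vars, the same backward walk would apply, seeded with the inputs of the target ops.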
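Choice of computation library:
Why use Eigen? Which scenarios is Eigen suited to, which is it not suited to, and what are the considerations?
If GPU kernels are written by hand, what are the conventions and requirements?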
OK. Got it.
@@ -0,0 +1,13 @@
Why introduce Op
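Questions I think need explaining:
- How are the implementations of the same Op for different devices (CPU, GPU) and different data types (float, double) organized?
- Why are there two contexts, InferShapeContext and ExecutionContext?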
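Yes. Currently only float kernels have been considered; double/fp16/int8 and so on have not. There is also the question of how the matching kernel is selected: is it specified by the user when configuring the network, or inferred from the type of the user's input data and selected at run time?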
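
One possible way to organize per-device, per-type kernels is a registry keyed by op type, device, and data type. The sketch below only illustrates that idea and is not Paddle's actual mechanism: `KERNELS`, `register_kernel`, `select_kernel`, and the toy `scale` kernels are all hypothetical names.

```python
# Hypothetical registry: (op type, device, dtype) -> kernel function.
KERNELS = {}


def register_kernel(op_type, device, dtype):
    """Decorator that files a kernel under its (op type, device, dtype) key."""
    def decorator(fn):
        KERNELS[(op_type, device, dtype)] = fn
        return fn
    return decorator


@register_kernel("scale", "cpu", "float32")
def scale_cpu_f32(xs, factor):
    return [float(x) * factor for x in xs]


@register_kernel("scale", "cpu", "int8")
def scale_cpu_i8(xs, factor):
    # Truncate toward zero, then saturate to the int8 range.
    return [max(-128, min(127, int(x * factor))) for x in xs]


def select_kernel(op_type, device, dtype):
    """Look up the kernel for a key, failing loudly if none is registered."""
    try:
        return KERNELS[(op_type, device, dtype)]
    except KeyError:
        raise NotImplementedError(
            f"no {op_type} kernel registered for {device}/{dtype}"
        ) from None


print(select_kernel("scale", "cpu", "float32")([1, 2], 2.0))
print(select_kernel("scale", "cpu", "int8")([100, -100], 2))
```

Either answer to the selection question fits this layout: the `dtype` part of the key could come from the user's network configuration, or be read from the input tensor at run time just before the lookup.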
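> How are the implementations of the same Op for different devices (CPU, GPU) and different data types (float, double) organized?
> Why are there two contexts, InferShapeContext and ExecutionContext?

OK.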
## Problems to solve

* [Why introduce Op](notes/why_use_op.md)?
* [Explicit Backward graph](notes/explicit_backward_in_topology.md)
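1 What is a computation graph? What are its nodes, and what are its edges? What kind of computation graph representation do we need: a control flow graph, a data flow graph, or something else?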
* [Why introduce Op](notes/why_use_op.md)?
* [Explicit Backward graph](notes/explicit_backward_in_topology.md)
* [Optimization of memory and GPU computation](notes/optimization_for_memory_and_gpu_kernel.md)
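2 How should we currently approach GPU memory optimization and computation optimization? Paddle's earlier design, based on coarse-grained layers, had many manual optimization strategies; now that the design is Op-based, which optimization strategies should we choose? We should give a roadmap: for example, mainly manual optimization at this stage (coarse-grained ops), with automatic optimization based on the computation graph later.
3 GPU memory optimization strategies
4 GPU computation optimization strategies, including kernel fusion, multi-stream, etc.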
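
As one illustration of what "automatic optimization based on the computation graph" could mean for memory, the sketch below computes, for each op, which variables are never touched again and could therefore be released. It is only a sketch under simplifying assumptions: ops are hypothetical `(name, inputs, outputs)` tuples in execution order, only the forward pass is considered, and variables listed in `keep` (parameters and final outputs) are never released.

```python
def last_use(ops):
    """Map each variable name to the index of the last op that reads or writes it."""
    last = {}
    for i, (_, inputs, outputs) in enumerate(ops):
        for var in list(inputs) + list(outputs):
            last[var] = i
    return last


def release_plan(ops, keep=()):
    """For each op index, list the variables whose memory can be released after it."""
    plan = [[] for _ in ops]
    for var, i in last_use(ops).items():
        if var not in keep:
            plan[i].append(var)
    return [sorted(p) for p in plan]


# Hypothetical forward-only net; w and w2 are parameters, y is the final output.
net = [
    ("fc", ["x", "w"], ["h"]),
    ("relu", ["h"], ["a"]),
    ("fc2", ["a", "w2"], ["y"]),
]

print(release_plan(net, keep={"w", "w2", "y"}))
```

A real strategy would also have to account for the backward pass, which keeps many activations alive until their gradients are computed, so this shows only the shape of the analysis.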
* [Explicit Backward graph](notes/explicit_backward_in_topology.md)
* [Optimization of memory and GPU computation](notes/optimization_for_memory_and_gpu_kernel.md)
* [Better error messages](notes/better_error_message.md)
* [A thinner Python implementation](notes/thin_python_implementation.md)
What this really means is a more flexible and easy-to-use user API, right?
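This is the first step toward the overall design doc for the refactoring. Its main purpose is to describe the problems that need attention during the refactoring. This PR only lists the outline. You can use this mind map for review.
If any topic is unnecessary or missing, please leave a comment; each sub-problem will be fleshed out in a separate follow-up PR.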