Add skeleton of refactor notes #3705

Closed
14 changes: 14 additions & 0 deletions doc/design/notes/GPU_memory_allocation.md
@@ -0,0 +1,14 @@
# GPU Memory Allocation Optimizations

## Differences between GPU cudaMalloc and CPU malloc


TBD

## Why Is GPU Memory Allocation Speed a Bottleneck?

TBD

## How Can GPU Memory Allocation Be Sped Up?

TBD
1 change: 1 addition & 0 deletions doc/design/notes/better_error_message.md
@@ -0,0 +1 @@
TBD
1 change: 1 addition & 0 deletions doc/design/notes/device_multi_devices_and_multi_nodes.md
@@ -0,0 +1 @@
TBD
10 changes: 10 additions & 0 deletions doc/design/notes/explicit_backward_in_topology.md
@@ -0,0 +1,10 @@
# Explicit Backward Graph

## Two Styles of Backward

TBD


## Pros and Cons of Explicitly Generating the Backward Graph

TBD
14 changes: 14 additions & 0 deletions doc/design/notes/optimization_for_memory_and_gpu_kernel.md
@@ -0,0 +1,14 @@
# Memory and GPU Computation Optimization

## Root Causes of the Problem


TBD

## Memory Optimization

TBD

## Computation Optimization

TBD
29 changes: 29 additions & 0 deletions doc/design/notes/sequence_related.md
@@ -0,0 +1,29 @@
# Sequence Information

## What Is Sequence Information?

TBD

## How Ops Handle Sequence Information

TBD

### Non-sequential Ops

TBD

### Sequential Ops

TBD

#### Convolution

TBD

#### Pooling

TBD

#### RNN

TBD
9 changes: 9 additions & 0 deletions doc/design/notes/single_process_multi_networks.md
@@ -0,0 +1,9 @@
# Training Multiple Models in a Single Process

## Use Cases

TBD

## Requirements on the Framework

TBD
13 changes: 13 additions & 0 deletions doc/design/notes/sparse_related.md
@@ -0,0 +1,13 @@
# Sparse-Related Issues

## Why Sparsity Must Be Supported

TBD

## Types and In-Memory Representation of Sparse Data

TBD

## Optimization Issues for Sparse Data

TBD
10 changes: 10 additions & 0 deletions doc/design/notes/thin_python_implementation.md
@@ -0,0 +1,10 @@
# Thin Python Implementation Layer

## Problems with Paddle's Current Python Implementation

TBD


## How Other Frameworks Handle This

TBD
13 changes: 13 additions & 0 deletions doc/design/notes/why_use_op.md
@@ -0,0 +1,13 @@
Why Introduce Op

@qingqing01 (Contributor) commented on Aug 28, 2017:

I think the following questions need to be addressed:

  1. How are the implementations of the same Op for different devices (CPU, GPU) and different data types (float, double) organized?
  2. Why are there two contexts, InferShapeContext and ExecutionContext?

Member commented:

Yes. So far we have only considered kernels for the float type, not double/fp16/int8 and so on. There is also the question of how the matching kernel gets selected: is it specified by the user when configuring the network, or inferred from the type of the user's input data and chosen at runtime?
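To make the two selection options concrete, here is a minimal sketch in Python. It is not Paddle's actual mechanism: the registry `KERNELS`, the decorator `register_kernel`, the function `select_kernel`, and the `scale` op are all hypothetical names invented for illustration. It assumes kernels are keyed by (op type, device, dtype), that an explicit user choice takes precedence, and that the dtype of the input data is used otherwise.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical registry: (op type, device, dtype) -> kernel function.
KernelKey = Tuple[str, str, str]
KERNELS: Dict[KernelKey, Callable] = {}


def register_kernel(op_type: str, device: str, dtype: str):
    """Decorator that files a kernel under its (op, device, dtype) key."""
    def decorator(fn: Callable) -> Callable:
        KERNELS[(op_type, device, dtype)] = fn
        return fn
    return decorator


@register_kernel("scale", "cpu", "float32")
def scale_cpu_float32(xs, factor):
    return [x * factor for x in xs]


@register_kernel("scale", "cpu", "float64")
def scale_cpu_float64(xs, factor):
    return [x * factor for x in xs]


def select_kernel(op_type: str, device: str, input_dtype: str,
                  user_dtype: Optional[str] = None) -> Callable:
    """Pick a kernel; an explicit user choice wins over the inferred dtype."""
    dtype = user_dtype if user_dtype is not None else input_dtype
    key = (op_type, device, dtype)
    if key not in KERNELS:
        raise KeyError(f"no kernel registered for {key}")
    return KERNELS[key]
```

With this layout, supporting double/fp16/int8 would amount to registering more entries, and a missing (device, dtype) combination fails with an explicit error instead of silently falling back.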

Author (Collaborator) commented:

> How are the implementations of the same Op for different devices (CPU, GPU) and different data types (float, double) organized?
> Why are there two contexts, InferShapeContext and ExecutionContext?

OK.

# Differences between Op and Layer

TBD

# Problems Solved by Introducing Op

TBD

# Problems Created by Introducing Op

TBD
24 changes: 24 additions & 0 deletions doc/design/refactor_notes.md
@@ -0,0 +1,24 @@
# Refactoring Design and Notes

## Basic Principles

### Simplicity

TBD

### Conventions

TBD

## Problems to Solve

* [Why introduce Op](notes/why_use_op.md)?
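Member commented:

Here are some concrete questions I can think of; I feel we need an FAQ aimed at developers.

  1. What does an Op's InferShape mainly do, and why does it need to run twice?
  2. Why is compute stateless, and why shouldn't a Var be created during compute?
  3. What data types does a Var mainly hold? Must it be created in a scope, and why?
  4. For a computation graph, how do we decide which part to run: by explicitly splitting the net into segments, or by deriving it automatically from dependencies and a target?
  5. If there is a target, what should it be, an op or a var, and why?
  6. What is in-place, and in which scenarios does it appear (e.g. parameter update)?
  7. Is building a Paddle network split into a compile time and a run time? How are the two distinguished, and where is the boundary? (Per yesterday's discussion there are two possibilities: VarDesc ==> scope, or tensors with memory versus tensors without memory.) Why split it this way, and what are the benefits?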
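As an illustration of the second option in question 4 (deriving the part to run from dependencies and a target), here is a small Python sketch. It is not taken from Paddle: the `Op` tuple, the function `ops_needed_for`, and the example program are hypothetical. It assumes the target is a var and that ops are listed in a valid execution order.

```python
from typing import List, NamedTuple, Set


class Op(NamedTuple):
    name: str
    inputs: List[str]
    outputs: List[str]


def ops_needed_for(ops: List[Op], target_var: str) -> List[str]:
    """Return the names of the ops required to compute target_var, in program order."""
    needed_vars: Set[str] = {target_var}
    keep: List[str] = []
    # Walk the program backwards: an op is kept if it produces a needed var,
    # and its inputs then become needed as well.
    for op in reversed(ops):
        if needed_vars.intersection(op.outputs):
            keep.append(op.name)
            needed_vars.update(op.inputs)
    keep.reverse()
    return keep


# Hypothetical program: "accuracy" is not needed to compute "loss".
program = [
    Op("fc", ["x", "w"], ["h"]),
    Op("softmax", ["h"], ["prob"]),
    Op("accuracy", ["prob", "label"], ["acc"]),
    Op("cross_entropy", ["prob", "label"], ["loss"]),
]

print(ops_needed_for(program, "loss"))  # ['fc', 'softmax', 'cross_entropy']
```

If the target were an op rather than a var, the same backward walk could start from that op's inputs; which of the two is the better interface is exactly what question 5 asks.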

Member commented:

On choosing the compute library:
Why use Eigen? In which scenarios is Eigen a good fit, in which is it not, and what are the considerations?
If GPU kernels are written by hand, what conventions and requirements apply?

Author (Collaborator) commented:

OK. Noted.

* [Explicit Backward graph](notes/explicit_backward_in_topology.md)
Member commented:

1 What is a computation graph? What are its nodes, and what are its edges? What kind of graph representation do we need: a control flow graph, a data flow graph, or something else?

* [Memory and GPU computation optimization](notes/optimization_for_memory_and_gpu_kernel.md)
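Member commented:

2 How should we currently approach GPU memory optimization and computation optimization? Paddle's earlier design, based on coarse-grained layers, relied on many manual optimization strategies; now that the design is based on ops, which optimization strategy should we choose? We should give a roadmap: for example, mainly manual optimization (coarse-grained ops) at the current stage, with automatic optimization based on the computation graph later.
3 GPU memory optimization strategies
4 GPU computation optimization strategies, including kernel fusion, multi-stream, etc.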

* [Better error messages](notes/better_error_message.md)
* [A more simplified Python implementation](notes/thin_python_implementation.md)
Member commented:

I suppose what this really means is a more flexible, easier-to-use user API?

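* [Devices, multiple devices, and multiple nodes](notes/device_multi_devices_and_multi_nodes.md)
* [Sparse matrices](notes/sparse_related.md)
* [Sequence information](notes/sequence_related.md)
* [Training multiple models in a single process](notes/single_process_multi_networks.md)
* [GPU memory allocation optimizations](notes/GPU_memory_allocation.md)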