2018 04 11

Lei Wang

[WIP] Refactor all build-related shell scripts: centralize parameter control in one script file and run different build jobs via options.
[WIP] Write scripts to control Docker container startup options.

wangkuiyi

MKLDNN code
- https://github.com/PaddlePaddle/Paddle/pull/9655#pullrequestreview-110139966
Code quality
- Unify Fluid C++ style: https://github.com/PaddlePaddle/Paddle/pull/9685
- Try to fix cpplint errors of fluid/recordio:
  - https://github.com/PaddlePaddle/Paddle/pull/9668
  - https://github.com/PaddlePaddle/Paddle/pull/9688
20 pull requests to clean up code: https://github.com/pulls?utf8=✓&q=is%3Apr+author%3Awangkuiyi

tonyyang-svail

Code cleanup: https://github.com/PaddlePaddle/Paddle/pull/9663
- Removal of NetOp, CondOp, backward.cc.
- Refactor test_batch_norm.py and test_layer_norm.py

tangwei

kexinzhao

VGG16 on ImageNet, V100 GPU, total time for one batch (ms):

batch size	1	2	4	8	16
float32	14.64	10.24	23.54	28.41	53.62
float16	3.94	4.62	6.21	9.39	15.82
Speedup	3.72	2.22	3.79	3.03	3.39

VGG16 on ImageNet, V100 GPU, time spent on conv op (ms):

batch size	1	2	4	8	16
float32	12.0	6.96	18.6	21.4	41.3
float16	1.81	2.11	2.95	4.57	8.0
Speedup	6.63	3.30	6.31	4.68	5.16
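
The Speedup rows above are consistent with dividing the float32 time by the float16 time at each batch size. The following is a minimal sketch that recomputes them from the table values; the variable and function names are illustrative and not taken from the benchmark code.

```python
# Timings in ms per batch, copied from the two tables above.
batch_sizes = [1, 2, 4, 8, 16]
total_ms = {
    "float32": [14.64, 10.24, 23.54, 28.41, 53.62],
    "float16": [3.94, 4.62, 6.21, 9.39, 15.82],
}
conv_ms = {
    "float32": [12.0, 6.96, 18.6, 21.4, 41.3],
    "float16": [1.81, 2.11, 2.95, 4.57, 8.0],
}


def speedups(timings):
    """Return float32 time / float16 time per batch size, rounded to 2 places."""
    return [
        round(f32 / f16, 2)
        for f32, f16 in zip(timings["float32"], timings["float16"])
    ]


total_speedup = speedups(total_ms)
conv_speedup = speedups(conv_ms)

for bs, total, conv in zip(batch_sizes, total_speedup, conv_speedup):
    print(f"batch size {bs}: total {total}x, conv op {conv}x")
```

The conv op speedup is higher than the total speedup at every batch size, so the remaining ops gain less from float16 than the conv op does.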

float16 support
- benchmark float16 inference on imagenet: https://github.com/kexinzhao/Paddle_benchmark/blob/master/float16_benchmark.md
- enable tensor core for GEMM: https://github.com/PaddlePaddle/Paddle/pull/9622
- enable tensor core for conv op: https://github.com/PaddlePaddle/Paddle/pull/9623
- add float16 support to softmax op: https://github.com/PaddlePaddle/Paddle/pull/9686
- add float16 support to activation ops: https://github.com/PaddlePaddle/Paddle/pull/9769
- fix CUDA 7.5 compile error: https://github.com/PaddlePaddle/Paddle/pull/9811
- add float16 support to save op and add float16 example code: https://github.com/PaddlePaddle/Paddle/pull/9864

qiaolongfei

fluid

Fluid support for Abacus
- Project: https://github.com/PaddlePaddle/Paddle/projects/56
- Task list: https://github.com/PaddlePaddle/Paddle/issues/9211
Fluid implementation:
1. Dist transpiler supports prefetch https://github.com/PaddlePaddle/Paddle/pull/9714
2. Add insert_op for block https://github.com/PaddlePaddle/Paddle/pull/9765
Code cleanup and optimization
1. Fix missing core.so on Mac https://github.com/PaddlePaddle/Paddle/pull/9725
2. Change MKLML download URL to BCE https://github.com/PaddlePaddle/Paddle/pull/9652

Wei Xing

documents:

PR:

Add title for kernel_hint_design.md & kernel_selection.md
- https://github.com/PaddlePaddle/Paddle/pull/9788
Fix API docs display error for fluid Initializer
- https://github.com/PaddlePaddle/Paddle/pull/9786
Fix display errors for images and tables in .md file:
Fix some dead links for fluid documents
- https://github.com/PaddlePaddle/Paddle/pull/9561
Add contents for manually building documentation
- https://github.com/PaddlePaddle/Paddle/pull/9298

issue:

All dead links in fluid documentation
- https://github.com/PaddlePaddle/Paddle/issues/9748
Error occurs when building APIs or documentation
- https://github.com/PaddlePaddle/Paddle/issues/9784

Xin Pan

Add a ParallelExecutor scheduling optimization; the 4-device speedup on ResNeXt improves by 14%. https://github.com/PaddlePaddle/Paddle/pull/9548
Add data feed for ParallelExecutor. https://github.com/PaddlePaddle/Paddle/pull/9637
Explore distributed training code. https://github.com/PaddlePaddle/Paddle/pull/9735
Cleanup: https://github.com/PaddlePaddle/Paddle/pull/9678 https://github.com/PaddlePaddle/Paddle/pull/9699 https://github.com/PaddlePaddle/Paddle/pull/9750

Dang Qingqing

ParallelExecutor
- Parallel testing during training by ParallelExecutor. https://github.com/PaddlePaddle/Paddle/pull/9738
- Support data type int64 in NCCL. https://github.com/PaddlePaddle/Paddle/pull/9818
- Improve test_parallel_executor. https://github.com/PaddlePaddle/Paddle/pull/9849
Image:
- PriorBox GPU kernel: https://github.com/PaddlePaddle/Paddle/pull/9553
- Enable ParallelExecutor in SSD-MobileNet and Refine code. https://github.com/PaddlePaddle/models/pull/832
- SE-ResNeXt with ParallelExecutor: https://github.com/PaddlePaddle/models/pull/816
- Doc for SSD: https://github.com/PaddlePaddle/models/pull/801
Others:
- Code cleanup in the profiler code. https://github.com/PaddlePaddle/Paddle/pull/9782
- https://github.com/PaddlePaddle/models/pull/821

Liu Yiqun

Inference Framework
- [Merged] Remove the use of ARCHIVE_START/END
  - https://github.com/PaddlePaddle/Paddle/pull/9844
- Test the speedup from merging the batch norm op computation on ResNet50: roughly 9% to 13% performance gain
- Update the documentation of inference
  - https://github.com/Xreki/Xreki.github.io/blob/master/fluid/inference/inference_support_in_fluid.md

gongweibao

Distributed transformer:
- https://docs.google.com/spreadsheets/d/1D5Xc_TfGfMV5aKh4ZJS_b4js3Mnn06H1Po0iuECZLr4/edit#gid=0
- https://github.com/PaddlePaddle/models/pull/811
Docstring checker style:
- https://github.com/PaddlePaddle/Paddle/pull/9848
Fix debugger bugs:
- https://github.com/PaddlePaddle/Paddle/pull/9705

wanghaoshuang

Fix average_accumulate_op for parallel_executor
- https://github.com/PaddlePaddle/Paddle/pull/9852
Fix loss of LoD when splitting tensors in parallel_executor.
- https://github.com/PaddlePaddle/Paddle/pull/9824
Implement OCR CTC parallel training by parallel_executor.
- https://github.com/PaddlePaddle/models/pull/833
- https://github.com/PaddlePaddle/models/issues/836
Refine document and scripts of CTC model
- https://github.com/PaddlePaddle/models/pull/798
Review:
- https://github.com/PaddlePaddle/models/pull/824

Yan Xu

Lookup remote table
- Support prefetch interface on gRPC server, https://github.com/PaddlePaddle/Paddle/pull/9593
- Init table value op, https://github.com/PaddlePaddle/Paddle/pull/9787
Doc
- Translate k8s dist train doc, https://github.com/PaddlePaddle/Paddle/pull/9789
review
WIP
- Async update in distributed training.

guosheng

NMT:
- Decouple the program desc from batch_size in Transformer (Merged).
  - https://github.com/PaddlePaddle/models/pull/783
- Refine the inference to optionally output special tokens in Transformer (Merged).
  - https://github.com/PaddlePaddle/models/pull/809
- Remove the pad token in Transformer (Merged).
  - https://github.com/PaddlePaddle/models/pull/819
- Transformer-related experiments.

yangyaming

Add plot script for Transformer
Add evaluation tools for Transformer
Add CI for ONNX converter
https://github.com/PaddlePaddle/paddle-onnx/pull/15
https://github.com/PaddlePaddle/paddle-onnx/pull/13
https://github.com/PaddlePaddle/paddle-onnx/pull/8

fengjiayi

Modify readers to fit the parallel executor:
- https://github.com/PaddlePaddle/Paddle/pull/9596
- https://github.com/PaddlePaddle/Paddle/pull/9743
[WIP] Test double buffer performance on transformer model:
- single GPU: 80.8 --> 67.9
Reviews:
- metrics: https://github.com/PaddlePaddle/Paddle/pull/9791
- updates on parallel executor:
  - https://github.com/PaddlePaddle/Paddle/pull/9838
  - https://github.com/PaddlePaddle/Paddle/pull/9774

dongzhihong

[Memory] Reuse the input variable of relu/sigmoid operators to reduce memory cost
- https://github.com/PaddlePaddle/Paddle/pull/9740
Refactor metrics; add AUC, detection mAP, and evaluators
- https://github.com/PaddlePaddle/Paddle/pull/9791
Migrate from the benchmark repo to the main Paddle repo
- https://github.com/PaddlePaddle/Paddle/pull/9760
- https://github.com/PaddlePaddle/Paddle/pull/9762

zhaochengduo

PR
- Add Broadcast and Gather op handle
  - https://github.com/PaddlePaddle/Paddle/pull/9825
- Add all gather op and all reduce op
  - https://github.com/PaddlePaddle/Paddle/pull/9713
- Refine SE-ResNeXt model and use ParallelExecutor.
  - https://github.com/PaddlePaddle/models/pull/816
- Crash training if the number of samples is less than the number of devices.
  - https://github.com/PaddlePaddle/Paddle/pull/9780
- Move reduceSum to elementwise_op_function.h
  - https://github.com/PaddlePaddle/Paddle/pull/9773
Review
- Enable ParallelExecutor in SSD-MobileNet and Refine code.
  - https://github.com/PaddlePaddle/models/pull/832
- Refine SE-ResNeXt model and use ParallelExecutor.
  - https://github.com/PaddlePaddle/models/pull/816
- Simplify DataStructure in SSAGraph
  - https://github.com/PaddlePaddle/Paddle/pull/9774
- Remove net op and cond_op
  - https://github.com/PaddlePaddle/Paddle/pull/9663
- Speed/sequence expand
  - https://github.com/PaddlePaddle/Paddle/pull/9289
- Fix Python package missing version.py
  - https://github.com/PaddlePaddle/Paddle/pull/9807
- Add float16 support to activation ops
  - https://github.com/PaddlePaddle/Paddle/pull/9769
- Fix cpplint errors with paddle/fluid/memory
  - https://github.com/PaddlePaddle/Paddle/pull/9669
- Fix test_conv2d_op when compiling without CUDA
  - https://github.com/PaddlePaddle/Paddle/pull/9698
- Fix CPPLint issues in tuple.h
  - https://github.com/PaddlePaddle/Paddle/pull/9670

Qingsheng Li

[WIP] Learning the NLP models word2vec and NMT
[WIP] Learning the basic logic of operators and layers
[WIP] Trying to complete the implementation of Yaming's PR
- https://github.com/PaddlePaddle/models/pull/675