vgg16 model batch size 64 OOM on old docker image #74
Noting some conclusions here.
A lesson learned from the vgg16 model. I vaguely remember that the previous V2 implementation could only support batch size 128 with 4 12G cards. So when it comes to Fluid, batch size 32 making the system reach its memory peak seems reasonable. However, a crucial fact is that mxnet only costs 7G of GPU memory even with 200 layers. To be more concrete, let's do some math. The input is 128 * 3 * 224 * 224. Before doing the convolution, we do im2col; its feature map shape (assuming SAME padding, kernel=3) equals 224 * 224 * (3 * 3 * 3). This im2col buffer is only used by one image at a time, so we can roughly estimate: 224 * 224 * 27 = 1,354,752 floats, about 5.4 MB in fp32 per image, and roughly 700 MB even if it were kept for the whole batch of 128. It cannot reach a horrific 1.5G.
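The estimate above can be sketched as a small script. This is a rough back-of-envelope model, not any framework's actual allocator: the sizes (224x224 input, 3 input channels, 3x3 kernel, SAME padding, stride 1, fp32, batch 128) are the assumptions from the comment, and `im2col_bytes` is a hypothetical helper name.

```python
def im2col_bytes(h, w, in_ch, kernel, dtype_bytes=4):
    """Size of the im2col buffer for ONE image: each of the h*w output
    positions (SAME padding, stride 1) stores a kernel*kernel*in_ch patch."""
    return h * w * (kernel * kernel * in_ch) * dtype_bytes

# First VGG16 conv layer, assumed shapes from the discussion above.
per_image = im2col_bytes(224, 224, 3, 3)   # 5,419,008 bytes, ~5.4 MB
whole_batch = 128 * per_image              # ~694 MB if kept for all 128 images

print(f"per image:  {per_image / 1e6:.1f} MB")
print(f"batch 128:  {whole_batch / 1e6:.1f} MB")
```

Even the pessimistic whole-batch figure stays well under the 1.5G in question, which is the point of the comment: im2col alone does not explain the OOM.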
Bisected by rolling back to the 1.11 image/CI build, but that explained nothing.