caffe-0.16 converges slower and produces lower accuracy (compared to caffe-0.15) #347

Closed
mathmanu opened this issue Jun 7, 2017 · 23 comments

@mathmanu

mathmanu commented Jun 7, 2017

The loss comes down more slowly and the final accuracy is also lower. Has anyone else observed a similar issue? A friend of mine also observed that the loss is more prone to exploding to NaN in caffe-0.16.

The same issue exists even if I don't use cuDNN. What could be the reason?

Thanks for your help.

@CFAndy

CFAndy commented Jun 8, 2017

The same trend on my side

@drnikolaev

drnikolaev commented Jun 8, 2017

Hi @mathmanu @ChenFengAndy,
which particular nets and datasets are you using? Do you use Python layers?

@CFAndy

CFAndy commented Jun 9, 2017

Mine is ResNet-50, no Python layers.

@drnikolaev

@ChenFengAndy do you observe the issue with a multi-GPU setup? If so, do you use NVLink or straight PCIe?

@CFAndy

CFAndy commented Jun 9, 2017

Yes, NVLink.

@mathmanu

mathmanu commented Jun 9, 2017

I don't use NVLink, only PCIe with two GTX 1080 cards. I observed this on both image classification and segmentation networks.

When I saw the problem I was curious whether it was related to multi-GPU, so I ran training with a single GPU. If I recall correctly, the trend was similar there as well, but I am not completely sure now.

@ChenFengAndy, can you run the training with one GPU and see if the trend is similar?

@drnikolaev

@mathmanu @ChenFengAndy thank you. I'll need some time to verify this. So far, a quick AlexNet + ImageNet + cuDNN v6 + DGX-1 comparison between 0.15 and 0.16 shows that 0.16 trains it almost two times faster. We also observe a performance boost on other nets.
May I bother you to paste NVCaffe logs here (both 0.15 and 0.16)? That would help a lot.

@mathmanu

mathmanu commented Jun 9, 2017

Maybe there is a miscommunication. I was talking about the loss and accuracy, not about speed.

@drnikolaev

@mathmanu yeah, thanks for pointing this out! We actually have some accuracy and determinism improvements in the pipeline; you can give them a try here: https://github.com/drnikolaev/caffe/tree/caffe-0.16
If it's still not satisfactory, please attach logs to this issue.

@mathmanu

Thanks. I am working on it.

@mathmanu

mathmanu commented Jun 15, 2017

I have attached training logs that explain this issue.
nvidia-caffe-issue-347-v1.zip
Please see the train.log files. I tried both classification and segmentation scenarios.

Following are the results:

imagenet classification - top-1 accuracy:

  • nvidia/caffe(caffe-0.15) 2-gpu: 60.89%
  • drnikolaev/caffe(caffe-0.16) 2-gpu: 57.62%

Conclusion: caffe-0.16 achieves lower classification accuracy.

cityscapes segmentation - pixel accuracy trend after 2000 iterations:

  • nvidia/caffe(caffe-0.15) 2-gpu: 90.54%
  • drnikolaev/caffe(caffe-0.16) 2-gpu: 88.20%
  • drnikolaev/caffe(caffe-0.16) 1-gpu: 89.53%

I also have (but have not attached) the full training logs for some (but not all) of the above segmentation scenarios, which show lower final accuracy in caffe-0.16.

Conclusion: the training loss drops much more slowly in caffe-0.16 and the final segmentation accuracy achieved is also lower.

(For segmentation, I used a custom ImageLabelData layer, which was especially needed in caffe-0.15 since it did not have a fixed random seed for the DataLayer; source code for the new layer is also included in the attached zip file.)
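To make the determinism point concrete, the standard knob is the solver-level seed shown below. This is only a minimal sketch (the net path and hyperparameter values are placeholders, not my actual settings); as noted above, in caffe-0.15 a fixed seed alone did not make the stock DataLayer reproducible, which is why the custom layer was needed.

```
# Minimal solver sketch (placeholder paths and values, not my actual settings).
# random_seed is the standard SolverParameter field; it seeds Caffe's RNG
# (weight fillers, dropout, etc.) so that repeated runs can be compared.
net: "models/segmentation/train_val.prototxt"   # placeholder path
random_seed: 42
base_lr: 0.01
momentum: 0.9
weight_decay: 0.0005
lr_policy: "poly"
power: 1.0
max_iter: 32000
snapshot: 8000
snapshot_prefix: "snapshots/segnet"             # placeholder path
solver_mode: GPU
```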

Let me know if you need any other information.

By the way, thank you for all the great work that you are doing. I get about a 25% speedup when using caffe-0.16.

drnikolaev added the bug label Jun 15, 2017
@drnikolaev

Hi @mathmanu, thank you very much for the detailed report. You are right, accuracy comes first, and we do test it. It seems we missed something here. Marked as a bug; work in progress...

@mathmanu

Thanks. Kindly review my ImageLabelData layer as well and let me know if I missed anything.

@mathmanu

I just noticed that the BatchNorm parameters used for the logs that I shared are not correct for caffe-0.16 (which needs slightly different parameters).

I will correct these and do a run, but training takes too much time for me since I have just two GTX 1080s. If you could try it on your DGX-1 after correcting the BN params, that would be great.

I have noticed the issue even when I use the correct BN parameters.
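For clarity, this is the kind of BatchNorm parameter difference I mean. It is only a sketch, not my exact nets; in particular, the scale_bias field for 0.16 is my assumption about the new fused BN, so please check caffe.proto in that branch for the exact names.

```
# caffe-0.15 style: BatchNorm followed by a separate Scale layer.
layer {
  name: "bn1"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1"
  batch_norm_param { moving_average_fraction: 0.9  eps: 0.0001 }
}
layer {
  name: "scale1"
  type: "Scale"
  bottom: "conv1"
  top: "conv1"
  scale_param { bias_term: true }
}

# caffe-0.16 style (as I understand it): learned scale/bias folded into BatchNorm.
layer {
  name: "bn1"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1"
  batch_norm_param {
    moving_average_fraction: 0.9
    eps: 0.0001
    scale_bias: true   # assumption: the 0.16 switch for the fused scale/bias
  }
}
```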

@drnikolaev

drnikolaev commented Jun 15, 2017

Is this similar: #276 (comment) ?
@borisgin could you have a look please?
@mathmanu sure, i'll run it tomorrow.

@mathmanu

mathmanu commented Jun 15, 2017

Hold on - I will update the results with corrected params tomorrow.

@mathmanu

I have re-run the simulations after correcting the params for the new BN. The issue is still very much there and the conclusions remain unchanged.

imagenet classification - top-1 accuracy:

  • nvidia/caffe(caffe-0.15) 2-gpu: 60.89%
  • drnikolaev/caffe(caffe-0.16) 2-gpu: 57.56%

Conclusion: caffe-0.16 achieves lower classification accuracy.

cityscapes segmentation - pixel accuracy trend after 2000 iterations:

  • nvidia/caffe(caffe-0.15) 2-gpu: 90.54%
  • drnikolaev/caffe(caffe-0.16) 2-gpu: 88.43%

Conclusion: the training loss drops much more slowly in caffe-0.16 and the final segmentation accuracy achieved is also lower.

The logs are in train.log files in the following attachment:
nvidia-caffe-issue-347-v2.zip

Looking forward to a solution. Thanks.

@cliffwoolley

cliffwoolley commented Jun 16, 2017 via email

@drnikolaev

@mathmanu @ChenFengAndy - we have reproduced and fixed the issue. Thanks again for reporting it. We are working on a new release now but if you want to get early access to the fix, please clone https://github.com/drnikolaev/caffe/tree/caffe-0.16 - it's still under construction but it does produce the same accuracy as 0.15 (at least on those nets we tested so far), like this one:

[attached training-curve screenshot: "0 16 fixed"]

@mathmanu

Great! I'll wait for the release.

@mathmanu

mathmanu commented Jun 20, 2017

As far as I understand the fix (in BN), it only changes the output of test/validation. So if I run a test with my previous model (trained in caffe-0.16, which had this bug) using the bug-fixed version, I should get the expected correct accuracy. Is that right?

@borisgin

No. The bug was in the code where the local learning rates were set for the scale and bias in the BN layers. You have to retrain the model.
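To illustrate what "local learning rate" refers to here: the per-blob lr_mult / decay_mult values, shown below in the classic BatchNorm + Scale prototxt pattern. This is only an illustrative sketch, not the actual code path that was fixed, and the values are typical choices rather than anything specific to this fix.

```
# Per-parameter ("local") learning rates for the BN scale and bias,
# using the classic BatchNorm + Scale split. Illustration only.
layer {
  name: "bn1"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1"
  # the three statistics blobs are not learned by gradient descent
  param { lr_mult: 0 }
  param { lr_mult: 0 }
  param { lr_mult: 0 }
}
layer {
  name: "scale1"
  type: "Scale"
  bottom: "conv1"
  top: "conv1"
  scale_param { bias_term: true }
  param { lr_mult: 1  decay_mult: 0 }  # learned scale (gamma)
  param { lr_mult: 1  decay_mult: 0 }  # learned bias (beta)
}
```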

@mathmanu

Thank you. I hope the cuDNN BN will get integrated into BVLC/caffe soon.
