Probably environment issues #6

Open
rabitwhte opened this issue Jul 29, 2019 · 4 comments

Comments

@rabitwhte

Hello. This is a great repository, and the papers are good as well. I'm particularly interested in using spatial configuration for landmark recognition. I wanted to run the 'spine' example, but the data is not available right now. Therefore I decided to run the hand_xray example first to understand how things work here. Unfortunately, there are some small issues. I managed to overcome some of them, but the one with the data format stopped me:

Data generator thread stop

Data generator thread stopData generator thread stopData generator thread stop

Traceback (most recent call last):
  File "C:\Users\user_name\Conda\deps\usr\envs\p37_analiza_cefalo_1\lib\site-packages\tensorflow\python\client\session.py", line 1356, in _do_call
    return fn(*args)
  File "C:\Users\user_name\Conda\deps\usr\envs\p37_analiza_cefalo_1\lib\site-packages\tensorflow\python\client\session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "C:\Users\user_name\Conda\deps\usr\envs\p37_analiza_cefalo_1\lib\site-packages\tensorflow\python\client\session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Default AvgPoolingOp only supports NHWC on device type CPU
  [[{{node net_1/unet/contracting/downsample0/AvgPool}}]]

I'm trying to run it on the CPU. Is there a way to work around this? Maybe you could provide a reproducible environment for the repository (e.g. a Docker image)?

Kind Regards!
JB

@blane85

blane85 commented Jul 30, 2019

Hi JB,

I found the spine dataset at the following link, so maybe you can download it from there? https://imperialcollegelondon.app.box.com/s/erhcm28aablpy1725lt93xh6pk31ply1

For the network, I have unzipped all the individual folders (training and testing images) and saved them to a folder called 'images' as instructed in the README file. If you would like me to share this already prepared folder with you, please let me know.

With regard to running on a CPU, why not sign up to use Google Colab (colab.research.google.com) - they will provide you free access to a GPU for training and running the network. You can also link directly to GitHub or your Google Drive for accessing files.

That said, I am still encountering problems as I am getting nan when I run main.py...

Thanks,

Beth

@rabitwhte
Author

Hey, thanks for that, I will definitely try that.

About the NaNs: did you see the TODO comment in main.py?

if __name__ == '__main__':
    # TODO: if the loss gets 'nan', either restart the network training or reduce the learning rate
    # change networks

Did you try to reduce the learning rate?

You probably have seen it, but nevertheless I decided to write about it, because sometimes when you start digging really deep you miss things that are on the surface.

Regards,
Jakub

@christianpayer
Owner

Thanks for your interest in our papers and the framework! Regarding the error message you observed when running the code on the CPU: many parts of the framework are neither implemented nor tested for CPU usage. However, you could try setting the property 'data_format' to 'channels_last' instead of 'channels_first'. This may work on the CPU, although I have mostly not tested it.
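
To illustrate what the two 'data_format' values mean, here is a minimal sketch in plain Python. The helper names `layout_permutation` and `permute_shape` are hypothetical and not part of the framework. They only show how the axes of a channels_first (NCHW) tensor map onto channels_last (NHWC), the layout that the CPU pooling op in the traceback asks for.

```python
def layout_permutation(source, target, spatial_dims=2):
    """Return the axis order that converts a tensor from `source` to `target`.

    'channels_first' means (batch, channels, *spatial), e.g. NCHW.
    'channels_last' means (batch, *spatial, channels), e.g. NHWC.
    Hypothetical helper for illustration only.
    """
    valid = ('channels_first', 'channels_last')
    if source not in valid or target not in valid:
        raise ValueError('data format must be one of %r' % (valid,))
    rank = spatial_dims + 2
    if source == target:
        # Nothing to do: identity permutation.
        return tuple(range(rank))
    if source == 'channels_first':
        # NCHW -> NHWC: batch, then the spatial axes, then the channel axis.
        return (0,) + tuple(range(2, rank)) + (1,)
    # NHWC -> NCHW: batch, then the (last) channel axis, then the spatial axes.
    return (0, rank - 1) + tuple(range(1, rank - 1))


def permute_shape(shape, permutation):
    """Apply an axis permutation to a shape tuple."""
    return tuple(shape[axis] for axis in permutation)


# Example with made-up sizes: a batch of 4 single-channel 512x512 images.
to_last = layout_permutation('channels_first', 'channels_last')
nhwc_shape = permute_shape((4, 1, 512, 512), to_last)
```

The returned tuple has the form expected by a transpose call such as `numpy.transpose` or `tf.transpose`, in case existing channels_first arrays need to be converted by hand.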

I hope you understand that we use this framework mainly for prototyping our current research, so a lot of documentation is missing and many parts are untested. I plan to improve this in the future. Unfortunately, my schedule is currently quite full and I don't have much time to work on the framework... But if you observe bugs or have any suggestions, just write a message!

Regarding the NaN values you observed, I intentionally set the learning rate that high so that training is faster. However, as you also observed, the loss sometimes becomes NaN. You have a couple of possible solutions: just restart the program, reduce the learning rate, or change both the optimizer and the learning rate.
We found that the SGD optimizer with Nesterov momentum works best and converges to a better minimum than other optimizers (e.g. Adam), at the cost of occasional NaN values. However, I'm quite sure that there is a better optimizer/learning rate configuration that we did not test.
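
The restart and reduce-the-learning-rate options can be combined in a small wrapper. The following is only a sketch under assumptions: `train_with_nan_recovery` and the `train_once` callable are hypothetical names, not functions of the framework, and halving the learning rate on each restart is an arbitrary choice.

```python
import math


def train_with_nan_recovery(train_once, learning_rate, decay=0.5, max_restarts=5):
    """Rerun training with a smaller learning rate whenever the loss is NaN/inf.

    `train_once` is a hypothetical callable that trains from scratch with the
    given learning rate and returns the final loss as a float.
    Returns the finite loss and the learning rate that produced it.
    """
    for _ in range(max_restarts + 1):
        loss = train_once(learning_rate)
        if math.isfinite(loss):
            return loss, learning_rate
        # Loss diverged: restart with a reduced learning rate.
        learning_rate *= decay
    raise RuntimeError('loss was not finite after %d restarts' % max_restarts)


def fake_train_once(learning_rate):
    """Stand-in for a real training run: diverges above an arbitrary threshold."""
    return float('nan') if learning_rate > 0.25 else 0.1


# Starts at 1.0, diverges twice, then succeeds at 0.25.
final_loss, final_lr = train_with_nan_recovery(fake_train_once, learning_rate=1.0)
```

In a real run, `train_once` would wrap whatever the training entry point is; checking the loss periodically during training, rather than only at the end, would avoid wasting time on a run that has already diverged.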

I hope these notes help and answer your questions.

Regards,
Christian

@rabitwhte
Author

Hey Christian,

thanks for the reply. I do understand how research works - documentation is the last thing you want to do when you see improvements on the horizon :)

I just wanted to let you know that I created a new virtual env with the p36 anaconda distribution and tensorflow-mkl, and it works. Nevertheless, it takes ages to train the network on the hand x-ray example with my resources. I will try Google Colab then. @blane85 once (and if) I manage to make everything work on Google Colab, I will share the code.

Regards,
Jakub


3 participants