This repository has been archived by the owner on Jan 7, 2025. It is now read-only.

Unable to Download big models #644

Open
tommy87 opened this issue Mar 18, 2016 · 5 comments

@tommy87

tommy87 commented Mar 18, 2016

When I try to download a big model (Total learned parameters: 192,498,050), I get the message:

Error: unable to connect to NVIDIA DIGITS

and the log file says:
[1271] [CRITICAL] WORKER TIMEOUT (pid: 5706)
[1271] [CRITICAL] WORKER TIMEOUT (pid: 5706)
[5852] [INFO] Booting worker with pid: 5852

@lukeyeager
Member

Sorry about that. That's annoying. The problem is that DIGITS is working really hard to compress your huge file, and it can't get the job done before hitting the gunicorn timeout. If you use a link like:

/models/20160317-160504-f18c/download.tar

instead of

/models/20160317-160504-f18c/download

Then you'll get the uncompressed tarball, which DIGITS should be able to create before it hits the timeout (full list of allowed extensions here).
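For scripted downloads, the same URL pattern can be used from Python. This is only a sketch: the base URL (host and port) is an assumption about your setup, and `build_download_url` and `download_model` are hypothetical helper names, not part of DIGITS. Only the `.tar` extension is taken from the comment above.

```python
import shutil
import urllib.request


def build_download_url(base_url, job_id, extension="tar"):
    """Return the model download URL with an explicit archive extension."""
    return "{}/models/{}/download.{}".format(base_url.rstrip("/"), job_id, extension)


def download_model(base_url, job_id, dest_path, extension="tar", timeout=600):
    """Stream the archive to dest_path in chunks instead of loading it into memory."""
    url = build_download_url(base_url, job_id, extension)
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out, length=1024 * 1024)
    return dest_path


# Example (assumed server address):
# download_model("http://localhost:5000", "20160317-160504-f18c", "model.tar")
```

Note that the `timeout` argument here only controls how long the client waits on the socket. It does not change the server-side gunicorn timeout.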

If that still doesn't work, you should be able to change the gunicorn timeout value (http://docs.gunicorn.org/en/stable/settings.html#timeout). I think you'd want to add timeout = 60 to /usr/share/digits/gunicorn_config.py and restart your server, but I haven't tested it. Let me know if it comes to that and I can help you.
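As a sketch of the change suggested above (untested, as noted), the gunicorn config file is ordinary Python, so the setting is a single assignment. The value is in seconds, and gunicorn's default is 30. The rest of the file's contents are not shown here because they depend on your installation.

```python
# Added to /usr/share/digits/gunicorn_config.py (path as given above).
# Workers that stay silent longer than this many seconds are killed and
# restarted, which is what produces the WORKER TIMEOUT log lines.
timeout = 60
```

A very large model may need a larger value than 60. Restart the server afterwards so the new setting is picked up.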

@lukeyeager lukeyeager added the bug label Mar 18, 2016
@tommy87
Author

tommy87 commented Aug 18, 2016

Sorry, I forgot to answer, but increasing the timeout helped me :)

But maybe you shouldn't rely on a fixed timeout. I think it would be better to ask the worker how far along it is, and only kill the process if it doesn't respond or makes no progress.
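The idea of a progress-based check instead of a fixed timeout could be sketched as follows. This is illustrative only: `ProgressWatchdog` is a hypothetical class, not something DIGITS or gunicorn provides, and it assumes the worker can report some progress value.

```python
import time


class ProgressWatchdog:
    """Flag a task as stalled only when its reported progress stops changing."""

    def __init__(self, stall_seconds, clock=time.monotonic):
        self.stall_seconds = stall_seconds
        self._clock = clock
        self._last_progress = None
        self._last_change = clock()

    def report(self, progress):
        # Reset the stall timer only when the progress value actually changes.
        if progress != self._last_progress:
            self._last_progress = progress
            self._last_change = self._clock()

    def is_stalled(self):
        # True once no change has been reported for longer than stall_seconds.
        return self._clock() - self._last_change > self.stall_seconds
```

A supervisor would call `report()` whenever the worker answers a progress query and stop the task only when `is_stalled()` returns true, so a slow but advancing compression is left alone.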

@lukeyeager
Member

The problem is that we don't currently use a worker to do the compression - it's done by the server process. That's why the server locks up.

Glad to hear the timeout hack was helpful!

@lukeyeager
Member

By removing gunicorn in #1127, we've sort of sidestepped the issue for now, since you'll be accessing Flask through werkzeug.

But we still need to address the fact that the server locks up when zipping a big model.
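One possible direction, shown only as a sketch and not as how DIGITS does or will do it, is to build the archive outside the request handler so the server stays responsive. The function names below are hypothetical, and a real fix might use a separate process or a job queue rather than a thread.

```python
import os
import tarfile
import threading


def build_tarball(src_dir, dest_path):
    """Write an uncompressed tar of src_dir, then move it into place atomically."""
    tmp_path = dest_path + ".partial"
    with tarfile.open(tmp_path, "w") as tar:
        tar.add(src_dir, arcname=os.path.basename(os.path.normpath(src_dir)))
    # Rename only after the archive is complete, so readers never see a partial file.
    os.replace(tmp_path, dest_path)


def start_archive_job(src_dir, dest_path):
    """Start building the archive in a background thread and return the thread."""
    thread = threading.Thread(
        target=build_tarball, args=(src_dir, dest_path), daemon=True
    )
    thread.start()
    return thread
```

The request handler could then return immediately and let the client poll until `dest_path` exists, instead of holding the connection open while the archive is written.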

@andrewcar

By navigating to /usr/share/digits/digits/jobs/ you can see the individual job folders that contain the .caffemodel, .prototxt, .solverstate, .pickle, and .log files.

I was able to scp the caffemodel directly from its job folder. Downloading that same model with the GUI download button in DIGITS had failed every time.
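To locate the file to copy, a small helper can list the `.caffemodel` files in a job folder, newest first. This is a sketch: `find_caffemodels` is a hypothetical name, the job ID in the example is the one from earlier in this thread, and it assumes the snapshots sit directly in the job folder as described above.

```python
from pathlib import Path


def find_caffemodels(job_dir):
    """Return the .caffemodel files directly inside job_dir, newest first."""
    return sorted(
        Path(job_dir).glob("*.caffemodel"),
        key=lambda path: path.stat().st_mtime,
        reverse=True,
    )


# Example (run on the DIGITS server):
# for path in find_caffemodels("/usr/share/digits/digits/jobs/20160317-160504-f18c"):
#     print(path, path.stat().st_size)
```

The printed path can then be passed to scp from another machine.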

If anyone has any questions, let me know.
