Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add timeouts to queue messages #3196

Merged
merged 14 commits into from
Feb 21, 2023
Merged

Add timeouts to queue messages #3196

merged 14 commits into from
Feb 21, 2023

Conversation

freddyaboulton
Copy link
Collaborator

@freddyaboulton freddyaboulton commented Feb 14, 2023

Description

Fixes #3187

I'm not 100% sure what the cause of the original issue is. Based on what I'll explain below, I think it may be that the front-end client is dropping connections during load events that are queued. Even if that isn't the root cause, I think what I propose here is a reasonable change in behavior that does fix the problem.

These are my findings:

I noticed that the process_events method would get stuck in the get_message call indefinitely. See the example print statements below:

disconnecting
resetting active jobs
cleaning
gathering data
sending send_data
sent send_data
getting data              <----- should print "got data" next but it never does!
about to send estimation
Sent estimation
about to send estimation
Sent estimation
about to send estimation
Sent estimation

Since the event was already in the active jobs slot and concurrency_count=1 by default, no other jobs would get pulled off the queue which is why it grew forever.

Since I noticed that it was getting stuck on get_message, I figured event.websocket.receive_json() was the culprit. I decided to add a 5 second time out and I haven't noticed any weirdness in https://huggingface.co/spaces/gradio/blocks_random_slider or https://huggingface.co/spaces/gradio/native_plots (I installed this PR on those spaces) . I also noticed that the timeout statements were getting printed, which means that we successfully avoid waiting for ever!

Will monitor for a bit though in case the problem still persists. Although, adding a timeout seems like a reasonable thing to do anyways.

I think a five second timeout is reasonable as this just applies to sending/receiving a json payload.

I feel like this is how I sound like explaining this:

image

Checklist:

  • I have performed a self-review of my own code
  • I have added a short summary of my change to the CHANGELOG.md
  • My code follows the style guidelines of this project
  • I have commented my code in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

A note about the CHANGELOG

Hello 👋 and thank you for contributing to Gradio!

All pull requests must update the change log located in CHANGELOG.md, unless the pull request is labeled with the "no-changelog-update" label.

Please add a brief summary of the change to the Upcoming Release > Full Changelog section of the CHANGELOG.md file and include
a link to the PR (formatted in markdown) and a link to your github profile (if you like). For example, "* Added a cool new feature by [@myusername](link-to-your-github-profile) in [PR 11111](https://github.com/gradio-app/gradio/pull/11111)".

If you would like to elaborate on your change further, feel free to include a longer explanation in the other sections.
If you would like an image/gif/video showcasing your feature, it may be best to edit the CHANGELOG file using the
GitHub web UI since that lets you upload files directly via drag-and-drop.

@gradio-pr-bot
Copy link
Collaborator

All the demos for this PR have been deployed at https://huggingface.co/spaces/gradio-pr-deploys/pr-3196-all-demos

@freddyaboulton freddyaboulton changed the title WIP: Add timeouts to queue messages Add timeouts to queue messages Feb 15, 2023
@freddyaboulton freddyaboulton marked this pull request as ready for review February 15, 2023 18:13
@pngwn
Copy link
Member

pngwn commented Feb 15, 2023

Isn't five seconds a really long time?

@freddyaboulton
Copy link
Collaborator Author

@pngwn Yea we can do 1 second? Sending the message should be fast but don't want to set something too low that would impact people on slow networks.

@abidlabs
Copy link
Member

Just getting a chance to look at this. Very cool hunting down the issue. What you're proposing is reasonable, although I worry that a 5 second timeout might actually be too little given the fact that these websocket messages could contain large files particularly if multiple images are involved, which are sent as base64 strings

Also, it is a very good sign that:

But currently the other Spaces from the issue such as:

are also running fine. Have you seen these Spaces build up a large queue while the others haven't?

@freddyaboulton
Copy link
Collaborator Author

@abidlabs Thanks for the review! Good point about the latency for large images. Let me toy with setting variable timeouts for the different messages. I think the getting the data and sending the result should have long timeouts (a minute) but the other message payloads are quite small and I think a second will suffice.

Will ping when this is ready for review again!

@freddyaboulton
Copy link
Collaborator Author

Also, altair_plot just got stuck for me and that demo does not include the timeouts. So I think it's a good sign that instituting some timeouts is beneficial. Just a matter of making sure we don't set it too small and shoot ourselves in the foot!

image

@abidlabs
Copy link
Member

Can confirm the altair_plots plot is stuck for me, while native_plots and blocks_random_slider are looking good!

@freddyaboulton
Copy link
Collaborator Author

@abidlabs I set the get_message timeout to 60 seconds (that's where the biggest data will be sent) and all other timeouts to 1 second.

That lets me upload a 44 mb image to this space: https://huggingface.co/spaces/gradio/image_mod and not get a timeout. I think 60 seconds is reasonable since I don't think total image payloads above 50mb will be common. And for other file types we will only send the filename when #3191 is ready. That being said, we can also set it higher.

Also, users will be notified if they reached the timeout uploading their data:

timeout_message

@abidlabs
Copy link
Member

abidlabs commented Feb 17, 2023

Getting some interesting behavior when testing out the deployed PR Space. The image_mod demo is not successfully running a prediction:

image

Gonna take a deeper look later today but thought I'd flag this for now @freddyaboulton

@freddyaboulton
Copy link
Collaborator Author

@abidlabs I can't reproduce 🤔

image

@abidlabs
Copy link
Member

I think this looks good. Let me just resurrect the queue benchmarking script and test the performance on this branch vs. main

@abidlabs
Copy link
Member

Ran the benchmarking script on this branch vs. main and here's what I found as the avg latency over 1000 runs (seconds):

test main 3187-load-events-queue
text 1.7082971923546104 1.7173802871112676
image 1.761333772516626 1.7226356868743897
audio 1.7082971923546104 1.7681459769481371
video 2.231178657029026 2.14179096728798

Given the results, I think we can go ahead and merge

@freddyaboulton
Copy link
Collaborator Author

Thank you for testing @abidlabs 🙏 Will merge once ci passes!

@freddyaboulton freddyaboulton merged commit 5df113a into main Feb 21, 2023
@freddyaboulton freddyaboulton deleted the 3187-load-events-queue branch February 21, 2023 21:44
dawoodkhan82 pushed a commit that referenced this pull request Feb 22, 2023
* Fix + test

* Remove print statements + fix import for 3.7

* CHANGELOG

* Remove more print statements

* Add 60 second timeout for uploading data

* Fix test

---------

Co-authored-by: Abubakar Abid <abubakar@huggingface.co>
dawoodkhan82 added a commit that referenced this pull request Feb 23, 2023
* dropdown

* more dropdown updates

* dropdown styling + option visibility

* changelog

* notebook

* fix test

* Allow more image formats (#3225)

* add wildcard to image input

* simplify mime types

* changelog

* regen noteboks

---------

Co-authored-by: Abubakar Abid <abubakar@huggingface.co>
Co-authored-by: pngwn <hello@pngwn.io>

* fix webcam mirroring (#3245)

* fix webcam

* changelog

* fix changelog

* fix changelog

* fix changelog

---------

Co-authored-by: Abubakar Abid <abubakar@huggingface.co>

* Add `interactive=False` mode to `gr.Button` (#3266)

* add interactive=False to button

* add interactive=True by default

* changelog

* fix frontend

* fix backend test

* formatting

* review changes

* LaTeX height fix (#3258)

* latex height fix

* changelog

* formatting

* em

* em

* accidentally added script (#3273)

* Adding a script to benchmark the queue (#3272)

* added benchmark queue script

* changelg

* fix instructions

* Fix matplotlib image size (#3274)

* Fix matplotlib css

* CHANGELOG

* Undo lockfile

* Add timeouts to queue messages (#3196)

* Fix + test

* Remove print statements + fix import for 3.7

* CHANGELOG

* Remove more print statements

* Add 60 second timeout for uploading data

* Fix test

---------

Co-authored-by: Abubakar Abid <abubakar@huggingface.co>

* icons

* separate options into component

* formatting

* changelog

* changelog

* fix ui tests

* formatting again...

* backend test fix

* format

* doc fixes

---------

Co-authored-by: Abubakar Abid <abubakar@huggingface.co>
Co-authored-by: fienestar <fienestar@gmail.com>
Co-authored-by: pngwn <hello@pngwn.io>
Co-authored-by: Freddy Boulton <alfonsoboulton@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

load events don't work well on HF spaces if queue enabled
4 participants