notebooks with many images in the output take an extremely long time to show a diff #396

Closed
stas00 opened this issue Jun 18, 2018 · 9 comments


stas00 commented Jun 18, 2018

  1. Some notebooks take an extremely long time to show a diff,

e.g. consider this notebook:

https://github.com/fastai/fastai/blob/master/courses/ml1/lesson2-rf_interpretation.ipynb

w/ or w/o any custom configuration it takes 10-15min (!) with CPU at 100% to get the diff output for:

http://localhost:8888/nbdime/git-difftool?base=fastai%2Fcourses%2Fml1%2Flesson2-rf_interpretation.ipynb

I think it has to do with multiple images appearing in the output of the notebook.

if I do the same with another notebook of about the same length, but with only a few images in the output:

https://github.com/fastai/fastai/blob/master/courses/ml1/lesson1-rf.ipynb

it takes some 10-30 secs to complete. My guesstimate is that each image in the output adds some 30 secs to the completion of the diff, but perhaps something else is causing it.

  2. If that's the case and it won't change, perhaps a kill/restart button would be useful at the top of the nbdime output (where the "Hide unchanged cells" radio button is). Without one, I had to manually fish out the "hanging" process, kill it, and restart the notebook by hand. When I did have the patience to wait, it took more than 15min to complete.

Thank you.

Collaborator

vidartf commented Jun 18, 2018

Thanks again for the report. For better debugging, could you share the two versions of the linked notebook that are used for diffing when the computation blows up? E.g. could you add a modified version of the linked one to a gist? Then I can know I'm reproducing your issue reliably, without having to install all the dependencies of the notebook code.

Separately, it might be reasonable for nbdime to do its diffing in a thread to try to prevent it blocking the entire server.
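
As a rough illustration of the threading idea, here is a minimal sketch that assumes an asyncio-based server. `slow_diff`, `handle_diff_request` and `heartbeat` are hypothetical names used only for this example, not nbdime functions. The blocking call is handed to a worker thread so the event loop can keep serving other work.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def slow_diff(a, b):
    """Hypothetical stand-in for a long, blocking diff (not an nbdime function)."""
    time.sleep(0.2)  # simulate expensive work
    return [item for item in a if item not in b]


async def handle_diff_request(executor, a, b):
    """Run the blocking diff in a worker thread and await its result."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, slow_diff, a, b)


async def heartbeat(ticks):
    """Stands in for other requests the server should keep serving."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(0.01)


async def main():
    ticks = []
    with ThreadPoolExecutor(max_workers=1) as executor:
        beat = asyncio.create_task(heartbeat(ticks))
        result = await handle_diff_request(executor, [1, 2, 3], [2])
        beat.cancel()
    return result, len(ticks)


if __name__ == "__main__":
    result, n_ticks = asyncio.run(main())
    print(result, "heartbeat ticks while diffing:", n_ticks)
```

Note that in CPython a pure-Python, CPU-bound diff still competes for the GIL, so a thread keeps the server responsive but does not make the diff itself faster. A process pool would be the alternative if that matters.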

Author

stas00 commented Jun 18, 2018

No need to install anything, modify or run the notebook. Just check out, clear outputs and run the diff.

As for using a separate thread for nbdime: yes, absolutely. I did notice that everything else was stuck while the never-ending diff was running.

p.s. does nbdime run against the locally saved version of the notebook, or against the currently loaded notebook, which may have unsaved changes? (If the former, you probably want to save the notebook after clearing the outputs.)

Collaborator

vidartf commented Jun 19, 2018

@stas00 I did the following steps here:

  • Check out the notebook
  • Copy it
  • Clear the outputs in the copy
  • run nbdiff <original> <copy>.
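
The "clear the outputs in the copy" step above can also be scripted. The sketch below is one possible way to do it with only the standard library, assuming a version 4 notebook (a JSON file with a top-level `cells` list). The helper names `clear_outputs` and `write_cleared_copy` are made up for this example.

```python
import json


def clear_outputs(nb):
    """Strip outputs and execution counts from all code cells (modifies nb in place)."""
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return nb


def write_cleared_copy(src_path, dst_path):
    """Read a notebook from src_path and write an output-free copy to dst_path."""
    with open(src_path, encoding="utf-8") as f:
        nb = json.load(f)
    with open(dst_path, "w", encoding="utf-8") as f:
        json.dump(clear_outputs(nb), f, indent=1)
        f.write("\n")
```

After calling `write_cleared_copy("original.ipynb", "copy.ipynb")` (file names are placeholders), the two files can be compared with `nbdiff original.ipynb copy.ipynb` as in the last step.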

While it did take a few seconds to print all the output to console, the actual diffing was < 1 second. For consistency, I also checked with:

  • Clone
  • Clear outputs
  • Save
  • Diff via notebook extension

and it too completed in less than a few seconds.

Also, when comparing outputs to an empty list of outputs, the diff should always be trivial. It makes no sense for such a diff to take a long time. To try to narrow the troubleshooting, I hope you could try/answer the following:

  • Do the diffing using the top-most method. This eliminates git and web things from the problem.
  • Which version of python are you using?
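
To illustrate why a diff against an empty list of outputs should be trivial, here is a small sketch of a list diff that short-circuits the empty cases before doing any matching. It is not nbdime's implementation. `diff_outputs` is a hypothetical helper, and the general case simply falls back to `difflib.SequenceMatcher` from the standard library.

```python
from difflib import SequenceMatcher


def diff_outputs(old, new):
    """Return (op, old_items, new_items) tuples describing how old becomes new."""
    # Trivial cases: with one side empty there is nothing to match up.
    if not old and not new:
        return []
    if not new:
        return [("delete", old, [])]
    if not old:
        return [("insert", [], new)]
    # General case: compare the items via a hashable key (their repr).
    keys_old = [repr(item) for item in old]
    keys_new = [repr(item) for item in new]
    matcher = SequenceMatcher(a=keys_old, b=keys_new, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            ops.append((tag, old[i1:i2], new[j1:j2]))
    return ops
```

With this shape, removing every output costs a single "delete" operation regardless of how large the individual outputs are.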

Author

stas00 commented Jun 19, 2018

Hmm, I thought it'd be simpler to reproduce. You're correct though, giving you the files explicitly would be the easiest. Here you go - both the original and the modified one:

https://www.dropbox.com/sh/vi4nrt0ph5fm5yg/AACpsLqVtsbEhe4jqDR5XBHua?dl=0

% time nbdiff lesson2-rf_interpretation-orig.ipynb lesson2-rf_interpretation.ipynb

real 11m52.799s
user 11m52.576s
sys 0m0.188s

Python 3.6.5
nbdime the master dev branch as of today

Collaborator

vidartf commented Jun 20, 2018

Thanks, that reproduced the problem! This is similar to a previous problem discussed in #237, in that it is the text/html outputs that blow up the processing. That discussion stalled at the WIP stage, but following the discussions there I made another attempt in #400, which should be more reliable.
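
For readers wondering how large text/html payloads might be kept from dominating a comparison, one generic technique is to compare payloads by digest before attempting any fine-grained diff. The sketch below only illustrates that general idea. It is not a description of what #237 or #400 actually do, and `payload_digest` and `same_payload` are hypothetical names.

```python
import hashlib


def payload_digest(value):
    """SHA-256 digest of a MIME payload given as a string or a list of lines."""
    text = "".join(value) if isinstance(value, list) else str(value)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def same_payload(out_a, out_b, mime="text/html"):
    """True if both outputs carry an identical payload for mime, or both lack it."""
    data_a = out_a.get("data", {})
    data_b = out_b.get("data", {})
    if (mime in data_a) != (mime in data_b):
        return False
    if mime not in data_a:
        return True
    return payload_digest(data_a[mime]) == payload_digest(data_b[mime])
```

A cheap equality check like this lets a differ skip unchanged payloads entirely and spend time only on the ones that differ.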

Author

stas00 commented Jun 20, 2018

That did the trick - the output is almost instant. Thank you, Vidar.

I guess the only remaining todo list from this issue is to move nbdime to its own thread - definitely not a show stopper, but a nice-to-have.

@claresloggett

Another example: I'm experiencing a similar issue with the notebook https://github.com/claresloggett/mbs-dataviz-2018/blob/master/Plotly-and-Altair-demos-exercises.ipynb .

An nbdiff of this notebook against a very slightly altered version of it either hangs or, more likely, takes a very long time to run (I can't say which, as I haven't got it to finish in 3 hours or so).

This notebook contains several Plotly plots and Altair plots, both of which tend to put the data into the notebook. It also may contain injected javascript from either library at the setup stage - I am pretty sure that even though it says connected=True in init_notebook_mode(), there is actually injected Plotly javascript there from the notebook's history. I could imagine that any of these could be an issue.

Running nbdiff on this notebook with the outputs stripped works fine.
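
To check which kind of embedded output is the likely culprit in a notebook like this, it can help to total the payload size per MIME type. This is a minimal sketch, assuming the version 4 notebook JSON layout where each output may carry a `data` mapping from MIME type to payload. `mime_sizes` is a hypothetical helper name.

```python
import json
from collections import Counter


def mime_sizes(nb):
    """Total payload size (in characters) per MIME type across all cell outputs."""
    totals = Counter()
    for cell in nb.get("cells", []):
        for output in cell.get("outputs", []):
            for mime, value in output.get("data", {}).items():
                if isinstance(value, list):
                    size = sum(len(line) for line in value)
                elif isinstance(value, str):
                    size = len(value)
                else:
                    # JSON-valued payloads (e.g. plot specifications) are measured serialized.
                    size = len(json.dumps(value))
                totals[mime] += size
    return totals
```

Loading the notebook with `json.load` and printing `mime_sizes(nb).most_common()` lists the heaviest output types first.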

Collaborator

vidartf commented Jul 17, 2018

@claresloggett Is this fixed in master? To install it:

pip install -e git+https://github.com/jupyter/nbdime#egg=nbdime

I wanted to do a new release with the latest fixes, but I got bogged down in other commitments.

Collaborator

vidartf commented Aug 28, 2018

With #400 merged, this should be fixed.

vidartf closed this as completed Aug 28, 2018