-
Notifications
You must be signed in to change notification settings - Fork 160
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Optimize approx text comparison #237
Conversation
The autojunk behaviour has some nasty side effects, I turned it off again but tried dropping the most expensive ratio computation for the approximate alignment instead. Seems to work. I also did some other minor fixes/improvements in text alignment. And found a few unrelated bugs that I just fixed in here to save time. |
nbdime/diffing/notebooks.py
Outdated
if strict: | ||
return compare_strings_approximate(x, y, threshold=0.95) | ||
else: | ||
return compare_strings_approximate(x, y, threshold=0.7, quick=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick already defaults to False, was this supposed to be True?
Currently, the alignments in the diff are off with this branch (some cells that are quite clearly unrelated get aligned). Not sure what the exact cause it. |
nbdime/diffing/generic.py
Outdated
|
||
# Skip slower and stricter check unless if quick is set | ||
if quick: | ||
return True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this maybe return False?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably not, if I do, very few things get aligned (but there seems to be no false positives).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be True. But I guess this was a bad idea. The quick ratios are too inaccurate to trust.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could try something like this instead:
https://pypi.python.org/pypi/editdistance
but it's a compiled extension.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested this library and for the notebooks you sent me it took more than 10 GB memory before I killed it. Not sure what to do now...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the old behavior is probably the best, but that (certain text-type) MIME data comparisons can be optimized by splitting on lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The editdistance package could possibly be an optional dependency (i.e. use it if available)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haha, 10GB. Let us drop that for now then.
Maybe we should just set a size limit on comparing outputs. If len(value) > limit only full equality will be considered equal. |
approximate alignment. Also improve the text comparison a bit more by making cutoffs for small text/plain situations more specific.
Superseded by #400. |
Previously, base64 strings embedded in e.g. HTML outputs could blow up the diff time (>1 min, compared to < 10 sec after this PR). To side-step this issue, the
autojunk
feature ofdifflib.SequenceMatcher
is enabled.@martinal Is there a good rational for the autojunk feature being turned off previously? Or was this just a cautionary choice? I'm not entirely sure of the side effects of this, but I'm only enabling it for approximate comparison (not diffing), and it didn't break any test :)
I've also added some of the profiling helpers I used to diagnose this, with a helper text.