[WIP] Optimize approx text comparison #237

vidartf · 2017-01-12T17:55:11Z

Previously, base64 strings embedded in e.g. HTML outputs could blow up the diff time (>1 min, compared to < 10 sec after this PR). To side-step this issue, the autojunk feature of difflib.SequenceMatcher is enabled.

@martinal Is there a good rationale for the autojunk feature having been turned off previously, or was it just a cautious choice? I'm not entirely sure of the side effects, but I'm only enabling it for approximate comparison (not diffing), and it didn't break any tests :)

I've also added some of the profiling helpers I used to diagnose this, with a helper text.
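The profiling helpers themselves are not shown in this thread. As a rough idea of how such a slowdown can be diagnosed, here is a minimal sketch using only the standard library. The names `profile_call` and `char_ratio` are hypothetical and are not the helpers added in this PR.

```python
import cProfile
import io
import pstats
from difflib import SequenceMatcher


def profile_call(func, *args, **kwargs):
    """Hypothetical helper: run func under cProfile, return (result, report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Show the five entries with the largest cumulative time.
    stats.sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()


def char_ratio(a, b):
    """Character-level similarity, the kind of call that dominated diff time."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


result, report = profile_call(char_ratio, "ab" * 200, "ba" * 200)
print(report)
```

Sorting by cumulative time makes it easy to see whether the time is spent inside `ratio()` or elsewhere.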

martinal · 2017-01-19T15:59:45Z

The autojunk behaviour has some nasty side effects, so I turned it off again and instead tried dropping the most expensive ratio computation for the approximate alignment. That seems to work. I also made some other minor fixes and improvements to text alignment, and fixed a few unrelated bugs in here to save time.
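The comment does not say which side effects were observed. One known consequence of the autojunk heuristic in `difflib.SequenceMatcher` can be sketched as follows: once the second sequence has 200 or more items, any item that makes up more than about 1% of it is ignored when seeding matches. The strings below are made up for illustration and are not taken from the notebooks discussed here.

```python
from difflib import SequenceMatcher

# Two strings that differ only in their first character. The rest is a long,
# highly repetitive run, a stand-in for base64-like content.
a = "x" + "ab" * 150
b = "y" + "ab" * 150

with_autojunk = SequenceMatcher(None, a, b, autojunk=True).ratio()
without_autojunk = SequenceMatcher(None, a, b, autojunk=False).ratio()

# With autojunk on, both "a" and "b" count as popular and no match is seeded,
# so the nearly identical strings look completely different.
print(with_autojunk)
print(without_autojunk)
```

Under these assumptions the autojunk ratio collapses even though the inputs share 300 of 301 characters, which is the kind of result that would mislead an approximate comparison.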

vidartf · 2017-01-19T17:40:04Z

nbdime/diffing/notebooks.py

+    if strict:
+        return compare_strings_approximate(x, y, threshold=0.95)
+    else:
+        return compare_strings_approximate(x, y, threshold=0.7, quick=False)


quick already defaults to False. Was this supposed to be True?

vidartf · 2017-01-19T18:17:27Z

Currently, the alignments in the diff are off with this branch (some cells that are quite clearly unrelated get aligned). Not sure what the exact cause is.

vidartf · 2017-01-19T18:19:11Z

nbdime/diffing/generic.py

+
+    # Skip the slower and stricter check if quick is set
+    if quick:
+        return True


Should this maybe return False?

Probably not. If I do, very few things get aligned (but there seem to be no false positives).

Should be True. But I guess this was a bad idea. The quick ratios are too inaccurate to trust.
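To illustrate why the quick ratios cannot be trusted on their own, here is a minimal sketch of a cascaded check. The function `approx_equal` is hypothetical and only mirrors the shape of the snippet above; it is not nbdime's actual implementation. `quick_ratio()` and `real_quick_ratio()` are upper bounds on `ratio()`, so they can safely reject a pair but cannot safely accept one.

```python
from difflib import SequenceMatcher


def approx_equal(a, b, threshold, quick=False):
    """Hypothetical sketch of a cascaded similarity check (not nbdime's code)."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    # Cheap upper bounds first: if even these fall short, ratio() will too.
    if matcher.real_quick_ratio() < threshold:
        return False
    if matcher.quick_ratio() < threshold:
        return False
    # With quick=True, trust the upper bound and skip the expensive check.
    if quick:
        return True
    return matcher.ratio() >= threshold


# Same letters, reversed order: the upper bounds are 1.0, the real ratio is low.
quick_result = approx_equal("abcd", "dcba", threshold=0.7, quick=True)
strict_result = approx_equal("abcd", "dcba", threshold=0.7, quick=False)
print(quick_result, strict_result)
```

The quick path accepts a pair that the full check rejects, which is consistent with unrelated cells being aligned on this branch.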

We could try something like this instead:
https://pypi.python.org/pypi/editdistance
but it's a compiled extension.

I tested this library, and for the notebooks you sent me it used more than 10 GB of memory before I killed it. Not sure what to do now...

I think the old behavior is probably the best, but comparisons of certain text-type MIME data can be optimized by splitting on lines.
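The line-splitting idea can be sketched as below, assuming the comparison is done with `difflib.SequenceMatcher` over lists of lines rather than over characters. The helper `line_ratio` is a hypothetical name, not a function from this PR. Matching whole lines reduces the number of items the matcher has to consider, at the cost of treating any edited line as entirely different.

```python
from difflib import SequenceMatcher


def line_ratio(a, b):
    """Hypothetical helper: similarity computed over lines, not characters."""
    return SequenceMatcher(
        None, a.splitlines(), b.splitlines(), autojunk=False
    ).ratio()


old = "line1\nline2\nline3"
new = "line1\nline2\nlineX"
score = line_ratio(old, new)
print(score)  # two of three lines match on each side
```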

The editdistance package could possibly be an optional dependency (i.e. use it if available)?

Haha, 10GB. Let us drop that for now then.

martinal · 2017-01-24T13:05:22Z

Maybe we should just set a size limit on comparing outputs: if len(value) > limit, only exactly equal values are treated as matching.
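A minimal sketch of that size-limit idea follows. The function name, the default `limit`, and the default `threshold` are assumptions chosen for illustration; the thread does not specify any of them.

```python
from difflib import SequenceMatcher


def compare_with_size_limit(x, y, limit=10000, threshold=0.7):
    """Hypothetical sketch: approximate compare, exact equality above limit."""
    if len(x) > limit or len(y) > limit:
        # Too large for an approximate comparison: require exact equality.
        return x == y
    return SequenceMatcher(None, x, y, autojunk=False).ratio() >= threshold


big = "a" * 20000
print(compare_with_size_limit(big, big))
print(compare_with_size_limit(big, big[:-1] + "b"))
print(compare_with_size_limit("hello world", "hello worle"))
```

The trade-off is that large outputs which differ only slightly are reported as unrelated, in exchange for a bounded comparison cost.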

vidartf · 2018-07-02T08:29:17Z

Superseded by #400.

vidartf changed the title ~~Optimize approx text comparison~~ [WIP] Optimize approx text comparison Feb 2, 2017

vidartf mentioned this pull request Feb 8, 2017

Various fixes and minor adjustments #257

Merged

vidartf and others added 4 commits March 28, 2017 11:14

Optimize diff: Use autojunk in approx text compare

82c4e7c

Don't use autojunk. Instead drop the most expensive .ratio() call for approximate alignment. Also improve the text comparison a bit more by making cutoffs for small text/plain situations more specific.

11ebe75

Switch quick to True...

3212202

Drop quick parameter to string comparison.

490819a

vidartf force-pushed the optimize-diff branch from c6ae1dd to 490819a March 28, 2017 09:16

vidartf mentioned this pull request Apr 20, 2017

Add caching for mimedata comparison #282

Merged

This was referenced Jun 20, 2018

Optimize text comparison for text MIME data #400

Merged

notebooks with many images in the output take an extremely long time to show a diff #396

Closed

vidartf closed this Jul 2, 2018

vidartf deleted the optimize-diff branch July 2, 2018 08:29
