Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Optimize approx text comparison #237

Closed
wants to merge 4 commits into from

Conversation

vidartf
Copy link
Collaborator

@vidartf vidartf commented Jan 12, 2017

Previously, base64 strings embedded in e.g. HTML outputs could blow up the diff time (>1 min, compared to < 10 sec after this PR). To side-step this issue, the autojunk feature of difflib.SequenceMatcher is enabled.

@martinal Is there a good rational for the autojunk feature being turned off previously? Or was this just a cautionary choice? I'm not entirely sure of the side effects of this, but I'm only enabling it for approximate comparison (not diffing), and it didn't break any test :)

I've also added some of the profiling helpers I used to diagnose this, with a helper text.

@martinal
Copy link
Collaborator

The autojunk behaviour has some nasty side effects, I turned it off again but tried dropping the most expensive ratio computation for the approximate alignment instead. Seems to work. I also did some other minor fixes/improvements in text alignment. And found a few unrelated bugs that I just fixed in here to save time.

if strict:
return compare_strings_approximate(x, y, threshold=0.95)
else:
return compare_strings_approximate(x, y, threshold=0.7, quick=False)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick already defaults to False, was this supposed to be True?

@vidartf
Copy link
Collaborator Author

vidartf commented Jan 19, 2017

Currently, the alignments in the diff are off with this branch (some cells that are quite clearly unrelated get aligned). Not sure what the exact cause it.


# Skip slower and stricter check unless if quick is set
if quick:
return True
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this maybe return False?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not, if I do, very few things get aligned (but there seems to be no false positives).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be True. But I guess this was a bad idea. The quick ratios are too inaccurate to trust.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could try something like this instead:
https://pypi.python.org/pypi/editdistance
but it's a compiled extension.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this library and for the notebooks you sent me it took more than 10 GB memory before I killed it. Not sure what to do now...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the old behavior is probably the best, but that (certain text-type) MIME data comparisons can be optimized by splitting on lines.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The editdistance package could possibly be an optional dependency (i.e. use it if available)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, 10GB. Let us drop that for now then.

@martinal
Copy link
Collaborator

Maybe we should just set a size limit on comparing outputs. If len(value) > limit only full equality will be considered equal.

@vidartf vidartf changed the title Optimize approx text comparison [WIP] Optimize approx text comparison Feb 2, 2017
vidartf and others added 4 commits March 28, 2017 11:14
@vidartf
Copy link
Collaborator Author

vidartf commented Jul 2, 2018

Superseded by #400.

@vidartf vidartf closed this Jul 2, 2018
@vidartf vidartf deleted the optimize-diff branch July 2, 2018 08:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants