Within-line edit highlighting is unpredictable #140

jakabk · 2020-04-28T11:08:22Z

Delta version: delta 0.1.1
I'm using mate-terminal (1.20.2) on Debian Buster.
The 24-bit color test seems to work, but the diff highlighting does not appear,
What do I wrong? Is it some extra config needed? I attach a screenshot about my terminal.

echo $TERM
xterm
echo $COLORTERM
truecolor

0xC0FFEE · 2020-04-28T14:20:05Z

FWICT this only seems to happen for some inter-line changes. However I haven't figured out the pattern yet.

Considering this line as the base:

did_emsg = FALSE;

changing it to:

did_emsg == FALSE;

Correctly highlights the diff:

However changing the line to:

did_emsg = TRUE;

Fails:

Related: #119

dandavison · 2020-04-28T15:08:14Z

Hi @jakabk @0xC0FFEE thanks for this feedback. @0xC0FFEE as you've noted this is the same issue as #119. So, your example starts to be highlighted with --max-line-distance set to a larger value (screenshot below).

Basically, this needs to be improved. The issue here is that in both examples (@jakabk's and @0xC0FFEE's) the changed section is a fairly large proportion of the line length, and delta is being (too) conservative and refusing to treat them as homologous lines. But that doesn't make much sense seeing as there is one removed line and one added line. It starts to make a bit more sense when there are, say, 1 removed line and 2 added lines, because then delta needs to make a decision about which of the 2 added lines is the replacement.

If you have any thoughts about how we should go about improving this then please let me know! My initial thought is that the algorithm should perhaps be different in the two cases: (1) same number of red and green lines vs (2) different numbers of red and green lines. GitHub/GitLab do not output the within-line highlights at all in case (2), and delta is trying to go one better here, and I think needs to do some work to stop making a mess of the common case.

0xC0FFEE · 2020-04-28T17:16:50Z

Hi @dandavison, the approach to differentiate the two cases (1) and (2) seems to make sense to me. From my experience diff-so-fancy does a good job with (1). As I've never had tooling for (2), I currently have no educated thoughts on this, except that it sounds fun and I've seen diffs today, where it really helped 👍.

My initial take was, that a generic distance is not enough and instead (or in addition) something like the longest common sub string could be utilized. IMHO most of the time we want to highlight within-line changes when they happen to be contiguous and this is what I would explore. However I'd have a look at diff-so-fancy (and possibly other tools as well e.g. vim, gitlab, ...) before solving this from scratch. This sounds like a really common problem to me and maybe even writing this in library form might be beneficial.

jakabk · 2020-04-29T13:10:02Z

Hi, @dandavison I'm joining to @0xC0FFEE I'd like to support the idea handling the two cases separated as well.

In the first case (one + and one -) I'd like to see the change highlighted. The recommended switch: --max-line-distance, makes much more usable the diff highlighting but it's not perfect.

delta --max-line-distance 0.8

In the second case, I find @0xC0FFEE idea good: using the longest common substring to choose the appropriate lines and showing the diff between the two chosen lines.

dandavison · 2020-04-30T14:00:42Z

Thanks @0xC0FFEE @jakabk @igordovgaluk. I definitely want to get an improved v2 within-line edit inference algorithm out. It would be really helpful if you could carry on submitting more examples that you encounter of bad (or good) within-line highlighting (as plain text, just as comments in this issue is fine, or as a PR against the tests/examples directory). I wonder whether it ultimately will be helpful to have a large-ish collection of such examples on which we can score alternative algorithms; possibly even explore the parameter space programatically. I've added scripts with your examples so far to the tests/examples/ dir in the repo.

Here's a counter-example:

diff --git a/file b/file
index 1dc5384..fa4f6e9 100644
--- a/file
+++ b/file
@@ -1,3 +1,3 @@
-aaaa a aaa
-bbbb b bbb
-cccc c ccc
+bbbb ! bbb
+dddd d ddd
+cccc ! ccc

Current delta does this one correctly (and this is what I was going for with the original algorithm):

Do you see value in that sort of behavior? I don't think any of diff-so-fancy, GitLab, GitHub etc get that one right.

If we switch to "naive line matching" when there are equal numbers of plus and minus lines (i.e. always assume that the i-th green evolved from the i-th red line), then we of course will "fail" on this example:

I'm not saying that it's essential to do this example correctly: it's obviously important to do the common cases well, and I think the equal-line-numbers that you're highlighting are common cases.

So, this needs more thought but off the top of my head here are some considerations:

Whether to match lines "naively" when there are equal numbers of red and green lines
If so, I think we still need some sort of threshold to avoid "forcing" noisy/spurious edits on two completely different lines.
If not, then what algorithm is used for deciding which lines are homologous? The original (current) algorithm uses a needleman-wunsch / levenshtein edit distance, and demands that that exceeds a certain threshold in order for two lines to be considered homologs. It's greedy, to avoid a quadratic-time search over all matching pairs (i.e., basically, assumes the lines are not out of order).
Is it possible to have the best of both worlds where we succeed on things like the "counter-example" above and also get the simple examples of same-number-of-red-and-green right?
Can we put together a collection of examples and score alternative algorithms?
What should the defaults be if delta ends up supporting alternative algorithms?

0xC0FFEE · 2020-05-01T11:52:02Z

Do you see value in that sort of behavior? I don't think any of diff-so-fancy, GitLab, GitHub etc get that one right.

I definitely think that this is can be valuable (even if it is still somewhat hard for me to comprehend and understand what changed). Also I guess you are right, as I've not seen cases like this handled before in other tools 👍.

I think gathering (real world) examples were delta produces unexpected outputs will help in the future.

dandavison · 2020-05-01T12:54:43Z

Great. Just in case it's not already clear, handling the unequal-number-of-lines case in general is something that delta does that AFAIK other tools do not. It's that which is currently causing us to encounter suboptimal examples in the same-number-of-lines case (but I'm sure we can improve this). Here's an example of Delta's unequal-number-of-lines handling:

Github:

Delta:

dandavison · 2020-05-01T15:22:44Z

Here's something to play with if you can build from source: I've pushed 5162437 to master. This allows you to activate an experimental feature by setting an environment variable:

export DELTA_EXPERIMENTAL_MAX_LINE_DISTANCE_FOR_NAIVELY_PAIRED_LINES=0.8

Here's what that does:

If a hunk contains the same number of added and removed lines, then those will be assumed to match "naively", i.e. the i-th removed line is homologous to the i-th added line.
If the distance between the two lines is less than 0.8, then apply emphasis colors to the inferred edit operations transforming the old line into the new line.

"distance" is a number between 0 (identical lines) and 1 (totally different lines). It is basically defined as

             number of characters involved in inferred edits
distance = -------------------------------------------------
                      number of characters in the line

See edits.rs for exact calculation.

So, for example, if the env var has a value > 0.32 then we get

In practice I'm thinking that values somewhere above 0.6 will be appropriate but it would be great to hear what you find works with this version of the algorithm.

0xC0FFEE · 2020-05-01T20:06:54Z

Thanks for providing a usable version of your idea so promptly. I'll test-drive this next week and report interesting results back. Kudos!

dandavison · 2020-05-01T20:50:40Z

Just to be clear, I put this up simply because it was very easy to do from what we have already; there could well be merit in more involved changes.

dandavison · 2020-05-09T14:59:21Z

If anyone tries out DELTA_EXPERIMENTAL_MAX_LINE_DISTANCE_FOR_NAIVELY_PAIRED_LINES then it would be really helpful to hear about your findings! I think that setting it to a large value (closer to 1.0) solves many of the "I was expecting this to be highlighted but it isn't" issues. However, we need to have some control on it to prevent it inferring "edits" between totally unrelated lines.

This aspect of delta development (improving the within-line highlight algorithms and options) is one that could benefit a lot from having multiple people's thoughts (since we use different languages, work in code bases with different styles, etc).

phillipwood · 2022-11-28T16:45:54Z

delta --max-line-distance 0.8

With PR #1244 this is now highlighted correctly without having to tweak max-line-distance

phillipwood · 2022-11-29T14:51:31Z

So, this needs more thought but off the top of my head here are some considerations:

* Whether to match lines "naively" when there are equal numbers of red and green lines

* If so, I think we still need some sort of threshold to avoid "forcing" noisy/spurious edits on two completely different lines.

* If not, then what algorithm is used for deciding which lines are homologous?   The original (current) algorithm uses a needleman-wunsch / levenshtein edit distance, and demands that that exceeds a certain threshold in order for two lines to be considered homologs. It's greedy, to avoid a quadratic-time search over all matching pairs (i.e., basically, assumes the lines are not out of order).

I wonder if it would be practical to use the patience diff algorithm to pair up lines based on unique tokens within them and then use the current LCS calculation to highlight them once they're paired. That would avoid a quadratic search to pair the lines up while hopefully avoiding the suboptimal matches from the greedy matching that delta currently uses.

dandavison changed the title ~~Diff highlight does not seem to work~~ Within-line edit highlighting is unpredictable Apr 28, 2020

dandavison mentioned this issue Apr 29, 2020

Improve within-line edit inference defaults #119

Closed

dandavison added a commit that referenced this issue Apr 30, 2020

Add examples of within-line edit inference from #119, #140

692fb19

This was referenced May 5, 2020

🚀 Highlighting of --word-diff output, and custom "squashed" diff output #152

Open

🐛 Incorrectly highlighting removed characters #175

Closed

This was referenced May 11, 2020

🐛within-line insertion/deletion detection is not working in all cases that work with diff-highlight #184

Closed

🚀 Minimal color mode reusing the default colors #193

Closed

dandavison mentioned this issue May 20, 2020

🐛Missing word highlighting #198

Closed

dandavison mentioned this issue May 29, 2020

🚀RFC: New color/style options, diff-so-fancy/diff-highlight emulation #205

Closed

dandavison mentioned this issue Jun 28, 2020

🚀 improve diff -u comparison #229

Closed

dandavison mentioned this issue Oct 3, 2020

🐛 Strange choices in word diff highlighting #342

Closed

dandavison mentioned this issue Feb 11, 2021

🚀 Minimize the number of highlighted blocks #521

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Within-line edit highlighting is unpredictable #140

Within-line edit highlighting is unpredictable #140

jakabk commented Apr 28, 2020 •

edited

Loading

0xC0FFEE commented Apr 28, 2020 •

edited

Loading

dandavison commented Apr 28, 2020 •

edited

Loading

0xC0FFEE commented Apr 28, 2020

jakabk commented Apr 29, 2020

dandavison commented Apr 30, 2020 •

edited

Loading

0xC0FFEE commented May 1, 2020

dandavison commented May 1, 2020

dandavison commented May 1, 2020 •

edited

Loading

0xC0FFEE commented May 1, 2020

dandavison commented May 1, 2020

dandavison commented May 9, 2020

phillipwood commented Nov 28, 2022

phillipwood commented Nov 29, 2022

Within-line edit highlighting is unpredictable #140

Within-line edit highlighting is unpredictable #140

Comments

jakabk commented Apr 28, 2020 • edited Loading

0xC0FFEE commented Apr 28, 2020 • edited Loading

dandavison commented Apr 28, 2020 • edited Loading

0xC0FFEE commented Apr 28, 2020

jakabk commented Apr 29, 2020

dandavison commented Apr 30, 2020 • edited Loading

0xC0FFEE commented May 1, 2020

dandavison commented May 1, 2020

dandavison commented May 1, 2020 • edited Loading

0xC0FFEE commented May 1, 2020

dandavison commented May 1, 2020

dandavison commented May 9, 2020

phillipwood commented Nov 28, 2022

phillipwood commented Nov 29, 2022

jakabk commented Apr 28, 2020 •

edited

Loading

0xC0FFEE commented Apr 28, 2020 •

edited

Loading

dandavison commented Apr 28, 2020 •

edited

Loading

dandavison commented Apr 30, 2020 •

edited

Loading

dandavison commented May 1, 2020 •

edited

Loading