Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Within-line edit highlighting is unpredictable #140

Open
jakabk opened this issue Apr 28, 2020 · 13 comments
Open

Within-line edit highlighting is unpredictable #140

jakabk opened this issue Apr 28, 2020 · 13 comments

Comments

@jakabk
Copy link

jakabk commented Apr 28, 2020

Delta version: delta 0.1.1
I'm using mate-terminal (1.20.2) on Debian Buster.
The 24-bit color test seems to work, but the diff highlighting does not appear,
What do I wrong? Is it some extra config needed? I attach a screenshot about my terminal.

echo $TERM
xterm
echo $COLORTERM
truecolor
2020-04-28_13-06

@0xC0FFEE
Copy link

0xC0FFEE commented Apr 28, 2020

FWICT this only seems to happen for some inter-line changes. However I haven't figured out the pattern yet.

Considering this line as the base:

did_emsg = FALSE;

changing it to:

did_emsg == FALSE;

Correctly highlights the diff:

image

However changing the line to:

did_emsg = TRUE;

Fails:

image

Related: #119

@dandavison
Copy link
Owner

dandavison commented Apr 28, 2020

Hi @jakabk @0xC0FFEE thanks for this feedback. @0xC0FFEE as you've noted this is the same issue as #119. So, your example starts to be highlighted with --max-line-distance set to a larger value (screenshot below).

Basically, this needs to be improved. The issue here is that in both examples (@jakabk's and @0xC0FFEE's) the changed section is a fairly large proportion of the line length, and delta is being (too) conservative and refusing to treat them as homologous lines. But that doesn't make much sense seeing as there is one removed line and one added line. It starts to make a bit more sense when there are, say, 1 removed line and 2 added lines, because then delta needs to make a decision about which of the 2 added lines is the replacement.

If you have any thoughts about how we should go about improving this then please let me know! My initial thought is that the algorithm should perhaps be different in the two cases: (1) same number of red and green lines vs (2) different numbers of red and green lines. GitHub/GitLab do not output the within-line highlights at all in case (2), and delta is trying to go one better here, and I think needs to do some work to stop making a mess of the common case.

image

@dandavison dandavison changed the title Diff highlight does not seem to work Within-line edit highlighting is unpredictable Apr 28, 2020
@0xC0FFEE
Copy link

Hi @dandavison, the approach to differentiate the two cases (1) and (2) seems to make sense to me. From my experience diff-so-fancy does a good job with (1). As I've never had tooling for (2), I currently have no educated thoughts on this, except that it sounds fun and I've seen diffs today, where it really helped 👍.

My initial take was, that a generic distance is not enough and instead (or in addition) something like the longest common sub string could be utilized. IMHO most of the time we want to highlight within-line changes when they happen to be contiguous and this is what I would explore. However I'd have a look at diff-so-fancy (and possibly other tools as well e.g. vim, gitlab, ...) before solving this from scratch. This sounds like a really common problem to me and maybe even writing this in library form might be beneficial.

@jakabk
Copy link
Author

jakabk commented Apr 29, 2020

Hi, @dandavison I'm joining to @0xC0FFEE I'd like to support the idea handling the two cases separated as well.

In the first case (one + and one -) I'd like to see the change highlighted. The recommended switch: --max-line-distance, makes much more usable the diff highlighting but it's not perfect.

delta --max-line-distance 0.8
image

In the second case, I find @0xC0FFEE idea good: using the longest common substring to choose the appropriate lines and showing the diff between the two chosen lines.

@dandavison
Copy link
Owner

dandavison commented Apr 30, 2020

Thanks @0xC0FFEE @jakabk @igordovgaluk. I definitely want to get an improved v2 within-line edit inference algorithm out. It would be really helpful if you could carry on submitting more examples that you encounter of bad (or good) within-line highlighting (as plain text, just as comments in this issue is fine, or as a PR against the tests/examples directory). I wonder whether it ultimately will be helpful to have a large-ish collection of such examples on which we can score alternative algorithms; possibly even explore the parameter space programatically. I've added scripts with your examples so far to the tests/examples/ dir in the repo.

Here's a counter-example:

diff --git a/file b/file
index 1dc5384..fa4f6e9 100644
--- a/file
+++ b/file
@@ -1,3 +1,3 @@
-aaaa a aaa
-bbbb b bbb
-cccc c ccc
+bbbb ! bbb
+dddd d ddd
+cccc ! ccc

Current delta does this one correctly (and this is what I was going for with the original algorithm):

image

Do you see value in that sort of behavior? I don't think any of diff-so-fancy, GitLab, GitHub etc get that one right.

If we switch to "naive line matching" when there are equal numbers of plus and minus lines (i.e. always assume that the i-th green evolved from the i-th red line), then we of course will "fail" on this example:

image

I'm not saying that it's essential to do this example correctly: it's obviously important to do the common cases well, and I think the equal-line-numbers that you're highlighting are common cases.

So, this needs more thought but off the top of my head here are some considerations:

  • Whether to match lines "naively" when there are equal numbers of red and green lines
  • If so, I think we still need some sort of threshold to avoid "forcing" noisy/spurious edits on two completely different lines.
  • If not, then what algorithm is used for deciding which lines are homologous? The original (current) algorithm uses a needleman-wunsch / levenshtein edit distance, and demands that that exceeds a certain threshold in order for two lines to be considered homologs. It's greedy, to avoid a quadratic-time search over all matching pairs (i.e., basically, assumes the lines are not out of order).
  • Is it possible to have the best of both worlds where we succeed on things like the "counter-example" above and also get the simple examples of same-number-of-red-and-green right?
  • Can we put together a collection of examples and score alternative algorithms?
  • What should the defaults be if delta ends up supporting alternative algorithms?

@0xC0FFEE
Copy link

0xC0FFEE commented May 1, 2020

Do you see value in that sort of behavior? I don't think any of diff-so-fancy, GitLab, GitHub etc get that one right.

I definitely think that this is can be valuable (even if it is still somewhat hard for me to comprehend and understand what changed). Also I guess you are right, as I've not seen cases like this handled before in other tools 👍.

I think gathering (real world) examples were delta produces unexpected outputs will help in the future.

@dandavison
Copy link
Owner

Great. Just in case it's not already clear, handling the unequal-number-of-lines case in general is something that delta does that AFAIK other tools do not. It's that which is currently causing us to encounter suboptimal examples in the same-number-of-lines case (but I'm sure we can improve this). Here's an example of Delta's unequal-number-of-lines handling:

Github:

image

Delta:

image

@dandavison
Copy link
Owner

dandavison commented May 1, 2020

Here's something to play with if you can build from source: I've pushed 5162437 to master. This allows you to activate an experimental feature by setting an environment variable:

export DELTA_EXPERIMENTAL_MAX_LINE_DISTANCE_FOR_NAIVELY_PAIRED_LINES=0.8

Here's what that does:

  1. If a hunk contains the same number of added and removed lines, then those will be assumed to match "naively", i.e. the i-th removed line is homologous to the i-th added line.

  2. If the distance between the two lines is less than 0.8, then apply emphasis colors to the inferred edit operations transforming the old line into the new line.

"distance" is a number between 0 (identical lines) and 1 (totally different lines). It is basically defined as

             number of characters involved in inferred edits
distance = -------------------------------------------------
                      number of characters in the line

See edits.rs for exact calculation.

So, for example, if the env var has a value > 0.32 then we get

image

In practice I'm thinking that values somewhere above 0.6 will be appropriate but it would be great to hear what you find works with this version of the algorithm.

@0xC0FFEE
Copy link

0xC0FFEE commented May 1, 2020

Thanks for providing a usable version of your idea so promptly. I'll test-drive this next week and report interesting results back. Kudos!

@dandavison
Copy link
Owner

Just to be clear, I put this up simply because it was very easy to do from what we have already; there could well be merit in more involved changes.

@dandavison
Copy link
Owner

If anyone tries out DELTA_EXPERIMENTAL_MAX_LINE_DISTANCE_FOR_NAIVELY_PAIRED_LINES then it would be really helpful to hear about your findings! I think that setting it to a large value (closer to 1.0) solves many of the "I was expecting this to be highlighted but it isn't" issues. However, we need to have some control on it to prevent it inferring "edits" between totally unrelated lines.

This aspect of delta development (improving the within-line highlight algorithms and options) is one that could benefit a lot from having multiple people's thoughts (since we use different languages, work in code bases with different styles, etc).

@phillipwood
Copy link
Contributor

delta --max-line-distance 0.8
image

With PR #1244 this is now highlighted correctly without having to tweak max-line-distance
image

@phillipwood
Copy link
Contributor

So, this needs more thought but off the top of my head here are some considerations:

* Whether to match lines "naively" when there are equal numbers of red and green lines

* If so, I think we still need some sort of threshold to avoid "forcing" noisy/spurious edits on two completely different lines.

* If not, then what algorithm is used for deciding which lines are homologous?   The original (current) algorithm uses a needleman-wunsch / levenshtein edit distance, and demands that that exceeds a certain threshold in order for two lines to be considered homologs. It's greedy, to avoid a quadratic-time search over all matching pairs (i.e., basically, assumes the lines are not out of order).

I wonder if it would be practical to use the patience diff algorithm to pair up lines based on unique tokens within them and then use the current LCS calculation to highlight them once they're paired. That would avoid a quadratic search to pair the lines up while hopefully avoiding the suboptimal matches from the greedy matching that delta currently uses.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants