Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

diffLines seems broken #68

Closed
wfalkwallace opened this issue Aug 10, 2015 · 10 comments
Closed

diffLines seems broken #68

wfalkwallace opened this issue Aug 10, 2015 · 10 comments
Milestone

Comments

@wfalkwallace
Copy link

screenshot 2015-08-09 20 18 47

@wfalkwallace
Copy link
Author

UPDATE: It seems like it's treating the trailing newline incorrectly:
screenshot 2015-08-09 21 25 11
and diffTrimmedLines is doing the same. It's seeing the trailing newline difference and treating the line as changed. Is that really intended?

@kpdecker
Copy link
Owner

What browser are you seeing this on?

@wfalkwallace
Copy link
Author

Chrome 44.0.2403.130 (64-bit) on a MacBook Pro

the diffLines seemed weird, but ¯\_(ツ)_/¯ -- I'm not sure how to count the newline wrt added/removed. But diffTrimmedLines should definitely fix the above situaiton, no?

@kpdecker
Copy link
Owner

Thanks. This is not browser specific but an issue with the newline being included in the tokenizer's output, so the rest of the algorithm considers them to be distinct.

[ 'restaurant' ] [ 'restaurant\n', 'hello' ]

Under the trimmed case, we even go as far as explicitly adding these back in after the trim:

line += '\n';

The reason that all of this is done is that the string can not be reconstructed without these newlines. All of the operations that combine tokens just do + (More accurately a join('')) and omitting the \n from the token would cause that to merge all of the lines together.

It is possible to do a custom comparison the ignores the line for the sake of comparison checks but this too could run into issues with the newline values sometimes being included and sometimes being not, based on the diff path that is selected for two given inputs.

I suspect that this is basically the reason that \\ No newline at end of file exists in GNU diff's output and I'm not sure that I have a good answer to this other than inserting a newline into the end of the string if it is omitted.

@wfalkwallace
Copy link
Author

I figured it was the line tokenizer. I see this as a real bug though, since it's arbitrarily adding back newlines, which means the value isn't necessarily valid (ie. it may not incorporate a trailing newline and will break non-unix newlines).

Do you need to reconstruct the string from the tokens at any point? Why not use character offsets attached to your tokens, the way esprima (for example) does in its range property? That would allow you to grab valid substrings from the original content.

Even if you don't want to go that far, I still think it's worth it to add the newlines back after the comparison, or trim again when comparing -- right now diffTrimmedLines('restaurant', 'restaurant\nhello') spits out [{value: 'restaurant', added: undefined, removed: true, count: 1}, {value: 'restaurant\nhello', added: true, removed: undefined, count: 2}] which just isn't correct.

@kpdecker
Copy link
Owner

Regarding the arbitrary newline insertion, I merged the patch and line diff implementations last night since I ended up noticing they were basically the same while investigating this issue. This change should resolves that particular issue: 1597705

Regarding ranges, that's a ton of work that will break all custom diff implementations. That's not going to be feasible anytime in the near term. It is possible to tokenize or compare ignoring the newline characters but this ties into the issue above of newlines sometimes being included and sometimes not (ranges suffer from the same problem)

Regarding correctness, the code is operating correctly under the definition that a line is inclusive of it's newline character. This is needed to properly create patches, etc. Unfortunately this directly conflicts with making human readable diffs, like I presume is what you are going for.

Using a tokenizer similar to the following, the behavior is closer to what I think you are looking for:

tokenize(value) {
 return value.split(/(\n|\r\n)/);
}

This results in [ 'restaurant' ] and [ 'restaurant', '\n', 'hello' ] and an eventual diff of restaurant<ins>\nhello</ins>.

Basically I think a technical solution has been found... but implementing it without breaking other use cases might be a concern. I need to think about the best way to expose this as an API but will try to include something in the next release that exposes this behavior, one way or the other.

@kpdecker kpdecker added this to the 2.1.0 milestone Aug 11, 2015
@wfalkwallace
Copy link
Author

sounds good on the arbitrary reinsertion, and the truth is I'm not particularly concerned with microsoft crlf's (but did notice it).

I agree the definition that a line is inclusive of its newline character makes sense for diffLines but it doesn't for diffTrimmedLines (trim typically includes line breaks) which would mean diffTrimmedLines is not operating correctly. Is there any other way around this without a major overhaul and apart from a fork of the repo for me?

your solution might work though - thanks for adding this to the roadmap!

@kpdecker
Copy link
Owner

Trimmed I think we can reasonably go the route I was proposing and this is not a major change. For the non-trimmed implementation, it's more of that I need to figure out what to call it as well as finalize the other features in this release. diffLinesWithoutNewLines is a bit clunky.

You can also do a non-trimmed implementation in the interim using something like the following:

var myDiff = new JsDiff.Diff();
myDiff.tokenize = function(value) {
  return value.split(/(\n|\r\n)/);
};
myDiff.diff(oldStr, newStr);

@kpdecker
Copy link
Owner

Functionality implemented via diffLinesNL, I'm not a huge fan of the name but wasn't sure of a better option. Open to suggestions if there is something better suggested before this is in a tagged release.

@kpdecker
Copy link
Owner

Released in 2.1.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants