Skip to content
This repository has been archived by the owner on Sep 11, 2020. It is now read-only.

Tree difference does not work correctly with unnormalized Unicode names #1057

Closed
vmarkovtsev opened this issue Jan 30, 2019 · 7 comments
Closed

Comments

@vmarkovtsev
Copy link
Contributor

vmarkovtsev commented Jan 30, 2019

Tree difference algorithm does not handle unnormalized Unicode names correctly (Tree listing does it correctly, however). Here is how to reproduce:

git clone git://github.com/dotnet/cli /tmp/cli
go run bug.go /tmp/cli

bug.go:

package main

import (
    "os"
	"strings"

	"gopkg.in/src-d/go-git.v4"
	"gopkg.in/src-d/go-git.v4/plumbing"
	"gopkg.in/src-d/go-git.v4/plumbing/object"
)

func main() {
    r, err := git.PlainOpen(os.Args[1])
    if err != nil {
        panic(err)
    }
	c, err := r.CommitObject(plumbing.NewHash("55c59d621ea22921ecaabd99266d45a7921aab70"))
	if err != nil {
        panic(err)
    }
	t1, err := c.Tree()
	if err != nil {
        panic(err)
    }
	t1, err = t1.Tree("TestAssets/TestProjects")
	if err != nil {
        panic(err)
    }
    files := map[string]bool{}
	t1.Files().ForEach(func(f *object.File) error {
		files[f.Name] = true
		return nil
	})
	
	
	c, err = r.CommitObject(plumbing.NewHash("6fcbefa4f7a0016a68d3cda52779298a5cd20837"))
	if err != nil {
        panic(err)
    }
	t2, err := c.Tree()
	if err != nil {
        panic(err)
    }
	t2, err = t2.Tree("TestAssets/TestProjects")
	if err != nil {
        panic(err)
    }
	
	diff, err := object.DiffTree(t1, t2)
	for _, d := range diff {
		if strings.HasPrefix(d.To.Name, "TestAppWithUnico") &&
			strings.HasSuffix(d.To.Name, "Program.cs") {
			println(d.String())
			println(files[d.To.Name])
		}
	}
}

We see:

<Action: Insert, Path: TestAppWithUnicodéPath/Program.cs>
true

The expected output is empty.

Here is what is happening. 55c59d621ea22921ecaabd99266d45a7921aab70 and 6fcbefa4f7a0016a68d3cda52779298a5cd20837 are two consecutive commits.

cd /tmp/cli
git checkout 55c59d621ea22921ecaabd99266d45a7921aab70
echo TestAssets/TestProjects/TestAppWithUni*

git checkout 6fcbefa4f7a0016a68d3cda52779298a5cd20837
echo TestAssets/TestProjects/TestAppWithUni*

Output:

TestAssets/TestProjects/TestAppWithUnicodéPath

TestAssets/TestProjects/TestAppWithUnicodéPath TestAssets/TestProjects/TestAppWithUnicodéPath

There are two almost identical directories. One is in normalized Unicode, the other is not.

@mcuadros
Copy link
Contributor

mcuadros commented Feb 6, 2019

Linux doesn't normalize unicode in fs, so this means that the directories are different. git behaves different?

@vmarkovtsev
Copy link
Contributor Author

Yes, Cgit also treats files as different.

@vmarkovtsev
Copy link
Contributor Author

However:

git ls-tree 0d2105a8250ff1b9abb745a83462cc84d15f0699

040000 tree f14424931f1ec29d238c0fc12d150cd67158cae9	"TestAppWithUnicode\314\201Path"
040000 tree f14424931f1ec29d238c0fc12d150cd67158cae9	"TestAppWithUnicod\303\251Path"

This is the reason why our diff tree algorithm thinks those directories are the same - it relies on Git hashes.

@vmarkovtsev
Copy link
Contributor Author

Tagging @alcortesm to let him know

@vmarkovtsev
Copy link
Contributor Author

Because the directory names are precomposed as @jfontan found: https://sourcegraph.com/github.com/git/git@d62dad7/-/blob/compat/precompose_utf8.c#L113:25

@vmarkovtsev
Copy link
Contributor Author

Interesting observation: the original tree diff algorithm checks directory names first, and then proceeds to checking the hashes: https://github.com/git/git/blob/master/tree-diff.c#L483

@vmarkovtsev
Copy link
Contributor Author

Removing the normalization from https://github.com/src-d/go-git/blob/master/utils/merkletrie/noder/path.go#L83 resolves the original issue. I am going to prepare the PR with tests.

vmarkovtsev added a commit to vmarkovtsev/go-git that referenced this issue Feb 11, 2019
Fixes src-d#1057

Signed-off-by: Vadim Markovtsev <vadim@sourced.tech>
vmarkovtsev added a commit to vmarkovtsev/go-git that referenced this issue Feb 11, 2019
Fixes src-d#1057

Signed-off-by: Vadim Markovtsev <vadim@sourced.tech>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants