Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrectly (?) matches files inside directories that do match #41

Closed
adrienverge opened this issue Oct 8, 2020 · 13 comments
Closed

Incorrectly (?) matches files inside directories that do match #41

adrienverge opened this issue Oct 8, 2020 · 13 comments

Comments

@adrienverge
Copy link
Contributor

Hello @cpburnz, I've looked existing issues but couldn't find this one. It was initially reported in yamllint issues: adrienverge/yamllint#334 (yamllint uses pathspec).

Considering these files:

file.yaml
dir/
└── dir/file.sql
dir.yaml/
└── dir.yaml/file.sql        ← problem here

pathspec.PathSpec.from_lines('gitwildmatch', ['*.yaml']) does match file.yaml, dir.yaml, but also dir.yaml/file.sql. The latter isn't what I would expect, because *.yaml feels like "files that end with .yaml", not "any file in any path that contains .yaml".

Is it the expected behavior? If yes, what is the correct pattern for "files that end with .yaml"?

Looking forward to your answer, and thanks for this very useful project 👍

@cpburnz
Copy link
Owner

cpburnz commented Oct 10, 2020

Hi @adrienverge,

Due to the current implementation it makes sense to me why dir.yaml/* is matching the pattern *.yaml. A dot in a folder name resembling a file extension is valid. There is not currently a way to tag a pattern to only match against files. This is the expected behavior. The most straight forward way to exclude directories with the .yaml extension is to add a negative pattern for it. E.g.,

p = pathspec.PathSpec.from_lines('gitwildmatch', [
    '*.yaml',
    '!*.yaml/'
])
p.match_file('file.yaml')
# True
p.match_file('dir.yaml/file.sql')
# False

Please let me know whether this works for you. I realize this may or may not work depending on your use case.

@adrienverge
Copy link
Contributor Author

Thanks for your answer @cpburnz! To complete it, I notice that Git + .gitignore behaves the same: if it contains a line *.yaml, it will also ignore files like dir.yaml/file.sql.

The ['*.yaml', '!*.yaml/'] pattern looks the right way to go 👍

However, while it works with Git and .gitignore, it doesn't seem to work with pathspec. More precisely, with the following tree, .gitignore successfully matches dir/file.yaml and dir.yaml/file.yaml, but pathspec doesn't match dir.yaml/file.yaml.

dir
├── index.txt     ← already tracked by git
├── file.sql
└── file.yaml
dir.yaml
├── index.txt     ← already tracked by git
├── file.sql
└── file.yaml

Said differently:

p = pathspec.PathSpec.from_lines('gitwildmatch', ['*.yaml', '!*.yaml/'])
p.match_file('dir/file.yaml')       # True
p.match_file('dir.yaml/file.yaml')  # False ← I would expect True (like with .gitignore)

adrienverge added a commit to adrienverge/yamllint that referenced this issue Oct 10, 2020
@cpburnz
Copy link
Owner

cpburnz commented Oct 10, 2020

@adrienverge Oops, I overlooked that. This can still be accomplished be adding an addition pattern to reinclude .yaml files under a .yaml directory:

p = pathspec.PathSpec.from_lines('gitwildmatch', ['*.yaml', '!*.yaml/', '*.yaml/**/*.yaml'])
p.match_file('dir/file.yaml')      # True
p.match_file('dir.yaml/index.txt') # False
p.match_file('dir.yaml/file.yaml') # True

The *.yaml/**/*.yaml pattern is redundant for Git and does not affect its results. The cause for this difference is pathspec strictly implements Git's wildmatch patterns while Git does additional processing for .gitignore such as as tracking included/excluded directories.

@adrienverge
Copy link
Contributor Author

Thanks again @cpburnz!

Unfortunately this recursive pattern including/excluding seems to bring new problems:

  1. While the first two patterns match paths (e.g. dir/dir/dir/file.yaml), the third pattern doesn't work. It seems that I need to add **/ at the beginning, so it would be:

    '*.yaml',
    '!*.yaml/',
    '**/*.yaml/**/*.yaml',
  2. With these new patterns, now dir.yaml/dir.yaml/index.txt will be matched. We would then need to add two new patterns:

    ...
    '!**/*.yaml/**/*.yaml/',
    '**/*.yaml/**/*.yaml/**/*.yaml',

    ... but we can recurse for dir.yaml/dir.yaml/dir.yaml/index.txt, so we would need an infinite list of patterns.

  3. The problems is actually even worse, because we need to match different extensions/patterns. By default it's *.yaml, *.yml and .yamllint but users can set any pattern using configuration.

    Which means we would need to add (for the simple use case with just 3 extensions):

    '**/*.yaml/**/*.yaml',
    '!**/*.yaml/**/*.yaml/',
    '**/*.yml/**/*.yaml',
    '!**/*.yml/**/*.yaml/',
    '**/.yamllint/**/*.yaml',
    '!**/.yamllint/**/*.yaml/',
    '**/*.yaml/**/*.yml',
    '!**/*.yaml/**/*.yml/',
    '**/*.yml/**/*.yml',
    '!**/*.yml/**/*.yml/',
    '**/.yamllint/**/*.yml',
    '!**/.yamllint/**/*.yml/',
    '**/*.yaml/**/.yamllint',
    '!**/*.yaml/**/.yamllint/',
    '**/*.yml/**/.yamllint',
    '!**/*.yml/**/.yamllint/',
    '**/.yamllint/**/.yamllint',
    '!**/.yamllint/**/.yamllint/',
    ... + infinite recursive patterns from bullet 2.

The cause for this difference is pathspec strictly implements Git's wildmatch patterns while Git does additional processing for .gitignore such as as tracking included/excluded directories.

So I understand that pathspec implements the standard, while Git adds some user-friendly, non-standard processing.
Given the bullets above, and the fact that the need "match all paths whose filename end with *.yaml or *.yml" cannot be achieved without an infinite pattern list, do you think it would be doable to add an option to pathspec? Would it be easy? Can we help in any way?
If the answer is no, are you aware of another library that could achieve this goal?

(PS: I don't know how others use your useful library pathspec, but my guess is that 99% are looking for something that behaves exactly like .gitignore. If I'm wrong I would be curious to know the other use-cases!)

@cpburnz
Copy link
Owner

cpburnz commented Oct 13, 2020

@adrienverge That is indeed problematic without a good solution for current feature set.

So I understand that pathspec implements the standard, while Git adds some user-friendly, non-standard processing.

Precisely

Given the bullets above, and the fact that the need "match all paths whose filename end with *.yaml or *.yml" cannot be achieved without an infinite pattern list, do you think it would be doable to add an option to pathspec? Would it be easy? Can we help in any way?

It's certainly doable. I think the ideal implementation would be a class implementing the .gitignore specific behavior, possibly a subclass of PathSpec. I welcome a pull request implementing it.

If the answer is no, are you aware of another library that could achieve this goal?

The answer isn't no, but I am aware of another library that tries to specifically implement .gitignore behavior: igittigitt. I have no experience with it myself.

(PS: I don't know how others use your useful library pathspec, but my guess is that 99% are looking for something that behaves exactly like .gitignore. If I'm wrong I would be curious to know the other use-cases!)

That seems to be the case. My original use case was I wanted a library with glob-like file matching similar to git or rsync.

@adrienverge
Copy link
Contributor Author

I welcome a pull request implementing it.

Thanks again for following this up @cpburnz 👍

@liopic @dafyddj @jsok you are the ones affected by this (via adrienverge/yamllint#279 and adrienverge/yamllint#334), would you like to contribute to python-pathspec and implement a .gitignore-like subclass of PathSpec?

If not, I can try to find time myself but I don't know precisely when it could be. I already went forward and found this relevant code for do_match_pathspec() in Git source: https://github.com/git/git/blob/d4a3924/dir.c#L432-L501
@cpburnz if you have any pointers or documentation on this "extra processing" that Git does, it would be more than welcome :)

@dafyddj
Copy link

dafyddj commented Oct 15, 2020

@adrienverge Interesting to read, but I am not a developer so I don't think I can help with this - but good luck!

@excitoon
Copy link

Guys this is not a bug. Git works exactly as the spec says:

If there is a separator at the end of the pattern then the pattern will only match directories, otherwise the pattern can match both files and directories.

https://git-scm.com/docs/gitignore

@cpburnz
Copy link
Owner

cpburnz commented Aug 28, 2022

@excitoon I think it's an edge case not adequately described in the gitignore specification. Git behaves exactly as described in this issue, and at the moment pathspec behaves differently. I'm working fixing the discrepancy.

@excitoon
Copy link

@cpburnz You can check how I solved it in my module.

First, making a group for extra path parts:
excitoon/gitignorefile@6763d47#diff-21e52cb6c7e0acafac258d280cac806bcf75715e4207b85d451e0c9e88d9f2b5R305-R306

Second, analyzing if anything matched. If it did not, we shall query filesystem to check if it is a directory:
excitoon/gitignorefile@6763d47#diff-21e52cb6c7e0acafac258d280cac806bcf75715e4207b85d451e0c9e88d9f2b5R234

@excitoon
Copy link

One more example:

core
!core/

@cpburnz
Copy link
Owner

cpburnz commented Aug 31, 2022

There will be a way to handle this scenario in the v0.10.0 release. The new class, GitIgnoreSpec, will need to be used. It will prioritize patterns that directly match the file over an ancestor directory. It can be used as:

# The "gitwildmatch" pattern factory is assumed by GitIgnoreSpec.
p = pathspec.GitIgnoreSpec.from_lines(['*.yaml', '!*.yaml/'])
p.match_file('dir/file.yaml')       # True
p.match_file('dir.yaml/file.yaml')  # True

@cpburnz cpburnz closed this as completed Aug 31, 2022
@adrienverge
Copy link
Contributor Author

Thanks @cpburnz!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants