Cooked Glob and Regex? #114

Closed
jaa127 opened this issue Feb 13, 2017 · 6 comments

@jaa127
Contributor

jaa127 commented Feb 13, 2017

Would it make sense to support cooked glob and regex in better-files?

At the moment better-files glob works as:

     // root_a = basedir / "a"
     // basedir / "a" / "a1" / "t1.txt"
     // basedir / "a" / "a1" / "t2.txt"

     root_a.glob("a1/*.txt").foreach(println)     // finds nothing
     root_a.glob("**/a1/*.txt").foreach(println)  // finds t1.txt, t2.txt

With cooked glob it would be:

     root_a.glob("a1/*.txt").foreach(println)     // finds t1.txt, t2.txt

A cooked glob or regex "cooks" the base path into the pattern (glob or regex), i.e. it prefixes the pattern with the base path, when the following hold:

  • the pattern is not an absolute path
  • the pattern does not start with a glob or regex special character

The cooked form makes it possible to write more natural globs, because the pattern does not have to begin with a path-crossing glob or regex component. This is especially important when the globs or regexes are used in configuration files, where non-programmers have to understand how they work.
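
To make the rule concrete, here is a minimal sketch of glob cooking. It is not the SN127 implementation: the name `cookGlob`, the set of characters treated as special, and the assumption of Unix-style `/` separators are all choices made for this illustration.

```scala
import java.nio.file.Path

object CookedGlobSketch {
  // Glob metacharacters checked at the start of the pattern.
  // Which characters count as "special" is an assumption of this sketch.
  private val globSpecials: Set[Char] = Set('*', '?', '[', '{')

  // Prefix the pattern with the base path unless the pattern is absolute
  // or starts with a special character. Assumes '/' as the separator.
  def cookGlob(base: Path, pattern: String): String = {
    val isAbsolute    = pattern.startsWith("/")
    val startsSpecial = pattern.headOption.exists(globSpecials.contains)
    if (isAbsolute || startsSpecial) pattern
    else s"${base.toString.stripSuffix("/")}/$pattern"
  }
}
```

Under these assumptions, cooking `a1/*.txt` against the base `/base/a` would give `/base/a/a1/*.txt`, while `**/a1/*.txt` and `/abs/*.txt` would be passed through unchanged.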

If this makes sense for better-files, I can provide a PR for this feature with tests. There is an existing implementation here (I am the author of SN127, so MIT licensing is not a problem):

findFiles with support for cooked globs and regex:
https://github.com/sn127/utils/blob/b116036de96f7b66fba29117ce91168bf4323c45/fs/src/main/scala/fi/sn127/utils/fs/FileUtils.scala#L204

Glob-cooking:
https://github.com/sn127/utils/blob/b116036de96f7b66fba29117ce91168bf4323c45/fs/src/main/scala/fi/sn127/utils/fs/FileUtils.scala#L270

Regex-cooking:
https://github.com/sn127/utils/blob/b116036de96f7b66fba29117ce91168bf4323c45/fs/src/main/scala/fi/sn127/utils/fs/FileUtils.scala#L288

Glob-cooking tests:
https://github.com/sn127/utils/blob/b116036de96f7b66fba29117ce91168bf4323c45/fs/src/test/scala/fi/sn127/utils/fs/GlobTest.scala

Glob-findFiles tests:
https://github.com/sn127/utils/blob/b116036de96f7b66fba29117ce91168bf4323c45/fs/src/test/scala/fi/sn127/utils/fs/GlobTest.scala#L236

Glob-findFiles target:
https://github.com/sn127/utils/tree/b116036de96f7b66fba29117ce91168bf4323c45/tests/globtree

And finally, here is an example of how this cooked form is used in an "end product". This is DirSuite, a scalatest extension that lets you define your tests as inputs and output references on the filesystem:

https://github.com/sn127/utils/blob/b116036de96f7b66fba29117ce91168bf4323c45/testing/src/test/scala/fi/sn127/utils/testing/DirSuiteDemo.scala#L49

@pathikrit
Owner

Hmm, interesting. Although this certainly seems useful, I have never come across "glob cooking". Is this something that is known outside sn127? If not, I don't think better-files is the right place for this?

If we were to incorporate it into better-files, we have a couple of options:

  1. Add a boolean cook parameter to the File.glob util

OR

  2. Add a new PathMatcherSyntax here: http://pathikrit.github.io/better-files/latest/api/better/files/File$$PathMatcherSyntax$.html

I would suggest the latter.

@jaa127
Contributor Author

jaa127 commented Feb 13, 2017

git's gitignore works in kind of the same way. With it you don't have to provide a path-prefix glob if the gitignore file is inside a subdirectory. If it is at the top level, then there has to be a path-prefix glob. So git, in some sense, "cooks" the current directory into the glob.

As for public software, this is used by DirSuite, which in turn is used by Abandon.

Why this has been handy so far:

  • Glob cooking makes configuration settings look clearer, because you don't have to count stars at the beginning of the pattern
  • It is marginally faster, because the beginning of the regex is a fixed string instead of a wildcard, so the matcher doesn't have to scan the whole string (it can exit early). This is probably totally negligible in real life.

If this lands in better-files, then a new PathMatcherSyntax would definitely make sense. Then it would be clear which one is in use, and it would also be possible to match direct child subdirectories with a wildcard, without matching deeper subdirectories.

 basedir.cookedGlob("*/12/**.txt")

When does that come up? For example, when you shard ISO dates by year, month, and day, and you have to find all items from December (12) across multiple years. With a normal glob, if the pattern starts with a path-crossing wildcard, it will also match all days that are "12"; if it does not start with one, it won't match anything.
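
The sharding case can be checked directly against java.nio's glob matcher. The sketch below assumes a Unix-like file system and a hypothetical layout `/data/<year>/<month>/<day>/<file>`; the object and value names are made up for this illustration, and the first pattern is written out by hand to show what a base-path-prefixed glob would look like.

```scala
import java.nio.file.{FileSystems, Paths}

object ShardedGlobSketch {
  private val fs = FileSystems.getDefault

  // True if `path` matches the java.nio glob `pattern`.
  def matches(pattern: String, path: String): Boolean =
    fs.getPathMatcher("glob:" + pattern).matches(Paths.get(path))

  // Hypothetical sample paths: one file from December, one from the 12th of November.
  val december = "/data/2016/12/24/a.txt"
  val twelfth  = "/data/2016/11/12/a.txt"

  def main(args: Array[String]): Unit = {
    // Prefixed pattern: '*' stays within one directory level, so only the month matches.
    println(matches("/data/*/12/**.txt", december)) // true
    println(matches("/data/*/12/**.txt", twelfth))  // false

    // Raw pattern with a leading '**': it also picks up the day "12".
    println(matches("**/12/**.txt", twelfth))       // true

    // Raw pattern without a path-crossing prefix: matches nothing here.
    println(matches("*/12/**.txt", december))       // false
  }
}
```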

Maybe this could be thought of like the ls command: you don't have to provide **/*.txt, because you are already inside the directory. With better-files syntax this is even more important (imho):

List all *.txt under a1:

   a1.glob("**/*.txt")

vs.

   a1.glob("*.txt")

Based on the above, maybe it could be argued that the cooked glob should be the default, and the non-cooked one could be rawGlob?

@pathikrit
Owner

pathikrit commented Feb 13, 2017

@jaa127 : Okay let's put this in better-files and make it default. Let's avoid the word "cooking" since that seems non-standard.

Since we would be breaking backwards compatibility, this needs to go in v3

Also, please document this in the README since it would deviate from UNIX/Java's glob behaviour.

@jaa127
Contributor Author

jaa127 commented Feb 14, 2017

That's great, thanks! If there are some oddities, or if we have second thoughts about making this the default, then this can be revisited later, before releasing.

Do you have an idea when v3.0.0 should be ready (in days, weeks, months)?

@pathikrit
Owner

pathikrit commented Feb 14, 2017

> Do you have an idea when v3.0.0 should be ready (in days, weeks, months)?

Weeks.

If you want to use it now, you can depend on 2.17.2-SNAPSHOT

jaa127 added a commit to jaa127/better-files that referenced this issue Feb 15, 2017
Support automatic path prefix with glob (`pathGlob` and `pathRegex`).
In this mode, pattern is prefixed with `File`-instance's path.

Standard java PatternMatch names `regex` and `glob` behaves
as they do in java world.

Default is `pathGlob`, that default behaviour is similar
with  Unix `ls` or with
[Python 3.5+ glob](https://docs.python.org/3/library/glob.html)

Please see pathikrit#114 for background information.
@jaa127
Contributor Author

jaa127 commented Feb 15, 2017

Hi, here is the PR. It's good if there is some time before 3.0.0, so that this can be adjusted if needed.

While looking at some external references, I noticed that the new Python glob works the same way as this implementation. Here is the Python glob doc: https://docs.python.org/3/library/glob.html

There is one difference between Python's glob.glob('**/*.txt', recursive=True) implementation and this implementation: Python also lists the files in the current directory, even though the pattern contains that slash. Java doesn't do that.

It could be that Python is in fact wrong in that case, because ls **/*.txt works the same way as this implementation.

ls **/*.txt
a/a.txt  a/x.txt  b/b.txt  c/c.txt  c/x.txt
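
The Java side of this comparison can be reproduced with a plain java.nio matcher. This is only a sketch of the matcher behaviour, not of the PR's code; the object name and the sample relative paths are made up for this illustration.

```scala
import java.nio.file.{FileSystems, Paths}

object DoubleStarSketch {
  // java.nio glob: "**/" requires at least one directory separator in the path.
  private val matcher = FileSystems.getDefault.getPathMatcher("glob:**/*.txt")

  def matches(path: String): Boolean = matcher.matches(Paths.get(path))

  def main(args: Array[String]): Unit = {
    println(matches("a/a.txt")) // true: file inside a subdirectory
    println(matches("a.txt"))   // false: file in the current directory is not matched
  }
}
```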

pathikrit added a commit that referenced this issue Feb 20, 2017
pathikrit added a commit that referenced this issue Feb 21, 2017