Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

glob support in core #38

Open
boneskull opened this issue Sep 3, 2019 · 37 comments
Open

glob support in core #38

boneskull opened this issue Sep 3, 2019 · 37 comments

Comments

@boneskull
Copy link
Collaborator

I'm curious if anyone thinks adding glob support to core would be a good idea?

Globs (globspecs? what's the right term?) are natively supported in POSIX (?) shells, which means you can do this:

$ some-cli-app **/*.foo.js

This is expanded by the shell before execution of some-cli-app into a list of files matching the glob.

But we get in to trouble when our scripts (e.g., in package.json) use globs, because they will not be expanded, and the scripts will fail (or otherwise exhibit unexpected behavior) on Windows.

To support globs in a portable way, any CLI app would need to pull in glob or a similar module, and pass the glob as a quoted string in the arguments:

$ some-cli-app "**/*.foo.js"

(On Windows, you must use double-quotes above, but using single-quotes in POSIX shells means env variables [e.g., $FOO] will be treated as literals; env variables are another can of worms)

The utility of glob has been well-established, and it's used in enough CLI apps that it may be a good candidate for vendoring or some other implementation. I haven't looked at the glob source, so it's unclear whether vendoring would be the best route, but I do know it contains several other dependencies which would also need to be pulled in (right @iansu?).

Thoughts?

cc @isaacs

@iansu
Copy link
Contributor

iansu commented Sep 3, 2019

I do think this is a good idea but as you point out glob does have 6 dependencies, including minimatch so it's unclear to me how we would vendor it.

@coreyfarrell
Copy link
Member

globs are not supported by bash v5 on Fedora 30. Running some-cli-app **/*.foo.js will produce the same result as some-cli-app */*.foo.js - the double-star is not supported.

As for pulling glob into the core I'm not sure. I think it might be more appropriate to first ask if minimatch functionality belongs in the core. If so I think it would be important to evaluate pros/cons of multiple modules with similar functionality to determine which one would be best for the core (minimatch and micromatch each have about 15 million downloads per week).

@ljharb
Copy link
Member

ljharb commented Sep 3, 2019

Glob support would be great precisely because it’s not consistent across platforms, but is needed for things like git/npm/eslint ignore, eslint overrides, etc.

@jonschlinkert
Copy link

As often as globbing is used, and since most developers really don't realize how many edge cases and platform-specific issues need to be considered to produce a consistent experience, IMHO it would be great to add native support.

If Node.js does decide to include this functionality, I would be willing to put some time into helping (assuming that's of any interest). Either way, my recommendation is that you look at the matching and fs operations as completely separate chunks of logic.

  • minimatch and picomatch are both focused on the former (just the matching)
  • glob is focused on the latter (but is currently coupled to minimatch)

If it's decided that this functionality should be re-written directly in Node.js, I'd be happy to take a stab at contributing that logic, starting with just the matching. I think @isaacs has more experience and knowledge of the fs logic and cross-platform considerations than I do. But I'd be willing to help with that as well.

If you decide to use external libs for any of this, I would of course also be willing and interested in collaborating on making any adjustments, improvements, or changes to our libs that are necessary to make them meet your requirements.

@sindresorhus
Copy link

sindresorhus commented Sep 4, 2019

I would recommend also taking a look at fast-glob which is both faster and better maintained than glob.

// @mrmlnc

@ljharb
Copy link
Member

ljharb commented Sep 4, 2019

Let's not throw subjective shade at other packages, though.

@jonschlinkert
Copy link

jonschlinkert commented Sep 4, 2019

sorry just saw your comment @ljharb (after I did thumbs up)

@boneskull
Copy link
Collaborator Author

We're meeting tomorrow and this will be on the agenda. To everyone offering comments: please attend if you're available!

@boneskull
Copy link
Collaborator Author

@sindresorhus It sounds like you'd be in support of pulling fast-glob into core, then?

@jonschlinkert
Copy link

@boneskull I'd be in support of that decision too :)

@sindresorhus
Copy link

sindresorhus commented Sep 5, 2019

@ljharb It was not my intention to "throw a subjective shade". I'll be better at justifying my opinions in the future. I just wanted to mention an alternative.

Here are benchmarks comparing glob and fast-glob: https://gist.github.com/mrmlnc/f06246b197f53c356895fa35355a367c#file-fast_glob_benchmark_server-txt

By "better maintained", I meant more actively maintained. Poor word use on my part. glob has almost not had any real code commits in 3 years.

For context, I built globby (15M weekly downloads now) some years ago to provide a subjectively nicer API for glob with a Promise API, support for multiple patterns, negated patterns, etc. Then fast-glob came along which fixed a lot of bugs found in glob and I eventually switched globby over to use fast-glob. In my subjective opinion, fast-glob has a better API and codebase.

@sindresorhus
Copy link

It sounds like you'd be in support of pulling fast-glob into core, then?

Yes

@boneskull
Copy link
Collaborator Author

OK, we discussed this at today's meeting

  • Both glob and the underlying matcher should be available for consumption
  • To start, let's create a PR which adds matching support (a la minimatch or micromatch)
  • Suggested API: path.match()
  • "gitignore-style" matching is also widely used, and this should be a use case
  • @iansu and/or myself will provide a comparison of micromatch vs. minimatch vs. bash vs. gitignore to find the intersection of features; we can narrow down an API from there
  • After this, there are several options w/r/t adding glob support to fs, and we can cross that bridge later

gimme feedback

@isaacs
Copy link

isaacs commented Sep 6, 2019

I would be thrilled to hand glob over to node core.

That being said, there are a few rough edges that ought to be polished down first, so here I go brutally trashing my own work :)

The sync/async management is (overy) clever. It works fairly well, but it's not my favorite approach. Since it somewhat predates proper Classes in JavaScript, glob uses a different approach than most of my other more recent libraries that walk over trees. (Eg, tar, npm-packlist, etc.) The code is a bit brittle and hard to work with. It's also very (overly) optimized for performance, and meticulously designed to match Bash 4's glob behavior.

None of which would in itself be a problem, if glob was actually a good cross-platform general-purpose file system globber. We could call it done and let it age gracefully. However, it has some issues with UNC paths on Windows, and lacks proper promise support, both of which are very challenging to fix in its current state.

I'd love to give it a rewrite, and may someday, but honestly, if it was a priority, I probably would've done it by now. Maybe the best approach is to use it as a reference implementation, but rewrite it, fixing all the bugs and preserving the tests and actual glob resolution behavior. It works well enough for a lot of people right now (including me!), but caveat emptor: if you pull it in as-is, you might be welcoming a very stinky camel into your tent. Make sure it's not too hard to replace, if you do so.

@isaacs
Copy link

isaacs commented Sep 6, 2019

By "better maintained", I meant more actively maintained. Poor word use on my part. glob has almost not had any real code commits in 3 years.

"Better maintained" is actually a fair and accurate assessment. No shade taken :)

@boneskull
Copy link
Collaborator Author

boneskull commented Sep 17, 2019

I'm working to document this initiative, so here's a rough draft of what I have, including a comparison of existing solutions. If I'm wrong here, please say so. I'll send a PR after it's had a bit of review, but wanted to keep everything in this issue for now.


Globbing in Node.js core

See nodejs/tooling#38.

Motivation

TL;DR: Runtime handling of Globs is needed for better portability

Globs (henceforce capitalized) are natively supported in shells like Bash, which means you can do this, which will recursively find files matching *.foo.js, from your current working directory:

$ some-cli-app **/*.foo.js

Bash expands the above before execution of some-cli-app into a list of files matching the Glob.

But we get in to trouble when our scripts (e.g., in package.json) use Globs, because Windows won't necessarily expand these, and the scripts will fail (or otherwise exhibit unexpected behavior).

To support Globs in a portable way, any CLI app would need to pull in glob or a similar module, and pass the Glob as a (double-)quoted string in the arguments:

$ some-cli-app "**/*.foo.js"

The utility of glob and others like it has been well-established, and it's used in enough CLI apps that the functionality belongs in Node.js core.

Roadmap

The first thing that needed is a matching implementation. A Glob implementation (which uses the matcher to compare filenames) would then follow.

We expect the new function(s) would be added to the path module. That said, we must be careful to not conflate filepaths and Globs, because they are not the same.

Matching Implementation

An underlying matching implementation is a prerequisite to Glob support in Node.js core. To determine what we should be implementing, we'll compare existing solutions.

Of particular concern is how path separators are handled. While Globs are not filepaths, developers often treat them as such. The output of path.join() will be system-dependent, and if its output is passed into a matching implementation, the result may be unexpected. Please see micromatch's notes on the topic.

Comparison of Prior Art

We'll compare the featureset of the following:

We've identified the latter two as being most popular packages which provide the underlying matcher to a Glob implementation.

We will consider the following relevant features:

  • Wildcards (*.js)
  • "globstar" (**.js, a/**/*)
  • Negation (!a/*.js)
  • "extglob" (+(x|y), !(a|b))
  • Character set (foo-[12345].js)
  • Character ranges (foo-[1-5].js)
  • Character classes ([[:alpha:][:digit:]])
  • Brace expansion (bar/{a,b,c}.js)
  • Brace expansion ranges (foo/{1..5}.md; equivalent to foo/[1-5].md)
  • Case-insensitive matching
  • Comments (#*.js)
  • Single-character (foo.?s)
  • Implicit dotfile matching

For purposes of this best-effort comparison, we'll consider Bash's full capabilities instead of its default capabilities. All comparisons should reflect the latest version (as of this writing) of the respective software.

If a user can toggle the specific behavior, that's noted with "Flag."

Wildcards Globstar Negation extglob Char set Char ranges Char class Brace Expansion Brace Ranges Case Comments Single-Char Dotfile
Bash Yes Flag Yes Flag Yes Flag Yes Yes Nob Flag Yes Yes No
gitignore Yes Yes Flaga No Yes Yes Yes Yes No No Yes Yes No
micromatch Yes Flag Flag Flag Flag Flag Flag Flag Flag Flag Noc Yes Flag
minimatch Yes Flag Flag Flag Yes Yes Yes Flag Flag Flag Flag Yes Flag
  • a: Use \! at the beginning of a pattern for a literal !
  • b: Bash treats "brace expansion ranges" ({1..5}) as sequences which can be used to create multiple patterns at once, but are not patterns themselves.
  • c: If micromatch does support comments, it's undocumented.

@boneskull
Copy link
Collaborator Author

My key takeaway: if you expected the implementation to be simple, you will be disappointed. 😄 I don't think we can get away with not offering a handful of flags, either...

@jonschlinkert
Copy link

c: If micromatch does support comments, it's undocumented.

Correct, micromatch doesn't support comments.

My key takeaway: if you expected the implementation to be simple, you will be disappointed. 😄 I don't think we can get away with not offering a handful of flags, either...

Indeed. I recommend also testing nested braces and extglobs, as well as braces and extglobs with nested negation patterns. Nested patterns are used more often than one might guess.

Regarding bash, both minimatch and micromatch have (intentionally) different behavior in some cases.

Also, it's important to note that .gitignore has completely different matching rules than glob. Both minimatch and micromatch can be used for matching gitignore patterns, but neither library has native support. FWIW I haven't seen any node.js libraries do this correctly yet.

@jonschlinkert
Copy link

Another thought.... joining paths to globs is not trivial, but it's totally possible to do it correctly, consistently, in a cross-platform compatible way.

@boneskull
Copy link
Collaborator Author

@jonschlinkert

Another thought.... joining paths to globs is not trivial, but it's totally possible to do it correctly, consistently, in a cross-platform compatible way.

In your opinion, does micromatch do this correctly?

Also, it's important to note that .gitignore has completely different matching rules than glob. Both minimatch and micromatch can be used for matching gitignore patterns, but neither library has native support. FWIW I haven't seen any node.js libraries do this correctly yet.

It seems it's implemented in its internal wildmatch.c. There's a JS port of wildmatch but I can't vouch for its accuracy. It also hasn't been published in 5 years.

@boneskull
Copy link
Collaborator Author

boneskull commented Sep 18, 2019

@jonschlinkert I'm curious. If you were going to pare down micromatch to its essence (there are a lot of options, as you'll probably agree), what would you keep? What would you drop?

(Since we're aiming to determine our "MVP", as it were)

@boneskull
Copy link
Collaborator Author

IMO, in our implementation, I could definitely do without:

  • implicit dotfile support (a flag that essentially ignores the leading dot of a filename when matching)
  • comments--what is the use case for this outside of a script or .gitignore file? multiline globs don't make much sense in-and-of themselves unless newlines are stripped anyway, right?
  • brace expansion ranges ({1..5}); too confusing in this context. in minimatch, this is equivalent to [1-5]. micromatch seems to have more granular control over the behavior. it veers into "shell expansion" territory. I'm not sure if this is needed at all, but I'm happy to be convinced otherwise

@boneskull
Copy link
Collaborator Author

pls note that on Friday we have a meeting and everybody is invited. bring a +1 if you want

@vweevers
Copy link

Same question as in #19: what can node core do better than userland? I'm not opposed, but I do think node core can aim higher than "users won't have to install glob" and that those goals should be more clearly formulated before diving into specifics.

@jonschlinkert
Copy link

jonschlinkert commented Sep 18, 2019

In your opinion, does micromatch do this correctly?

Micromatch doesn't do any kind of joining, it's all left to the user to figure this out - same as minimatch and node-glob. However, I did start implementing this behavior somewhere, and I've given examples in many issues over the past few years, and I'd be happy to discuss over code when the time comes.

In essence, it boils down to ensuring that all slashes are converted to posix slashes in a non-glob file path before converting it to a glob pattern. There are other nuances that would warrant a longer discussion.

If you were going to pare down micromatch to its essence (there are a lot of options, as you'll probably agree), what would you keep? What would you drop?

My first reaction is to drop brace expansion, like you mentioned. Especially since globs with braces can be expanded before passing them to the matcher. (as a side note micromatch expands {1..5}} to [1-5], minimatch expands {1..5} to ['1', '2', '3', '4', '5']). But these are contrived examples. Real world patterns are closer to things like account-S{001..990}, which would expand to account-S(0{0,2}[1-9]|0?[1-9][0-9]|[1-8][0-9]{2}|9[0-8][0-9]|990) with micromatch, and would be an array of 990 globs with minimatch).

Overall, as you think about which features to include or drop, I think the most important consideration is how globs are used. With CLI tools like eslint, end-users don't have the ability to pass options to the matcher, or customize behavior by, for example, expanding braces beforehand. Removing a bunch of features or options will just make globbing useless or painful to a lot of end-users, or cause implementors to find hacky workarounds. I found this out the hard way over the past few years. Globbing is hard to do correctly, and implementors always seem to underestimate the number of edge cases their users will experience with it (especially because of Windows paths, and the general misconception that globs are just fancy file paths). This isn't an argument for not having native glob support in Node, it's the opposite. I think we can smooth this out and create consistency for users.

If we look at this from 20,000 feet, what we think of as "globs" in Node.js is actually a combination of several different matching concepts, some from shell expansion, others from elsewhere:

  • filename expansion - which mostly includes wildcards: *, ?, and "globstars": **
  • brace expansion - which you pointed out above: {a,b,c} => ['a', 'b', 'c']
  • posix character classes - i.e. [:alpha:]
  • extglobs - !(foo|bar)

Beyond features, there are a handful of things to consider in determining behavior:

  1. Each implementation has slightly different rules - bash and zsh even use different rules
  2. Each has different options
  3. Some features, in combination with certain options, contradict one another
  4. There is no canonical specification for any of these (at least I have been unsuccessful in finding one)
  5. Expected behavior differs in some scenarios when using globs in Node.js versus bash.

there are a lot of options, as you'll probably agree

I do. I really, really do. lol. When I created the parser for the latest micromatch, I had to make decisions about how the potentially-conflicting rules of each matching concept should work together. Instead, I avoided making those decisions by allowing the user to do it via options.

If micromatch is pulled into Node, I would be more than happy to reduce options or hard-card certain behavior. This is not expected of course.

@jonschlinkert
Copy link

jonschlinkert commented Sep 18, 2019

Almost forgot, picomatch is the matcher behind micromatch.

(edit: I initially said that picomatch doesn't have brace support, but I forgot that it does in the latest release. However, micromatch adds support for more complex brace expansion features).

@boneskull
Copy link
Collaborator Author

@vweevers That depends what you mean by "better," I suppose. Node.js doesn't provide a "better" way than userland to build services out-of-the-box; but its net and httpx modules enable userland to build better services. While yours is a valid question, it's not the only question, and adding something to core should not hinge on the answer.

Another question to ask is "what's essential?" Recursive mkdir and rmdir--as well as argument parsing in #19--are essential for tools which operate on files, which is quite a lot of them.

So, glob support is essential, because:

  • It enables better cross-platform tooling. Node.js has struggled with portability in several areas, and contributing to a better story here seems like a clear win.
  • Tooling authors have to use glob/fast-glob/globby if they want more consistent cross-platform behavior anyway, which they do (or if they don't, they should).
  • Depending on implementation, it could be a performance improvement over pure-JS userland modules. I'm not necessarily suggesting to write it in C++, but it'd be an easier sell to a consumer than a native userland module.
  • Enables potential future integration with fs module. I don't want to speculate on what this'd look like, but it'd open up the opportunity.
  • Potential future integration with argument parsing #19.
  • Provide a "standard" or "blessed" way to handle slashes ("consistency", as @jonschlinkert mentions above)

@boneskull
Copy link
Collaborator Author

FWIW, Python's glob support is pretty barebones (it looks like they added globstar support in v3.5). I haven't used it myself, so don't know how Python tooling authors are getting along with it, but I'd sure like to know.

@coreyfarrell
Copy link
Member

IMO, in our implementation, I could definitely do without:

  • implicit dotfile support (a flag that essentially ignores the leading dot of a filename when matching)

What about modules like gulp where it might be desirable to match dot files (or files inside dot directories). For gulp.src it's probably easy enough to explicitly add patterns for dot-files and dot-directories which need to be copied but dot: true is easier than managing the extra glob patterns.

  • comments--what is the use case for this outside of a script or .gitignore file? multiline globs don't make much sense in-and-of themselves unless newlines are stripped anyway, right?

The one exception could be to work around the fact that JSON does not support comments. Completely ignoring any pattern matching /^#/ could allow comments in JSON glob array fields, for example:

{
  "files": [
    "**/*.js",
    "# reason for not publishing file.js",
    "!file.js"
  ]
}

I know this could be accomplished by pre-processing the array before passing to the matcher function but the idea here is consistency, for globs to eventually work the same everywhere. I don't see any value in supporting inline comments, for example "!file.js # reason for this ignore".

  • brace expansion ranges ({1..5}); too confusing in this context. in minimatch, this is equivalent to [1-5]. micromatch seems to have more granular control over the behavior. it veers into "shell expansion" territory. I'm not sure if this is needed at all, but I'm happy to be convinced otherwise

If this is supported should it automatically expand range as per the example at https://www.npmjs.com/package/micromatch#optionsexpandrange? Seems confusing that micromatch.makeRe('{10..15}') produces /^(?:[10-25]\.js)$/ by default instead of /^(?:(1[0-9]|2[0-5])\.js)$/.

@jonschlinkert
Copy link

it's probably easy enough to explicitly add patterns for dot-files and dot-directories which need to be copied but dot: true is easier than managing the extra glob patterns.

Agreed. For users, it's much more difficult to exclude unwanted dotfiles when they are included by default, than the other way around. Also, every globbing implementation I'm aware of excludes dotfiles by default. In bash, for example, you need to set dotglob to match hidden files.

It's also really common for users (or implementors) to use patterns like * to match all of the files in a directory, forgetting that some files are hidden (like the .DS_Store file in every directory on MacOS).

Seems confusing that micromatch.makeRe('{10..15}') produces /^(?:[10-25]\.js)$/ by default instead of /^(?:(1[0-9]|2[0-5])\.js)$/.

Agreed. I've been planning to change that to the default. Thanks for bringing it up.

Python's glob support is pretty barebones

That depends on what you mean by bare bones lol. It supports globstars, ranges like [1-5], and even environment variable expansion. As a side note, I did this on a branch on picomatch, after a discussion with the jest core team They support basic path variables, and it's hard to expand them correctly. It turned out to be really useful, and fast, since it only adds 5 or 6 lines of code to the parser.

@isaacs
Copy link

isaacs commented Sep 20, 2019

If we move forward on this, I'd suggest using node-glob's test suite, and picking a version of Bash to use as a reference implementation (node-glob uses Bash 4.4.), and re-implementing from first principles (perhaps by porting or importing the bash C implementation, though that might be really challenging). Globstar and extglob should be enabled by default, but it'd be nice to be able to turn them off.

The reason is that "globbing" has been extended and implemented in a variety of different ways, and no two implementations are going to be 100% identical. There is no specification that covers all the cases users expect. So, it's not strictly accurate that all "posix shells" implement "globbing" (at least for some values of "posix shell" and "globbing").

@coreyfarrell ** works on any Bash v4, but I think in the environment you tested, globstar was probably not on by default. shopt -s globstar should enable it.

@isaacs
Copy link

isaacs commented Sep 20, 2019

Also: note that ignore-file style globbing is somewhat different from shell style globbing (though there are similarities). See https://github.com/npm/ignore-walk

@boneskull
Copy link
Collaborator Author

We discussed this in the meeting today.

  1. Because there's no real standard to work off of--just "what other software does"--we're going to subject ourselves to a whole lot of potentially-difficult decision-making around what features we support out-of-the-box (much like in argument parsing #19).
  2. We'd have to provide an implementation in a way which would be extendable for tooling authors; the result should not be "take-it-or-leave-it"
  3. There are quite a few edge cases that need to be handled. This isn't so bad in-and-of itself, of course (see rimraf)

Looking at it holistically, the result is that those of us in the meeting felt like this is biting off more than we can chew. Personally, my excitement has dimmed to the point of not wanting to champion the effort (#19 will be taxing enough, and I do want to help see that one forward).

If anyone else here wants to take up the cause, nobody else here is stopping you--and I think we're happy to work with anybody who does want to move it forward--but it didn't sound like any of those in attendance have the requisite, uh, gumption.

@boneskull
Copy link
Collaborator Author

going to close this. somebody reopen if they want to champion.

@isaacs
Copy link

isaacs commented Jan 30, 2020

I expect that I'll have some time to pick this up and champion it after npm v7 and tap v15 both ship. Glob's next on my list. Probably near the end of 2020, or early 2021, so most likely it'll be a node v15 feature.

I believe that the goal of an implementation should be to match the behavior of some specific and recent version of Bash, as it's the most widely used thing that people glob with. (But only as a reference implementation -- my explorations in the past showed that the actual code in bash tends to intermingle globbing with other string parsing features in a way that is very performant for a shell, but very bad for reusability in a platform like Node. We probably don't want to support command or environment substitution, for example, but should make it straightforward to add in userland.)

@bcoe
Copy link

bcoe commented Jan 30, 2020

I'm going ahead and reopening this, since it sounds like @isaacs is a potential champion -- it's fine if things take a while 👍

@Delapouite
Copy link

A PR introducing fs.glob and fsPromises.glob has been merged: nodejs/node#51912

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

10 participants