Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

glob: more flexible than walkDirRec #15598

Closed
wants to merge 15 commits into from

Conversation

timotheecour
Copy link
Member

@timotheecour timotheecour commented Oct 16, 2020

  • BFS and DFS traversal
  • optional callback to choose whether to recurse down a directory; this can make a large difference by skipping early rather than filter the yielded elements
  • optional callback to sort immediate children of a directory; this allows the intersting property of iteration in global sorted order (both BFS and DFS) while only requiring O(n) memory where n = largest nb immediate children of any directory; we don't need to sort all (recursive) entries
  • ~3X faster than unix find (not sure why); I tested on a dir with 500k recursive entries (filesystem was hot, ie after a coldstart)
  • optional epilogue option: each dir is yielded both before all its children and right after, which simplifies aggregate processing, eg to generate aggregate statistics per directory (eg: recursive number of files rooted in each dir)
  • see tests tests/stdlib/tglobs.nim

links

limitation

@timotheecour timotheecour marked this pull request as ready for review October 16, 2020 08:52
@ghost
Copy link

ghost commented Oct 16, 2020

Is there any reason that this module should be in stdlib? Maybe fusion is a better choice?

Also, did you see https://github.com/citycide/glob? It offers a complete cross-platform glob solution too

@timotheecour
Copy link
Member Author

timotheecour commented Oct 16, 2020

Is there any reason that this module should be in stdlib? Maybe fusion is a better choice?

I'm open but wanted to explore std as an option first given that its IMO general purpose enough for std (eg kochdocs relies on it and could be useful for tests too).

Also, did you see citycide/glob? It offers a complete cross-platform glob solution too

I'm quite familiar with it (I've contributed via several PR's and design discussions / bug reports). I use pkg/glob a lot, but I've written this to address several shortcomings in pkg/glob, see following points:

simpler thanks to separation of concern

this PR's glob delegates all custom iteration logic via callbacks, so has much reduced complexity as it doesn't try to do any file pattern matching, regex handling etc. These can be handled by another library that builds on top of glob, eg by instantiating a yield filter to allow handling patterns like a/**/*.nim or regex patterns. pkg/glob itself could potentially do that too, by calling glob and handling the rest /cc @citycide, thus gaining support for all the features introduced in this PR (see below).

What it does concern itself is what's impossible to do on calling side, such as sorting and controlling iteration (BFS vs DFS). Doing it on calling side would otherwise require storing all elements in an array, which would be probihitive / inefficient.

return results in natural tree order

pkg/glob doesn't return in natural tree order:

import pkg/glob
for path,kind in walkGlobKinds("./**", options=defaultGlobOptions + {GlobOption.Directories}): echo path

can return:

a
b # bad: a should come right before its children
a/b1
a/b2
a/b1/c1

instead of:

a
a/b1
a/b1/c1
a/b2
b

which is what tools like unix find or python3 glob or this PR do

import glob
for name in glob.glob('./**', recursive = True): print(name)

BFS vs DFS

DFS search is the default, but BFS is also supported, which is useful in some applications.
See also the bracketed option ..., dir(begin), dir/children..., dir(end), ... which is useful for compute aggregate data on directories (eg nb files / total size). This would be cumbersome to do on caller side.

sorting

this PR allows returning in global sorted order without requiring to fit all entries in memory. See details in PR.

@haltcase
Copy link
Contributor

haltcase commented Oct 16, 2020

I'd be happy for my code to be subsumed by a std lib solution, but if it's decided that a package is a better place I would also be glad to add additional collaborators to https://github.com/citycide/glob (and in fact @timotheecour already has that access) for these changes to be discussed and implemented there — without my sole power over it.

Edit: I could also move my glob repo into an organization or transfer it to an appropriate existing one that might already have people willing to help maintain it.

@Araq
Copy link
Member

Araq commented Oct 19, 2020

Not acceptable for the following reasons:

  • "glob" is a terrible name that doesn't mean anything to newcomers.
  • The standard library already offers acceptable means to accomplish the same. Yes, yes, it doesn't support DFT and no callback hooks etc. So what, there is the "glob" package for these cases or you create your own "glob" package and see how many use it or you write a tiny bit of custom code when the need arises.

@Araq Araq closed this Oct 19, 2020
@timotheecour
Copy link
Member Author

timotheecour commented Oct 20, 2020

ok so can I migrate this to a new fusion PR?
fusion is intended to be "batteries included", broader in scope than stdlib and this is a well defined/scoped API that is of general enough use. kochdocs itself uses this API (or rather a prior version of it, see walkDirRecFilter), and AFAIK fusion can't import pkg/glob so wouldn't be usable within fusion.
More importantly, I've already outlined how this PR differs from pkg/glob.

"glob" is a terrible name that doesn't mean anything to newcomers.

happy to change name

@timotheecour
Copy link
Member Author

timotheecour commented Oct 22, 2020

@Araq This PR was closed prematurely, can we re-open? Happy to make whatever improvements are needed for stdlib inclusion.

here's another recent example where this PR comes handy: #15612

this PR takes care of the nasty details in an efficient (single yield statement) and generic way (callbacks) so client code doesn't have to keep re-inventing this functionality and fall in the common traps:

# main.nim
import std/[globs,os,sugar]
for ai in glob(".", relative = true, sortCmp = (a,b) => cmp(a.path, b.path)):
  if ai.kind == pcFile: echo ai.path

=> ./main | wc -l is 25x faster than nimgrep -r --filenames . | wc -l (for 63723 files)

note that multithreading could be maybe added in future for further performance gains, the details of which should be done once in a library instead of re-implemented (potentialy poorly) in all clients. Likely one can do better than However this increases memory consumption in multi-thread mode since search results need to be stored in memory until contiguous sequence of files is received from thread workers

The order in which files appear is fixed now: it's alphabetic sorting by default (unless --sortTime is specified). I want the ability to expect that files will appear in stable order any time to memorize where should I look to; while Linux directory walking yields you files and directories non-deterministically. It's especially important with threads that would mess up the order if they were given the chance. (However this increases memory consumption in multi-thread mode since search results need to be stored in memory until contiguous sequence of files is received from thread workers)

@Araq
Copy link
Member

Araq commented Oct 22, 2020

If the iterator takes a callback you might as well turn it into a proc then that takes a callback. Besides, this is what we have "fusion" for.

@timotheecour
Copy link
Member Author

=> nim-lang/fusion#32

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add walkDirRecFilter recursive glob with follow filter (has full implementation)
3 participants