Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cross-platform file system abstractions #10

Open
killercup opened this issue Feb 20, 2018 · 70 comments
Open

Cross-platform file system abstractions #10

killercup opened this issue Feb 20, 2018 · 70 comments
Labels
A-ergonomics Area: Ease of solving CLI related problems (e.g. filesystem interactions) C-tracking-issue Category: A tracking issue for an unstable feature

Comments

@killercup
Copy link
Collaborator

killercup commented Feb 20, 2018

(moderated summary by the WG)

@killercup's original post

In the first meeting, we talked a bit about the pain points of dealing with files and path in a cross-platform manner.

One idea is to create or improve crates that provide higher-level abstractions than std in this area.

@killercup
Copy link
Collaborator Author

killercup commented Feb 20, 2018

From the etherpad:

@killercup
Copy link
Collaborator Author

Maybe also relevant (also from the etherpad):

  • streaming
    • how to thread progress computation through chained readers
    • handling different archive formats (this may be specific to me but I bet others will hit it)
      • how to stream zip files (no crates.io crate supports this AFAIK)
      • how to compute progress percentage of decompressing files

@matthiasbeyer
Copy link
Member

I want to simply dump this here: I'm the author of imag and we have a rather nice abstraction for FS access... maybe you can learn from that. Our abstraction is rather specific for our use-case (each file has a TOML header and a "content" section), but the basic concept (a "store" which holds the files and they can then be borrowed from that store, so that concurrent access to one file is not possible) could be useful for other CLI apps.

More information about the Store can be found in its source or our (rather not-so-up-to-date, as we are basically a one-dev-project) documentation on the store. I'd say the source is a better source of information, though.

@sicking
Copy link

sicking commented Mar 2, 2018

One thing that we talked about at Mozilla which would be useful in Firefox, as well as possibly useful to expose to the web, would be an API for doing atomic file writes in "small" files.

As I understood it, the best way to do truly atomic writes efficiently is to copy the file, while inserting the desired modifications, into a new file. Once the new file has been written issue a rename to the original file name.

Alternatively, if the file is small enough and multiple modifications are expected, read the file contents into memory and each time that a modification should be done write out a new file and rename to the desired file destination.

The nice thing with this approach is that it can be done without complex journaling files, and without needing to call flush() which can be quite a perf bottleneck.

This was something that we were thinking of using for the numerous small configuration/database files that Firefox keeps.

I think something like this would fit well with Rust's focus on performance and safety.

I'm not sure that this is particularly CLI specific though. But since other filesystem stuff was discussed I figured I'd mention it.

@luser
Copy link

luser commented Mar 21, 2018

windows gotchas, like long file paths (which require extended-length paths aka "verbatim paths" as they're called in std)

For some reason I thought the APIs in std did this internally but I was wrong. :( I believe Python's standard library converts all paths to verbatim paths internally nowadays. This is unfortunate in that it's really easy to run into the 256 character limit in practice, but I'm not sure there's a way to square this with the zero copy philosophy behind things like OsStr and Path.

AFAIK it's OK to use verbatim paths in all Windows APIs, so perhaps a crate that provides a small wrapper type would be good enough? Something like VPath { inner: Cow<Path> }, and then VPath::new would take AsRef<Path>, and if it has the \\?\ prefix borrow it, if not prepend it to an owned PathBuf. The real trick would be enforcing that all your APIs take this path type, so that you can ensure you're only ever passing verbatim paths to OS APIs. Without having it in std that's going to be a bit of a pain.

(If someone writes such a crate I would also like it to have an easy to_wide method for getting a wchar_t*-compatible buffer to pass directly to Windows APIs using ffi. I've found OsStrExt::encode_wide clunky to use in practice.)

@luser
Copy link

luser commented Mar 21, 2018

absolute paths on Windows (UNC prefices)
https://github.com/notion-cli/notion/blob/master/crates/verbatim/src/lib.rs

FYI this link 404s now, the code seems to have moved to https://github.com/dherman/verbatim

@luser
Copy link

luser commented Mar 21, 2018

FYI this link 404s now, the code seems to have moved to https://github.com/dherman/verbatim

...and after actually reading a little, it sounds like that crate is about 90% of what I proposed in my previous comment. :)

@BurntSushi
Copy link

See also:

I would be interested to here more about how other ecosystems solve this. In particular, how do they deal with relative paths?

@XAMPPRocky
Copy link

@killercup I believe normalisation on happens on HFS+ on macOS, I don't believe APFS does normalisation on the mac.

@BurntSushi
Copy link

On Unicode normalization and HFS: BurntSushi/ripgrep#845

@epage
Copy link
Contributor

epage commented Mar 21, 2018

AFAIK it's OK to use verbatim paths in all Windows APIs, so perhaps a crate that provides a small wrapper type would be good enough?

My thought was to just start with a "normalize_path" crate/function that people can use regardless of what path API they are using, kind of like normalize-line-endings.

This crate would handle

  • removing extraneous /, ., etc
  • Resolve .. where possible (e.g. path/../file -> file)
  • Convert to verbatim paths

Then on top of this we could look into an "easy paths" api that is more like python's pathlib combined with easy_strings to help with the prototyper case (easy to reach for needed utilities, less concern for borrow checker at cost of memory or cpu time). This would call "normalize_path" on any untrusted input. I've been providing feedback on ergo_fs in hopes of going in this direction.

In particular, how do they deal with relative paths?

What is your concern with relative paths?

@BurntSushi
Copy link

BurntSushi commented Mar 21, 2018

What is your concern with relative paths?

From MSDN:

Because you cannot use the "\\?\" prefix with a relative path, relative paths are always limited to a total of MAX_PATH characters.

@Screwtapello
Copy link

Resolve .. where possible (e.g. path/../file -> file)

This is not generally possible in POSIX because of symlinks: if a/b/c is a symlink to a/d, then a/b/c/../foo is actually a/foo, not a/b/foo. Now that modern versions of Windows support symlinks too, I suspect that's also an issue there.

That said, Rust kind of has this already in the form of std::fs::canonicalize(). It's still kind of clunky, though, because it requires every path component to exist. Compare the Python pathlib Path.resolve() method: "If strict is False, the path is resolved as far as possible and any remainder is appended without checking whether it exists." That means you can normalise your input and output paths while parsing your command-line, without having to create them first.

@epage
Copy link
Contributor

epage commented Mar 21, 2018

I'm mixed on what I'd expect for that symlink scenario. Either way, I assume that with a strict=False, we would allow that symlink bug to go through.

But for pathlib, apparently, the symlink bug is always there.

I'm surprised though, I thought handling of .. was part of parsing the path and not from an explicit operation (resolve).

@Screwtapello
Copy link

But for pathlib, apparently, the symlink bug is always there.

Pathlib should actually be OK: in my example, as it resolves a/b/c/../foo it will read the symlink c producing a/d/../foo, so by the time it processes the .. it will get the right answer.

I thought handling of .. was part of parsing the path

In POSIX, the kernel just takes each path segment and hands it to the filesystem to resolve, and the filesystem physically records a .. entry on disk that points to the parent directory. Since filesystem is the single source of truth, getting the right answer requires actual disk I/O. :(

Plan9 behaves the way you expect, but then it doesn't have symlinks so this whole situation just isn't a problem there.

@soc
Copy link

soc commented Mar 23, 2018

One of the issues that would need to be addressed in a better way is PathBuf.join.

A method that both concatenates a path or completely replaces the existing one based on the input ... when did one ever have that use-case? I'd guess that pretty much every piece of code using join without first checking whether the argument is relative or absolute is subtly broken.

The problem is not that the method exists, but that the individual operations that this method combines do not exist as their own methods.

Considering this and the Windows-related issues, I think this makes a good case for having separate types for absolute and relative paths.

@epage
Copy link
Contributor

epage commented Mar 23, 2018

I've tried to do a quick and dirty summary of this. Please point out where I need to expand it!

I know on either users or reddit, I saw complaints about

Anyone know where these were so we can reach out to the people that have concerns? Or add your own thoughts on these topics?

@Screwtapello
Copy link

A method that both concatenates a path or completely replaces the existing one based on the input ... when did one ever have that use-case?

I'd expect that to be the majority use-case for dealing with user-specified paths.

  • I take an input file from the command-line, path/to/my/input
  • The input file contains a directive like src = ../some/data, or possibly src = /usr/share/myapp/data
  • My program does let src_path = input_path.parent().unwrap().join(input_data.src) and I get the right answer

The wrinkle in the above is unwrapping the result of .parent(), but as long as my program always makes input_path into an absolute path before working with it, that should be fine.

@i30817
Copy link

i30817 commented Mar 24, 2018

Join is useful to make absolute paths from relative paths, which in turn are useful for nearly all cfg files, for both sharing across OSes and moving cfg files. If you can guarantee some validations of the user made paths on a cfg (is relative, doesn't use any (not just the current) OS forbidden characters except '/' and closed quotes, len(cfgpath.parent().join(relativepath)) isn't larger than MAX_PATH_LENGTH in the current platform ) you can even assure it's somewhat OS portable.

Though i suppose it's better than all those preconditions are specified in a higher level cfg abstraction, join could still be be useful (in spite of being easy to blow up by passing a absolute path to the suffix).

It would have to deal with stuff like join("C:\", "relative/path") to be useful though (and the constructor with new("C:\relative/path"). Much C code assumes the current platform separator, even if it's not the only one when iterating over paths.

I think java solution here was to make 'path' iteration be at directory granularity? Or path is not iterable i don't remember.

@soc
Copy link

soc commented Mar 24, 2018

@Screwtapello

I'd expect that to be the majority use-case for dealing with user-specified paths.

I think it's highly dangerous that

path.join("foo/").join("bar")
path.join("foo").join("/bar")

do completely different things.

I don't think most people will expect that "joining" one path to an existing path can destroy their existing path.

I think a serious path implementation should clearly separate those operations, e. g.

AbsolutePath::join(&self, path: RelativePath)
AbsolutePath::replace(&self, path: AbsolutePath)
RelativePath::join(&self, path: RelativePath)

"joining" an absolute path should not even compile.

@vitiral
Copy link

vitiral commented Mar 24, 2018

I am just seeing this now.

First, I'd like to announce the release of 0.4.0 of path_abs which now uses "absolute" instead of "canonicalized" paths. This means that you can have a path with symlinks that may or may not exist, but PathAbs will always start with the "canonical" root directory. See the docs for more.

I'm going to just do a checklist of things from this thread and open issues that aren't covered:

File path handling

  • .., ., //not being auto- handled: this is handled in v0.4.0 by the PathAbs type (which all types other than PathArc depend on). Note that .. is handled semantically (a/b/../d is always a/d), the idea being that if you wanted to actually follow the symlink then you would use canonicalize().
  • relative paths are converted to absolute using env::current_dir and then resolving.
  • windows gotchas, like long file paths (which require extended-length paths aka "verbatim paths" as they're called in std): this should be handled by 0.4.0, but more testing is needed. Basically all path roots are forced canonical by calling canonicalize() on their root if they are not canonical.
  • absolute paths on Windows (UNC prefices): done in 0.4.0
  • guaranteed serialization/deserialization of paths(using stfu8)

Edit: I missed some

@vitiral
Copy link

vitiral commented Mar 24, 2018

Also, ergo_fs is related to this discussion.

@vitiral
Copy link

vitiral commented Mar 24, 2018

Edit: I moved my comment about weird join behavior to a path_abs issue instead:

vitiral/path_abs#12

@soc
Copy link

soc commented Mar 24, 2018

The full glory of handling Windows paths: https://googleprojectzero.blogspot.de/2016/02/the-definitive-guide-on-win32-to-nt.html

TL;DR: Depending on how low-level you want the API to be, you have to handle 7 different path types.

@vitiral
Copy link

vitiral commented Mar 24, 2018

@soc those should all be handled by path_abs here.

If you think one is missing please open a ticket!

@soc
Copy link

soc commented Mar 26, 2018

Yep. Managing code where the developer has control over the path and file names is my use case for the library I'm toying with. Such a library would be exactly what's needed for reading/writing/modifying configuration in a cross-platform compatible manner.

It's not meant to deal with all the insanities operating systems have invented. If that's your use case, use Rust's path.

Getting things right and reliable has its cost. In some cases it makes sense to pay this cost to get the reliability, in some cases it doesn't.

You shouldn't upfront the cost of this thing that may never happen to users.

The point is that these things do happen. Chrome had a security issue due to Windows' general path traversal craziness. If one of the largest IT corps with a highly skilled security department can't deal with paths, what's the chance some random developer can?

@vitiral
Copy link

vitiral commented Mar 26, 2018

@soc I think this conversation should be moved to a separate issue since I think it is a very small part of the other concerns here.

@soc
Copy link

soc commented Mar 26, 2018

@vitiral Thanks, agree on that.

@Screwtapello
Copy link

I've been thinking about path-handling this weekend, after I posted a question to Reddit and had a discussion with @vitiral about his path_abs crate. I've come up with a model that covers my needs and expectations, but I'm interested to hear whether it would suit other people too.

Motivation

For tools that take a path on the command-line and just use it immediately (think sort or grep), the friendliest approach is also the simplest: just take the command-line argument, pass it to the filesystem APIs, and any problem you encounter will be straightforward to diagnose.

For tools that take a path from a config file or a database, tools run as batch jobs, or tools that generate or manipulate paths, relative paths can make problems difficult to diagnose: You get a report saying "File not found: some/relative/path", you look in the place you thought the tool would look and the file is definitely present, so clearly the tool was looking somewhere else—but where? To avoid confusion, I want every tool I write to use complete paths, so that when I look at a log-line or error message I can tell exactly what it was looking at.

Design

I want to use "monotonic paths" as my standard in-memory representation of a filesystem path. Some definitions:

A path is a sequence of zero-or-more components (in the std::path::Component sense), which may be divided into a "head" and a "tail".

A path's head is:

  • (on Windows) an optional Prefix component, followed by
  • an optional RootDir component.

A path's tail is zero-or-more CurrentDir, ParentDir or Normal components.

A monotonic path is one whose tail contains only Normal components, no CurrentDir or (especially) ParentDir components.

A monotonic absolute path is a monotonic path whose head contains (on POSIX) a RootDir component or (on Windows) both a Prefix and a RootDir component (i.e. std::path::Path::is_absolute() would return true). By a happy coincidence, this also describes Windows' extended-length path syntax.

A monotonic relative path is a monotonic path whose head contains no components. A monotonic relative path can be blindly appended to another monotonic path without breaking its monoticity.

Example monotonic paths:

  • /etc/passwd
  • C:\Windows\System32
  • \\?\UNC\server\share\some\path
  • src/main.rs

Example non-monotonic paths:

  • \Program Files (x86)\ (has a RootDir but no Prefix)
  • .git/hooks/../config (contains a ParentDir component)
  • D:donut (has aPrefix but no RootDir)
  • /home/st/./.bashrc (contains a CurrentDir component)

Implementation

Making a path monotonic involves making the head monotonic, and making the tail monotonic.

Making the head monotonic is easy, since you can just pass it to std::fs::canonicalize().

Making the tail monotonic involves removing the CurrentDir components (done automatically by the Components iterable), and the ParentDir components.

To remove ParentDir components from the tail:

  • Find the first ParentDir component in the tail
  • If there aren't any, you're done!
  • Otherwise, if the ParentDir component is at the beginning of the tail, remove it
  • Otherwise, if the Normal component preceding a ParentDir component is not a symlink (paths that do not exist are by definition not symlinks) you can remove both
  • Otherwise, read the symlink target and splice it into the path you're evaluating
  • Repeat

You could go even further and resolve as many symlinks as possible instead of just the ones preceeding ParentDir components, like Python's pathlib.Path.resolve(). That would be more explicit, but no more correct and more expensive.

In Rust, I imagine there would be monotonic_path::Absolute and monotonic_path::Relative newtypes around std::path::PathBuf, with constructors that ensure/validate monotonicity, and type-safety guarantees that monotonic paths can only be extended by monotonic relative paths, not by arbitrary paths.

Alternatives

Just use whatever path you were given, as-is

As mentioned in the motivation, this can lead to confusion when a user comes up with a path in one context, then gives it to a program that uses it in a different context (a different time, a different host, a different working directory). Turning relative paths into fully-explicit paths makes it clearer what context the program is using.

If you get a relative path, just join it onto the current working directory

Relative paths may contain any number of .. segments. If you see a path like /var/lib/myapp/sessions/01/../02/../../common/backends/../../../lock, can you tell at a glance what file is really being accessed? Did you notice it was outside the myapp directory?

A bigger problem is that this approach doesn't work on Windows. If your current working directory is C:\apple\banana and you get a path like D:donut, you do not have enough information to interpret that path: it's relative to whatever the current working directory was, the last time it happened to be somewhere in the D: drive.

If you get a relative path, just use std::fs::canonicalize()

std::fs::canoncalize() requires that the given path actually exist in the filesystem. If you want to create a path, you want to canonicalize it before you create it, but you have to create it before you can canonicalize it.

Also, std::fs::canonicalize() always tells you the target of a symlink. If you want to talk about the symlink itself, a fully canonicalized path is not helpful.

Make paths monotonic by just deleting components without checking the filesystem

One way to remove the ParentDir components is just to delete each Normal/ParentDir pair from the path until no more remain. This is sometimes called normalization or cleaning, a process that is simple, fast and wrong: if the Normal component before the ParentDir happens to be a symlink, the ParentDir can take you somewhere entirely different. For example:

$ mkdir -p /tmp/apple/ant
$ mkdir -p /tmp/banana
$ ln -s ../apple/ant /tmp/banana/bat
$ ls /tmp/banana/bat/..
ant

That is, /tmp/banana/bat/.. is /tmp/apple, not /tmp/banana.

Versus the issues in the summary

  • .., .., // not being auto-handled
    • not sure what this means, but I don't see paths containing any of those string being issues for a program using monotonic paths
  • windows gotchas, like long file paths
    • any monotonic-path crate will have to make use of std::fs::canonicalize(), and the monotonic path rules already describe the requirements of Windows extended-length paths, so this should Just Work
  • absolute paths on Windows (UNC prefixes)
    • as above, this should Just Work
  • UTF-8 filenames on macOS
    • comparing filenames from different sources is out-of-scope for monotonic paths
  • atomic file writes
    • out of scope for monotonic paths, but I hope any monotonic path library has a pathlib-like API that includes a method like this
  • Path.join
    • since a monotonic path can only be joined by a relative path, this problem is solved too
  • "Simple" API for rapid prototyping
    • out of scope
  • format! with non-UTF-8 paths
    • out of scope

@soc
Copy link

soc commented May 28, 2018

I played with some of my own ideas in https://github.com/soc/paths-rs.

The focus is different though: I'm largely interested in being able to have paths (e. g. in config files) that can be read/used/written/moved across operating systems with the guarantee that a path that has been constructed is valid across all operating systems.

@deg4uss3r
Copy link

I discussed a but in the gitter about this, but a lot of times when I write a CLI with a config file, or a environmental variable on a unix/linux platform I run into the dreaded ~ or $HOME problem (e.g. in the config file something like cert=~/.certificates/ricky.pem.

I know that the shell would traditionally handle these expansions, but people (myself included) do put these shorthands in config files, variables, or use them fairly often with CLI arguments.

It would be really nice to have a wrapper crate that I can trust to handle all paths/expansions when dealing with a CLI program. I agree with above about the cross platfrom as well.

@soc
Copy link

soc commented Jun 2, 2018

Yes, that's exactly my goal. It's not implemented though, but the idea is to have special tokens like §HOME_DIR§ or §CONFIG_DIR§ that work across Linux/BSD/macOS/Windows (leveraging the directories crate).

These tokens are intended to exist explicitly in the serialized format (as strings in a config file) as well as in the memory representation and are only resolved when converting AbsolutePaths to Rust's Path/PathBuf.

@gdzx
Copy link

gdzx commented Jul 12, 2018

Just mentioning the following, hope that could be helpful:

  • the primary reason I came here is that I was just fooled by the behavior of Path::join with absolute paths even though I knew perfectly how it works, which for me confirms that this is incredibly dangerous!
  • Adding Path.normalize() method rust-lang/rust#47363 advanced a little bit on implementing path normalization like Python's os.path.normpath() and Go's filepath.Clean() (unfortunately the work was not completed),
  • Rob Pike's article about Getting Dot-Dot Right specific to Plan 9 has already been linked by @Screwtapello but ideas for a simple, cross-platform path handling API could be drawn from Go's standard library path/filepath (I never used it on Windows though, there might be limitations I don't know of; nevertheless, the source is pretty readable),
  • chown is not even a thing in std::fs, fortunately that's not the most unsafe thing to use right from libc,
  • even though this issue is about cross-platform abtractions, I quite like golang.org/x/sys/* packages that expose low-level OS primitives which remain accessible without too much hassle for each platform. Currently in Rust the nix crate implements this kind of things for Unix, but unfortunately, the documentation is severely lacking and its not really feature complete...

Thanks for your time.

@epage
Copy link
Contributor

epage commented Jul 12, 2018

ideas for a simple, cross-platform path handling API could be drawn from Go's standard library path/filepath

Could you be more specific of what lessons we should learn from go? I'd rather not assume incorrectly and miss the information you are trying to share.

@gdzx
Copy link

gdzx commented Jul 13, 2018

@epage

  • two distinct normalization methods:
    • filepath.Clean to normalize without touching the filesystem (which implies that "a/b/../c" is always "a/c"),
    • filepath.EvalSymlinks which evaluates symlinks in the filesystem and then calls filepath.Clean on the result ("a/b/../c" might not be "a/c"),
  • filepath.Base:
    • uses the "base" name terminology whereas Rust uses Path::file_name, indicating that "If it's the path of a directory, this is the directory name." which is a little bit strange,
    • filepath.Base corresponds to POSIX basename, thus returns ".." when called with ".." (note that GNU basename variant returns "" when called with "/" or "foo/", while POSIX basename would return respectively "/" and "foo"). In comparison, Path::file_name returns None when called with either ".." or "/",
  • filepath.Dir corresponds to POSIX dirname which returns "/" when called with "/", whereas Rust's Path::parent would return None,
  • filepath.Join:
    • joins paths without replacing when it encounters an absolute component, unlike Path::join,
    • allows to build path quickly with filepath.Join(e1, e2, ..., en) (maybe a Path::from a slice of components?),
  • filepath.Rel(base, target) returns the relative path from base such that Join(base, Rel(base, target)) == target ("equivalent" of Path::strip_prefix):
    • ("/a", "/a/b/c") -> "/b/c"
    • ("/a", "/b/c") -> "../b/c", but in this case, Rust's Path::strip_prefix would return None,
    • ("/a", "./b/c") -> None (can't be made relative to "/a" without looking up the current dir)
  • filepath.SplitList allows to handle PATH-like, : separated paths,
  • filepath.VolumeName equivalent? (I see a std::path::PrefixComponent struct but that doesn't seem straightforward),
  • paths are just strings in Go, but when manipulating directly OsString in Rust, we lack a lot of methods that are available for [u8] or String which can be annoying EDIT it seems that's already in discussion Ergonomics of String handling #26,
  • filepath.Match and filepath.Globs are included, not sure if they truly allow cross-platform use though

@epage
Copy link
Contributor

epage commented Jul 13, 2018

Thanks for that overview, that will be really helpful.

I do feel like to certain audiences, POSIX behavior would be surprising, so there is a trade off of being familiar to POSIX people or not be surprising to non-POSIX people (like if you iterate on Dir, you'll never terminate without an explicit base-case check).

paths are just strings in Go, but when manipulating directly OsString in Rust, w

Even Python recognized using strings is broken and now supports bytes as well :). Granted, we do need a good way of interacting with the native Path string type.

@epage
Copy link
Contributor

epage commented Jul 24, 2018

  • filepath.Join:
    • joins paths without replacing when it encounters an absolute component, unlike Path::join,

What is the uyse case for appending an absolute path to an absolute path?

I feel like the "replace" is meant to act like CWD handling. I could see having distinct and more semantically meaningful names like append and a merge

  • allows to build path quickly with filepath.Join(e1, e2, ..., en) (maybe a Path::from a slice of components?),

I think this is an area where documentation is needed to clear this up but you can use extend:

https://doc.rust-lang.org/std/path/struct.PathBuf.html#impl-Extend%3CP%3E

@gdzx
Copy link

gdzx commented Jul 24, 2018

What is the uyse case for appending an absolute path to an absolute path?

In a program to manage OS trees, I have to copy initramfs & vmlinuz from /etc/boot (in the container) to /etc/boot (in the host). With the current behavior, I need to have: const BOOT_PATH: &str = "etc/boot" and always manually prepend something.

That seems counter-intuitive with what I'm used to do in other languages, where I expect "/container" + "/etc/boot" = "/container/etc/boot", the way you would push strings in a buffer.

I feel like the "replace" is meant to act like CWD handling. I could see having distinct and more semantically meaningful names like append and a merge

cd allows to change from one directory to another, whereas with PathBuf you'are building a single path buffer. I agree the naming is not perfect.

I think this is an area where documentation is needed to clear this up but you can use extend:

extend requires to have the PathBuf already built, but you are right, this is already possible with let path: PathBuf = [r"C:\", "windows", "system32.dll"].iter().collect(); (it's actually one of the first examples in the docs that I totally missed...). Though the issue regarding absolute paths remain in this case.

@epage
Copy link
Contributor

epage commented Jul 25, 2018

That seems counter-intuitive with what I'm used to do in other languages, where I expect "/container" + "/etc/boot" = "/container/etc/boot", the way you would push strings in a buffer.

First, not all languages do that. I know of at least Python and C++ have the same behavior as Rust.

I suspect the difference is whether the API is meant to be a convenience over string manipulation or if it treats paths as first class objects.

cd allows to change from one directory to another, whereas with PathBuf you'are building a single path buffer. I agree the naming is not perfect.

I'm a little confused at how cd enters the picture. I was talking about how Rust/Python/C++'s behavior is like if you pass a relative or absolute path to a command to a program. If its relative, it gets append to the CWD. With these path APIs, you aren't just beholden to global state but can define what the paths might be relative to.

extend requires to have the PathBuf already built, but you are right, this is already possible with let path: PathBuf = [r"C:", "windows", "system32.dll"].iter().collect(); (it's actually one of the first examples in the docs that I totally missed...). Though the issue regarding absolute paths remain in this case.

Oh, right, that does make things more annoying. Definitely something to keep in mind when we get to the "simplified path API" (I plan to get to it after assert_cmd and cargo-tarball).

@gdzx
Copy link

gdzx commented Jul 25, 2018

First, not all languages do that. I know of at least Python and C++ have the same behavior as Rust.

I fail to see how replacing is a sensible default behavior but obviously that must be good if Python, C++ and Rust do that.

From a security point of view, if I accept a user provided path Q that must be restricted to a directory P, I can do:

if !strings.HasPrefix(Q, "/") {
    Q = "/" + Q
}
R = filepath.Join(P, filepath.Clean(Q))
  • P = "/foo", Q = "bar.txt" -> R = "/foo/bar.txt"
  • P = "/foo", Q = "/bar.txt" -> R = "/foo/bar.txt"
  • P = "/foo", Q = "/etc/passwd" -> R = "/foo/etc/passwd"
  • P = "/foo", Q = "../etc/passwd" -> R = "/foo/etc/passwd"

EDIT: Just looked at the static file serving example from Rocket. They do Path::new(P).join(Q) which is safe because they always validate the parameter Q quite restrictively in the impl FromSegments for PathBuf.

I'm a little confused at how cd enters the picture. I was talking about how Rust/Python/C++'s behavior is like if you pass a relative or absolute path to a command to a program. If its relative, it gets append to the CWD. With these path APIs, you aren't just beholden to global state but can define what the paths might be relative to.

I still don't understand then. Could you give me an example?

@gdzx
Copy link

gdzx commented Jul 25, 2018

This is a summary of the things I would absolutely need when working with paths in Rust, so I don't have to roll-out my own validation and sanitization methods to prevent path traversal, I don't have to be too far from what POSIX defines and I can work with normalized paths without syscalls.

Normalization with Pure Lexical Processing

Like fs::canonicalize without ever touching the filesystem.

path::normalize("/////var/lib/../../etc/mozilla/"); // --> /etc/mozilla

Non-destructive PathBuf::append

Like PathBuf::push but does not replace the stored path if the argument is absolute.

let mut path = PathBuf::from("/usr"); // --> /usr
path.append("/lib");                  // --> /usr/lib
path.push("/etc");                    // --> /etc

Non-destructive Path::concat

Same as Path::join but using PathBuf::append instead of PathBuf::push.

Path::new("/usr").concat("/lib"); // --> /usr/lib
Path::new("/usr").join("/lib");   // --> /lib

POSIX dirname and basename

man 3 basename

path		dirname		basename
/usr/lib	/usr		lib
/usr		/		usr
usr		.		usr
/		/		/
.		.		.
..		.		..

These methods allow chaining by always returning something sensible, contrary to Path::file_name and Path::parent (where you need to check for Option).

let config_paths = [
	"/boot/boot.cfg",
	"/etc/systemd/resolved.conf",
	"/etc/sysctl.d/sysctl.conf"
];

for p in config_paths.map(|p| Path::new(p)) {
	let target = Path::new("/media/backup").concat(p.base_name());
	// --> /media/backup/boot.cfg

	...
}

Path::relative_to and Path::absolute

Path::relative_to would allow to compute the relative path to a base path. These methods need the CWD, even though relative_to doesn't need it if Path::starts_with is true so there should be some other method to do it without looking up the CWD, by returning an Option for instance so there's no hidden cost).

@epage
Copy link
Contributor

epage commented Sep 12, 2018

There might be some useful notes for this in a soon-to-close RFC: rust-lang/rfcs#2188

@soc
Copy link

soc commented Sep 12, 2018

@gdouezangrard Thanks for this useful overview, I think I'll follow that approach in my paths crate.

@epage
Copy link
Contributor

epage commented Sep 15, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-ergonomics Area: Ease of solving CLI related problems (e.g. filesystem interactions) C-tracking-issue Category: A tracking issue for an unstable feature
Projects
None yet
Development

No branches or pull requests