Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding Path.normalize() method #47363

Closed
wants to merge 17 commits into from

Conversation

linclelinkpart5
Copy link

As per the discussion here: rust-lang/rfcs#2208

I found myself with a need to normalize file paths. By "normalize", I refer to the lexical cleanup of a path, following a similar algorithm as Python's os.path.normpath() and Go's filepath.Clean(). These in turn follow the methodology outlined in Rob Pike's paper: Lexical File Names in Plan 9.

I was able to write a method that works for my use case, and I felt it could be a helpful addition to stdlib! However, I am very new to both Rust and contributing to OSS projects. Please do let me know if there are any changes that could/should be made in those regards.

@rust-highfive
Copy link
Collaborator

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @joshtriplett (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@retep998
Copy link
Member

retep998 commented Jan 11, 2018

I don't see any tests for \\.\ device paths or \foo\bar rooted paths. There's a lot of tests missing from the Windows version.

@linclelinkpart5
Copy link
Author

@retep998: Those are good points, I can add those tests. What would be some other good Windows-specific tests to include?

I must note, I don't have access to a Windows dev box, so it's tricky to run these Windows-specific tests.

@retep998
Copy link
Member

At the very least, we should have every single example from the tables on https://googleprojectzero.blogspot.com/2016/02/the-definitive-guide-on-win32-to-nt.html

@joshtriplett
Copy link
Member

You should also include test cases for filenames containing spaces, tabs, newlines, and all three.

Please use raw strings for filenames containing backslashes, so that the test cases don't need to escape backslashes.

@retep998
Copy link
Member

@rust-lang/libs Do we want to use GetFullPathNameW on Windows or do we want to roll our own normalization, and if we do choose the latter, where exactly should we differ from the behavior of GetFullPathNameW?

@joshtriplett
Copy link
Member

joshtriplett commented Jan 11, 2018

@retep998 If we don't use GetFullPathNameW on Windows, perhaps we should use it as part of the test cases, by calling both it and Path.normalize() on every test case and failing if they differ. (Unless we have some reason to differ, in which case we should have specific test cases for that.)

@kennytm kennytm added T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jan 11, 2018
@linclelinkpart5
Copy link
Author

Does GetFullPathNameW use the underlying filesystem? My goal with this method was to perform purely lexical conversion of a path.

@joshtriplett
Copy link
Member

@ThatsGobbles According to the MSDN documentation, "This function does not verify that the resulting path and file name are valid, or that they see an existing file on the associated volume."

It does, however, look at the process's current working directory.

@linclelinkpart5
Copy link
Author

linclelinkpart5 commented Jan 12, 2018

I managed to get a Windows 10 VM up and running, and am trying out tests. The normalization does not seem to like the examples in The Definitive Guide to NT and Win32 Paths involving trailing dots and spaces at the end of a path (the DEF. . examples), mixing forward and backslashes in UNC or root local device paths (e.g. \\?\X:/ABC/DEF), or using COM ports. It seems that Path.components() isn't treating these as special cases?

@linclelinkpart5
Copy link
Author

It appears that this Windows shell function does what I'm aiming to do with my function!

Now the question would be, how do I get access to it in my Rust code?

@linclelinkpart5
Copy link
Author

As it turns out, I'm not really sure what I should be doing to get this working. I don't know anything about Windows C FFI in Rust, which is stopping me from using the PathCch methods (not to mention, I'd like for normalize to be a pure function that can't fail, and relying on a Windows API seems like it introduces the possibility of failure).

Are there any guides or reading I can look at to get my feet wet with this?

@joshtriplett
Copy link
Member

I'm going to defer to @retep998 here.

@carols10cents
Copy link
Member

Ping @retep998! Do you have any advice for @ThatsGobbles ?

@carols10cents carols10cents added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jan 22, 2018
@linclelinkpart5
Copy link
Author

Hello all,

I concur with the name change: clean is far more clear as to what the method is trying to do!

As for tests, I've been incorporating the examples from the linked page, but I'm running into issues with Windows paths. It seems that Path.components() trips up some of the edge cases, and since clean (formerly known as normalize) depends heavily on components, I'm not seeing correct results. Perhaps I'm missing something, either code- or understanding-wise?

/// ```
#[unstable(feature = "path normalize", issue = "47402")]
pub fn normalize(&self) -> PathBuf {
#[unstable(feature = "path cleaning", issue = "47402")]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feature names usually don't contain spaces.

tc!("/../../", "/");
} else {
// Drive-absolute paths.
tc!(r#"X:\ABC\DEF"#, r#"X:\ABC\DEF"#);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit pick. Unless your path contains a " you could just lose the #s

tc!(r"X:\ABC\DEF", r"X:\ABC\DEF");

I think this is much less noisy.

@BatmanAoD
Copy link
Member

Ping from triage! @kennytm and @joshtriplett , it looks like the author needs some help dealing with Windows paths. Can either of you recommend someone who might be able to help? Or should we tag a team? I've also asked on IRC.

@kennytm
Copy link
Member

kennytm commented Feb 13, 2018

@ThatsGobbles Could you list the test cases which tripped up components()?

@arazabishov
Copy link
Member

@ThatsGobbles Ping from the release team :) Did you have a chance to take a look into what @kennytm proposed to do?

@linclelinkpart5
Copy link
Author

Hi @kennytm and @arazabishov, apologies for the slow response, I found myself with little free time this week! Here's the current set of problem cases that I've discovered:

X:\ABC\DEF. .
X:DEF. .
\ABC\DEF. .
ABC\DEF. .
\\server\share\ABC. .
//server/share/ABC/DEF
\\.\X:\ABC\DEF. .
\\.\X:/ABC/DEF
\\.\X:\ABC\..\..\C:\
COM1
X:\COM1
X:COM1
valid\COM1
X:\notvalid\COM1
X:\COM1.blah
X:\COM1:blah
X:\COM1  .blah
\\.\X:\COM1
\\abc\xyz\COM1
\\?\X:\ABC\DEF. .
\\?\X:/ABC/DEF
\??\X:\ABC\DEF
\??\X:\
\??\X:
\??\X:\COM1
\??\X:\ABC\DEF. .
\??\X:/ABC/DEF
\??\X:\ABC\..\XYZ
\??\X:\ABC\..\..\..

All of the cases ending in '. .' failed due to those not getting normalized away. Most of the UNC paths did not trip up .components() per say, but when re-joining, there would be an unexpected trailing slash.

@pietroalbini pietroalbini added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Mar 5, 2018
@pietroalbini
Copy link
Member

Ping from the release team! @kennytm the author seems to be still stuck, but replied to your request for information. Can you (or @retep998) help them out?

@kennytm kennytm self-assigned this Mar 6, 2018
@kennytm
Copy link
Member

kennytm commented Mar 10, 2018

@ThatsGobbles Hi sorry for long delay. I'm not totally familiar with Windows path, so perhaps @retep998 can give some more mentoring. (The failing test cases are in the source file already)

  1. Cases ending with . ..

    I think this components() should still return "DEF. ." as is, meaning the removal of the dots and spaces should be done inside the clean() function. You may want to add a sys::path::clean_file_name function which will remove . and from the last path component.

    // Pseudo-code. OsStr::trim_right does not exist, implement it yourself.
    #[cfg(windows)]
    pub fn clean_file_name(f: &OsStr) -> &OsStr {
        f.trim_right(&['.', ' '])
    }
    #[cfg(not(windows))]
    pub fn clean_file_name(f: &OsStr) -> &OsStr {
        f
    }

    You also need to add a test to ensure trailing dots and slashes in the middle are not trimmed out.

    tc!("X:\ABC...\DEF...", "X:\ABC...\DEF");
    tc!("X:\ABC...\DEF...\", "X:\ABC...\DEF...\");

    The components() iterator will not recognize the trailing slash. You might need to manually check self.inner.ends_with(MAIN_SEPARATOR).

  2. Prefix with / i.e. //server/share/ABC/DEF.

    The Project Zero blog suggests replacing all / with \ as the first step, which we will not be doing due to immutability 🙂. However, we may modify sys::path::parse_prefix to recognize / as the same as \.

    We need to add an argument, fn parse_prefix(path: &OsStr, is_cleaning: bool) -> Option<Prefix> to restrict the new behavior within the clean() function. This means you have to construct the Components instance yourself, instead of calling the public self.components()

  3. UNC with middle /

    Instead of always pushing item.as_os_str(), you may need to special-case it for UNC prefixes, so that the prefix \\server\share would be pushed separately as two components. This allows us to construct a \\server\share even if the source was \\server/share.

  4. COM1.

    You'll need to hard-code the reserved filenames.

    #[cfg(windows)]
    fn clean_reserved_path(filename: &OsStr) -> Option<PathBuf> {
        if let Some(capture) = filename.capture(r"^(COM[0-9¹²³]|…)(?:[:. ]*[:.].*)?$") {
           Some(r"\\.\" + capture.get(1))
        } else {
           None
        }
    }
    #[cfg(not(windows))]
    fn clean_reserved_path(_: &OsStr) -> Option<PathBuf> { 
        None
    }

    In your clean(), if you find the prefix is Drive Absolute, Drive Relative or Relative, you need to call clean_reserved_path() on the filename and return it directly if it is Some.

    match prefix {
        None | Some(Disk(_)) => {
            if let Some(cleaned) = sys::path::clean_reserved_path(file_name) {
                return cleaned;
            }
        }
        _ => {}
    }
  5. Trailing slashes

    I think normalizing \\foo into \\foo\ and \\foo\bar into \\foo\bar\ is OK.

@shepmaster shepmaster added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Mar 18, 2018
@shepmaster
Copy link
Member

Ping from triage, @ThatsGobbles ! You've got some feedback from @kennytm; will you be able to address it in the near future?

@pietroalbini
Copy link
Member

Thank you so much for this PR @ThatsGobbles! Unfortunately, we haven't heard from you in a while, so I'm closing this to keep the list of PRs clean.

If you have time to work on this again please reopen the pull request though! We will be happy to review the new changes.

@shepmaster shepmaster added S-inactive Status: Inactive and waiting on the author. This is often applied to closed PRs. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Mar 26, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-inactive Status: Inactive and waiting on the author. This is often applied to closed PRs. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.