Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

printf compatibility #5783

Merged
merged 3 commits into from
Jan 10, 2024
Merged

Conversation

samueltardieu
Copy link
Contributor

@samueltardieu samueltardieu commented Jan 4, 2024

This makes printf more compatible with its GNU coreutils counterpart.

  • format: %c prints the first character of a string
  • format: parse as many characters as possible in numbers and make the error messages more compatible with GNU coreutils

@samueltardieu samueltardieu marked this pull request as draft January 4, 2024 21:22
@samueltardieu samueltardieu marked this pull request as ready for review January 4, 2024 22:34
@samueltardieu
Copy link
Contributor Author

samueltardieu commented Jan 4, 2024

About the organization

The split between the printf application and core functionality seems weird sometimes, as error messages are emitted from the core library. Maybe it would make sense if the parsing functionalities were either brought back into the printf application or returned Result objects in order to be able to format error messages in the printf application specifically.

For example, printf uses ‘XXX’ quotes which AFAICT were not present in the core library; emitting the message from there (uucore::features::format::argument) forces to use those quotes in uucore.

Edit: as noted by @tertsdiepraam the quote issue was an oversight on my part

About fuzzing

The fuzzing now often fails because too many file descriptors are opened. I don't know if there is a descriptor leak in the fuzzing framework itself, or if this is due to the launch of the external GNU coreutils commands.

Edit: leak fixed in #5787

@samueltardieu samueltardieu force-pushed the printf-compatibility branch 3 times, most recently from 7b9a52e to 22e7380 Compare January 4, 2024 22:51
@tertsdiepraam
Copy link
Member

Re: quotes. These probably depend on locale. We chose to follow the C locale until we have proper locale support, which uses '.

@samueltardieu
Copy link
Contributor Author

Re: quotes. These probably depend on locale. We chose to follow the C locale until we have proper locale support, which uses '.

Interesting. I only had set LC_NUMERIC=C, but by setting all I get different results in the fuzzer but not in the tests. I'll fix the quotes.

@samueltardieu samueltardieu marked this pull request as draft January 5, 2024 08:17
@samueltardieu samueltardieu force-pushed the printf-compatibility branch 2 times, most recently from 5e35d02 to ecfba2a Compare January 5, 2024 08:22
@samueltardieu samueltardieu marked this pull request as ready for review January 5, 2024 08:22
@samueltardieu
Copy link
Contributor Author

Fixed quotes

@samueltardieu samueltardieu marked this pull request as draft January 5, 2024 08:40
Copy link

github-actions bot commented Jan 5, 2024

GNU testsuite comparison:

Skip an intermittent issue tests/tail/inotify-dir-recreate (fails in this run but passes in the 'main' branch)

@samueltardieu samueltardieu marked this pull request as ready for review January 5, 2024 08:59
@samueltardieu
Copy link
Contributor Author

I've restructured the code to make names more explicit, and added uucode tests to the partial parsing function.

@samueltardieu samueltardieu force-pushed the printf-compatibility branch 3 times, most recently from 8bbf8f1 to 54b87f4 Compare January 5, 2024 09:46
Copy link
Member

@tertsdiepraam tertsdiepraam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi! Thanks! I think this is mostly correct, but I think it should be implemented slightly differently. A problem with this implementation is that the base is lost. Here's GNU:

> printf %d 010b
printf: ‘010b’: value not completely converted
8

So I think this is what should happen:

  1. Remove the base prefix to determine the base.
  2. Get the digits from the front.
  3. If anything is left, print the error.
  4. Convert the digits with from_str_radix.

This will also prevent us from trying to parse the same value multiple times, which feels a bit strange and requires us to keep both passes in sync.

Does that all make sense?

@samueltardieu
Copy link
Contributor Author

Hi! Thanks! I think this is mostly correct, but I think it should be implemented slightly differently. A problem with this implementation is that the base is lost. Here's GNU:

> printf %d 010b
printf: ‘010b’: value not completely converted
8

TIL I learned that coreutils inherited the "octal prefix mistake" from C.

So I think this is what should happen:

1. Remove the base prefix to determine the base.
2. Get the digits from the front.
3. If anything is left, print the error.
4. Convert the digits with `from_str_radix`.

Note that:

  • The "-" prefix must be extracted first as it happens earlier.
  • It has to account that some bases (dec/hex) can have a fractional part while others (bin/oct) cannot.

The code I implemented only deals with decimal numbers, as those are more complicated to parse (only one "-" optional prefix allowed, only one optional "." allowed, "-inf", "inf", "nan" — although those three aren't tied to any specific base they don't accept a base prefix).

This will also prevent us from trying to parse the same value multiple times, which feels a bit strange and requires us to keep both passes in sync.

Does that all make sense?

Yes it does. I see two alternatives:

  • Keep the parser I wrote for the decimal case. It is only called when the number could not be parsed as-is, so those situations can be considered unlikely. Handle other bases numbers differently.
  • Write a parser accomodating all cases and return the parsed value as well as the rest of the string if it cannot be parsed.

The second one makes more sense to me, I'll see what I can do and set this PR as draft in the meantime.

Thanks for your feedback.

@samueltardieu samueltardieu marked this pull request as draft January 8, 2024 09:32
@tertsdiepraam
Copy link
Member

TIL I learned that coreutils inherited the "octal prefix mistake" from C.

Haha yeah and now we're stuck with it 😄

Write a parser accomodating all cases and return the parsed value as well as the rest of the string if it cannot be parsed.

That sounds great! Thank you!

@samueltardieu samueltardieu marked this pull request as ready for review January 8, 2024 14:18
@samueltardieu
Copy link
Contributor Author

I have implemented a new number parser which can also cope with hexadecimal floats. I have put it in another module as to differentiate between the part which emits error messages and the one which only does some processing.

@samueltardieu samueltardieu force-pushed the printf-compatibility branch 2 times, most recently from d5b83d4 to b7e6488 Compare January 8, 2024 14:30
@samueltardieu
Copy link
Contributor Author

samueltardieu commented Jan 8, 2024

As far as I know, the only remaining issue before we can activate the printf fuzzer (in matching stderr mode) is that double quotes are escaped when printed by uutils while they are not when printed by GNU coreutils. And also \c (suppress further output) doesn't seem to be honored properly.

@samueltardieu
Copy link
Contributor Author

The latest push just adds a new test for the error message corresponding to a larger single character:

 $ target/debug/coreutils printf %d "'abc"
printf: warning: bc: character(s) following character constant have been ignored
97%

Copy link

github-actions bot commented Jan 8, 2024

GNU testsuite comparison:

Skip an intermittent issue tests/tail/inotify-dir-recreate (fails in this run but passes in the 'main' branch)

@samueltardieu
Copy link
Contributor Author

The failing chown test seems unrelated to those changes and fails for me the same way locally with and without this patch.

Copy link
Member

@tertsdiepraam tertsdiepraam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! This is great stuff! I have some small suggestions, but this could also be merged as is in my opinion.

src/uucore/src/lib/features/format/argument.rs Outdated Show resolved Hide resolved
Comment on lines +114 to +124
ParseError::PartialMatch(v, rest) => {
if input.starts_with('\'') {
show_warning!(
"{}: character(s) following character constant have been ignored",
&rest,
);
} else {
show_error!("{}: value not completely converted", input.quote());
}
v
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it makes sense to split these cases into two enum variants?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I experimented with it yesterday but went back: this is a partial match after all, just like the others. The fact that the higher layers choose to print a different message has little to do with the parsing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That does make sense, but they could be qualified as a numeric partial match and a character partial match, which are two quite different operations. I'm mostly suggesting it because it would save us from storing but discarding rest. I'm happy with this though, so let's leave it as it is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that rest is only a &str, no copy is ever made of the character data. I think the cost is insignificant.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah I didn't mean a performance overhead, just an additional thing to pass around

}
let (mut index, mut int, mut point, mut frac, mut prec, mut ended) =
(0, 0u64, false, 0u64, 0, false);
for c in rest.chars() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of things about this loop:

  • It might be easier to use bytes instead of chars because we only care about the digits which are in ascii range.
  • I think it might be nicer to split this loop into before and after the integral part, so encountering a . will break the first loop and will go to the second. The current version is a bit hard to follow with all the mutable variables that keep track of the state (though I can't find any problems with it :) )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About chars / bytes : sure, but we might encounter a non-ASCII char in the process and we have to deal with it atomically.

About separating the loops, let me check if it makes things simpler.

Thanks for the review!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but we might encounter a non-ASCII

I think we won't split on that if we encounter it, because it won't be a digit or a point or something else we can use. I think it should be safe.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To expand on this a bit. The coreutils are weird regarding utf8. The GNU versions usually do not care at all whether a string is valid. The "correct" (i.e. compatible) solution is usually to use bytes throughout and avoid using str/String altogether. Then for printing in error messages we do a to_string_lossy or something like that. It's a bit messy unfortunately.

For example, I bet that printf will happily accept something that starts with digits and then has some invalid utf8. I can trigger this in fish:

# GNU
> printf %d 10\xf0
printf: ‘10\360’: value not completely converted
10
# uutils
> printf %d 10\xf0
error: invalid UTF-8 was detected in one or more arguments

Usage: target/debug/coreutils printf FORMATSTRING [ARGUMENT]...
       target/debug/coreutils printf FORMAT [ARGUMENT]...
       target/debug/coreutils printf OPTION

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tertsdiepraam Breaking the loop in two parts has indeed decreased the complexity a lot. It also made a better structure to add inline comments. Thanks for the suggestion!

Copy link
Contributor Author

@samueltardieu samueltardieu Jan 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To expand on this a bit. The coreutils are weird regarding utf8. The GNU versions usually do not care at all whether a string is valid. The "correct" (i.e. compatible) solution is usually to use bytes throughout and avoid using str/String altogether. Then for printing in error messages we do a to_string_lossy or something like that. It's a bit messy unfortunately.

Agreed, but it comes back to the previous question: do we want to be compatible with the GNU version even when we go out of the scope of its intended usage, or is it enough to be compatible with the GNU version for its intended usage, and then to be more correct outside of this usage?

Anyways, as far as this parser is concerned, given that it receives a &str as input, I suggest to keep char which is more readable and should infer no performance penalty, and switch to u8 if at some point you decide to manipulate &[u8] inside of printf for example. The change would be easy and feel natural then.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to be compatible with the GNU version even when we go out of the scope of its intended usage, or is it enough to be compatible with the GNU version for its intended usage, and then to be more correct outside of this usage?

So far, the answer has been that we do indeed want to be compatible, but of course the intended usage has more priority. It's difficult to tell what even is the intended usage at this point, because the coreutils have accumulated so much functionality. I think the strength of this particular project lies in the compatibility, even though I often wish I could change (i.e. "fix") the behaviour. But I'm always happy to discuss specific cases :)

@samueltardieu samueltardieu force-pushed the printf-compatibility branch 2 times, most recently from 527b015 to d14c464 Compare January 9, 2024 11:13
@samueltardieu
Copy link
Contributor Author

Btw, are different parameters used for configuring Clippy? The windows running requires that I had a #[allow(clippy::cognitive_complexity)] to the parse function, while I cannot reproduce this on Linux.

Copy link

github-actions bot commented Jan 9, 2024

GNU testsuite comparison:

Skip an intermittent issue tests/tail/inotify-dir-recreate (fails in this run but passes in the 'main' branch)

@samueltardieu
Copy link
Contributor Author

The failing macos/x86_64 check seems unrelated.

Copy link
Member

@tertsdiepraam tertsdiepraam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Final few comments I think, just some super minor things. This is excellent and I want to merge it as soon as possible.

src/uucore/src/lib/features/format/num_parser.rs Outdated Show resolved Hide resolved
The parser can parse integral and floating point numbers as expected by
the coreutils `printf` command.
The error messages are more compliant with GNU coreutils.
Also, floating hexadecimal numbers are now supported in
`printf`.
Copy link

GNU testsuite comparison:

Skipping an intermittent issue tests/tail/inotify-dir-recreate (passes in this run but fails in the 'main' branch)

@tertsdiepraam tertsdiepraam merged commit 0071442 into uutils:main Jan 10, 2024
57 of 58 checks passed
@tertsdiepraam
Copy link
Member

Great! Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants