-
-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
printf compatibility #5783
printf compatibility #5783
Conversation
04a662a
to
a7897d5
Compare
About the organization
Edit: as noted by @tertsdiepraam the quote issue was an oversight on my part About fuzzing
Edit: leak fixed in #5787 |
7b9a52e
to
22e7380
Compare
Re: quotes. These probably depend on locale. We chose to follow the C locale until we have proper locale support, which uses |
Interesting. I only had set |
5e35d02
to
ecfba2a
Compare
Fixed quotes |
GNU testsuite comparison:
|
ecfba2a
to
9f34735
Compare
I've restructured the code to make names more explicit, and added uucode tests to the partial parsing function. |
8bbf8f1
to
54b87f4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi! Thanks! I think this is mostly correct, but I think it should be implemented slightly differently. A problem with this implementation is that the base is lost. Here's GNU:
> printf %d 010b
printf: ‘010b’: value not completely converted
8
So I think this is what should happen:
- Remove the base prefix to determine the base.
- Get the digits from the front.
- If anything is left, print the error.
- Convert the digits with
from_str_radix
.
This will also prevent us from trying to parse the same value multiple times, which feels a bit strange and requires us to keep both passes in sync.
Does that all make sense?
TIL I learned that coreutils inherited the "octal prefix mistake" from C.
Note that:
The code I implemented only deals with decimal numbers, as those are more complicated to parse (only one "-" optional prefix allowed, only one optional "." allowed, "-inf", "inf", "nan" — although those three aren't tied to any specific base they don't accept a base prefix).
Yes it does. I see two alternatives:
The second one makes more sense to me, I'll see what I can do and set this PR as draft in the meantime. Thanks for your feedback. |
Haha yeah and now we're stuck with it 😄
That sounds great! Thank you! |
54b87f4
to
61309c1
Compare
I have implemented a new number parser which can also cope with hexadecimal floats. I have put it in another module as to differentiate between the part which emits error messages and the one which only does some processing. |
d5b83d4
to
b7e6488
Compare
As far as I know, the only remaining issue before we can activate the printf fuzzer (in matching stderr mode) is that double quotes are escaped when printed by uutils while they are not when printed by GNU coreutils. And also |
b7e6488
to
a6f527c
Compare
The latest push just adds a new test for the error message corresponding to a larger single character: $ target/debug/coreutils printf %d "'abc"
printf: warning: bc: character(s) following character constant have been ignored
97% |
GNU testsuite comparison:
|
The failing chown test seems unrelated to those changes and fails for me the same way locally with and without this patch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice! This is great stuff! I have some small suggestions, but this could also be merged as is in my opinion.
ParseError::PartialMatch(v, rest) => { | ||
if input.starts_with('\'') { | ||
show_warning!( | ||
"{}: character(s) following character constant have been ignored", | ||
&rest, | ||
); | ||
} else { | ||
show_error!("{}: value not completely converted", input.quote()); | ||
} | ||
v | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it makes sense to split these cases into two enum variants?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I experimented with it yesterday but went back: this is a partial match after all, just like the others. The fact that the higher layers choose to print a different message has little to do with the parsing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That does make sense, but they could be qualified as a numeric partial match and a character partial match, which are two quite different operations. I'm mostly suggesting it because it would save us from storing but discarding rest
. I'm happy with this though, so let's leave it as it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that rest
is only a &str
, no copy is ever made of the character data. I think the cost is insignificant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah I didn't mean a performance overhead, just an additional thing to pass around
} | ||
let (mut index, mut int, mut point, mut frac, mut prec, mut ended) = | ||
(0, 0u64, false, 0u64, 0, false); | ||
for c in rest.chars() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of things about this loop:
- It might be easier to use bytes instead of chars because we only care about the digits which are in ascii range.
- I think it might be nicer to split this loop into before and after the integral part, so encountering a
.
will break the first loop and will go to the second. The current version is a bit hard to follow with all the mutable variables that keep track of the state (though I can't find any problems with it :) )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About chars / bytes : sure, but we might encounter a non-ASCII char in the process and we have to deal with it atomically.
About separating the loops, let me check if it makes things simpler.
Thanks for the review!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but we might encounter a non-ASCII
I think we won't split on that if we encounter it, because it won't be a digit or a point or something else we can use. I think it should be safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To expand on this a bit. The coreutils are weird regarding utf8. The GNU versions usually do not care at all whether a string is valid. The "correct" (i.e. compatible) solution is usually to use bytes throughout and avoid using str/String
altogether. Then for printing in error messages we do a to_string_lossy
or something like that. It's a bit messy unfortunately.
For example, I bet that printf
will happily accept something that starts with digits and then has some invalid utf8. I can trigger this in fish
:
# GNU
> printf %d 10\xf0
printf: ‘10\360’: value not completely converted
10
# uutils
> printf %d 10\xf0
error: invalid UTF-8 was detected in one or more arguments
Usage: target/debug/coreutils printf FORMATSTRING [ARGUMENT]...
target/debug/coreutils printf FORMAT [ARGUMENT]...
target/debug/coreutils printf OPTION
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tertsdiepraam Breaking the loop in two parts has indeed decreased the complexity a lot. It also made a better structure to add inline comments. Thanks for the suggestion!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To expand on this a bit. The coreutils are weird regarding utf8. The GNU versions usually do not care at all whether a string is valid. The "correct" (i.e. compatible) solution is usually to use bytes throughout and avoid using
str/String
altogether. Then for printing in error messages we do ato_string_lossy
or something like that. It's a bit messy unfortunately.
Agreed, but it comes back to the previous question: do we want to be compatible with the GNU version even when we go out of the scope of its intended usage, or is it enough to be compatible with the GNU version for its intended usage, and then to be more correct outside of this usage?
Anyways, as far as this parser is concerned, given that it receives a &str
as input, I suggest to keep char
which is more readable and should infer no performance penalty, and switch to u8
if at some point you decide to manipulate &[u8]
inside of printf
for example. The change would be easy and feel natural then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we want to be compatible with the GNU version even when we go out of the scope of its intended usage, or is it enough to be compatible with the GNU version for its intended usage, and then to be more correct outside of this usage?
So far, the answer has been that we do indeed want to be compatible, but of course the intended usage has more priority. It's difficult to tell what even is the intended usage at this point, because the coreutils have accumulated so much functionality. I think the strength of this particular project lies in the compatibility, even though I often wish I could change (i.e. "fix") the behaviour. But I'm always happy to discuss specific cases :)
527b015
to
d14c464
Compare
Btw, are different parameters used for configuring Clippy? The windows running requires that I had a |
GNU testsuite comparison:
|
d14c464
to
1ec5b85
Compare
The failing macos/x86_64 check seems unrelated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Final few comments I think, just some super minor things. This is excellent and I want to merge it as soon as possible.
The parser can parse integral and floating point numbers as expected by the coreutils `printf` command.
The error messages are more compliant with GNU coreutils. Also, floating hexadecimal numbers are now supported in `printf`.
1ec5b85
to
a85a792
Compare
GNU testsuite comparison:
|
Great! Thank you! |
This makes printf more compatible with its GNU coreutils counterpart.