Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clang adds "noundef" annotation to char arguments #56551

Open
RalfJung opened this issue Jul 15, 2022 · 13 comments
Open

Clang adds "noundef" annotation to char arguments #56551

RalfJung opened this issue Jul 15, 2022 · 13 comments

Comments

@RalfJung
Copy link
Contributor

RalfJung commented Jul 15, 2022

I think the way clang translates the following C code to LLVM IR is incorrect:

char id(char c) {
    return c;
}

void my_memcpy(char *src, char *dst, int n) {
    for (int i = 0; i < n; i++) {
        dst[i] = id(src[i]);
    }
}

The resulting IR defines id as @id(i8 noundef signext %0). The noundef is what I am concerned by. This translation means calling my_memcpy as follows leads to UB, since some of the bytes being copied here are undef or poison (namely, they are padding):

struct S {
    uint8_t f1;
    uint16_t f2;
};

void testcase() {
    struct S s, s_copy;
    s.f1 = 0;
    s.f2 = 0;
    my_memcpy((char*)&s, (char*)&s_copy, sizeof(struct S));
}

If I understand the C standard correctly, this program (running testcase) is entirely well-defined. In particular, C17 6.2.6.1 §5 says (emphasis mine)

Certain object representations need not represent a value of the object type. If the stored value of an object has such a representation and is read by an lvalue expression that does not have character type, the behavior is undefined.

But we are using a character type here. The standard also explicitly says in 6.2.6.2 that char types have no padding bits. So I don't think there is any room here for UB to arise when copying arbitrary data (including uninitialized memory) at char type. Therefore clang should not add noundef to character type variables.

Furthermore, I am not entirely sure what the status of this proposal is, but if it has been accepted, then I am not sure that adding noundef to any other integer type is correct, either. That proposal states explicitly

None of the integral types have extraordinary values.

And at least for C++, https://eel.is/c++draft/basic.fundamental#4 has a note on padding in integer types stating

Padding bits have unspecified value, but cannot cause traps.

So, at least for C++, I cannot see a justification for why clang adds noundef to all integer types. For non-character integer types in C, the standard is not clear enough for me to be sure either way.

Cc @aqjune @nunoplopes

@llvmbot
Copy link
Collaborator

llvmbot commented Jul 15, 2022

@llvm/issue-subscribers-clang-codegen

@nikic
Copy link
Contributor

nikic commented Jul 15, 2022

For C++ the justification is given by https://eel.is/c++draft/basic.indet. It's very clear about anything but unsigned char/std::byte being UB, and indeterminate unsigned char is only not UB for the specific set of operations listed, which notably does not include argument passing / returning.

At least as far as C++ is concerned, the indeterminate value UB doesn't have anything to do with trap representations.

@RalfJung
Copy link
Contributor Author

Oh I see, thanks for pointing that out. I have no idea how that can be true at the same time as the statements in basic.fundamental. The wonders of axiomatic specifications...

That would mean unsigned char / byte loads in C++ cannot be made noundef (once such a marker on load exists in the future), but noundef on arguments and return types indeed seems justified.

C doesn't have a clear enumeration like that to my knowledge, so it is less clear there. Documents like this make me think loading uninit data into a char is allowed (i.e., at least as much as C++ allows), and then I would usually think of course I can also pass them to another function, but clearly C++ doesn't satisfy that expectation...

@alercah
Copy link
Contributor

alercah commented Oct 8, 2022

My LLC/C++ is rusty, but I think that you're confusing indeterminate values and trap representations: a trap representation in C is an invalid but defined bit-pattern. C explicitly permits implementations to choose an integer representation such that you can have a trap representation that differs from a calid representation only by padding. C++ explicitly prohibits this, so you can safely write arbitrary bits to integer padding.

@RalfJung
Copy link
Contributor Author

RalfJung commented Oct 8, 2022

Those two notions are related though -- given the fuzziness of the standard, "type has no trap representation" is a hint indicating that it has no "strange" (indeterminate) values.

I didn't make this up, there are C committee discussions about indeterminate values that end up also discussing trap representations: https://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_260.htm, https://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_451.htm.

@alercah
Copy link
Contributor

alercah commented Oct 8, 2022

If I understand correctly, it's because the questions being asked about indeterminate values are ill-formed with respect to types with trap representations, because any read of an indeterminate value is UB for them (the compiler can just pick the result of the read to be the trap representation, and reading a trap representation is UB).

@RalfJung
Copy link
Contributor Author

RalfJung commented Apr 22, 2024

The question remains whether there's any way for the program above to have UB in C -- either in its form as given, or in the variant where id is inlined. (It seems like in C++, the inlined variant would be UB-free.) clang seems to say that the non-inlined version is UB, but I don't see what that is based on. The standard says (emphasis mine)

Certain object representations need not represent a value of the object type. If such a representation
is read by an lvalue expression that does not have character type, the behavior is undefined.

IOW, reading a non-value representation with an lvalue expression that has character type, like my example does, is well-defined. Therefore I think clang's noundef here is wrong.

@chorman0773
Copy link

FTR, in C also,

character types have ... no trap representations.

So an indeterminate value is just straight up an unspecified value for char. All character types lack any trap representation.

@aqjune
Copy link
Contributor

aqjune commented May 24, 2024

Hi,

I wonder whether this story also implies that padding bits should not be undef.
Padding bits in C++ have unspecified value, and if I understand unspecified value correctly, doing any operation should not raise undefined behavior unless the operation was supposed to do so on a 'plain' (non-unspecified, non-trap) value.
However, using undef in LLVM violates this condition because conditional branching on undef is UB whereas any plain i1 value cannot cause br to raise UB.

If this reasoning is right, we can think about an alternate solution as well: initializing padding bits with zero, and keeping the noundef at i8 args. Actually this will not only fix the reported issue but also fix this incorrect lowering of padding bits to undef.

@RalfJung
Copy link
Contributor Author

If this reasoning is right, we can think about an alternate solution as well: initializing padding bits with zero, and keeping the noundef at i8 args. Actually this will not only fix the reported issue but also fix this incorrect lowering of padding bits to undef.

I think that would still be wrong, since one should be able to copy uninitialized memory with char.

@aqjune
Copy link
Contributor

aqjune commented May 24, 2024

In the general case, I think the arguments slightly diverge in C and C++, which is what this thread has been discussing as well.

For C:

  • If there is no trap representation for non-char int types at all, it would mean that all uninitialized int variables will have unspecified values since indeterminate value is either unspecified value or trap repr. Since unspecified values are well-defined values, conceptually like freeze poison, having the noundef attributes at non-char int arguments will still hold because such value cannot appear from beginning. However, SROA should be fixed to lower load on an uninitialized int variable to some random, well-defined number rather than undef.
  • If non-char int types do have trap representations, having noundef for non-char types can be justified by C17 6.2.6.1 §5. But, for the char type, I think this is still slightly complicated due to the following problem.
char c = *(char*)p; // p is, say, int32_t* storing an indeterminate value
// If the above statement succeeded, c must be unspecified value, because char type doesn't have trap representation.

Note that if c was passed to a function f(char):

f(c);

Having noundef at the argument of f will still work because c has a well-defined value.

However, another problem with this C standard semantics understanding is that supporting elimination of a redundant store-load operation using char buffer becomes hard:

char c = *(char*)p; // p is, say, int32_t* storing an indeterminate value
// If the above statement succeeded, c must be unspecified value, because char type doesn't have trap representation.
*(char*)p = c; // storing this back

This round-trip pair cannot be removed because after the removal the byte at p will be more undefined. Again, we need to disable this optimization I think (which is bad).

For C++:

  • Having noundef for non-char int types are justified, as discussed above.
  • Having noundef for char type <- I could not find a reasonable argument to keep this. :) Should we remove this..?

@RalfJung
Copy link
Contributor Author

RalfJung commented May 26, 2024

I am only talking about C in the following.

If there is no trap representation for non-char int types at all, it would mean that all uninitialized int variables will have unspecified values since indeterminate value is either unspecified value or trap repr.

I think that is a reasonable interpretation of the standard, but unfortunately, the committee decided to take a much more creative stance in this defect report (and even more unfortunately, the standard never got clarified to make this explicit):

At any time that the compiler can determine that an object contains an indeterminate value, even if the type of the object does not have trap representations, the object may change value arbitrarily. Thus p need not have the same values at lines X and Y. As soon as the object is given an explicit value, this behaviour stops.

We are left with more questions than before. In particular, given all the effort the standard makes to give char special powers for directly accessing raw memory representations, it would be reasonable to expect that the above clarification does not apply to char objects, but who knows.

The interpretation currently taken by clang however makes it entirely impossible to implement memcpy inside the language as a byte-for-byte copy at char type. (clang only adds noundef to function arguments, but unlike C++ nothing in the C standard special-cases function arguments, so if noundef is okay for char arguments then it is okay for char everywhere.) Completely independent of what the standard says, I think it is clear that the actual C language out there that programmers use must be able to implement memcpy.

tl;dr what clang is doing here is probably not backed by the C standard (but the standard is too ambiguous to be sure), and certainly going against programmer expectations.

@aqjune
Copy link
Contributor

aqjune commented May 30, 2024

I think that is a reasonable interpretation of the standard, but unfortunately, the committee decided to take a much more creative stance in this defect report (and even more unfortunately, the standard never got clarified to make this explicit):

Thanks, then I must slightly fix my previous phrase below

Since unspecified values are well-defined values, conceptually like freeze poison, ...

to embrace more elastic interpretation by removing 'conceptually like poison'. I think my previous sentence 'doing any operation should not raise undefined behavior unless the operation was supposed to do so on a 'plain' (non-unspecified, non-trap) value.' from this comment is still valid.

I understood that a byte-copying code using char should be a valid code. And I could not find a statement about specializing passing char as a function argument in C and C++ either. The byte-copying code using the id function should have the same semantics (or, precisely speaking, this seems to be a less surprising interpretation).

char-typed function arg can have noundef in C if:

(1) Padding bits in LLVM are not undef anymore and become more defined bits such as '0' (which can affect generated code quality, I believe). Otherwise copying a struct with padding bits using char (with the id function) becomes undefined (a relevant discussion: https://groups.google.com/a/chromium.org/g/cxx/c/_CQ7g0A8H4w/m/2wmnyLa5BQAJ). Actually, undef padding bits not only make memcpy problematic but also make functions like memcmp have undefined behavior. Even if the result of memcmp might not make sense at all when padding bits are involved, making it undefined doesn't seem to be consistent with the fact that using unspecified value should not be undefined.

(2) the meaning of 'reading a memory location storing trap representation of non-char type as a char type' became clear, and if it means it successfully reads the location, it should immediately freeze the value because char value should always be well-defined (I used the term 'well-defined' here as a concept including unspecified values, to be clear.) (an example is in my comment). This depends on whether trap representation should always be some concrete bit pattern, or it can be a conceptual value like 'poison'. I think the interpretation of LLVM is more like latter, to facilitate optimizations.

Then, would the next action that should be taken now be removing noundef from char function arguments of C and C++ functions?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants