From 567bad87f83986999444e29af135a74a0e5dbc47 Mon Sep 17 00:00:00 2001 From: Michael Woerister Date: Thu, 20 Sep 2018 11:34:50 +0200 Subject: [PATCH 01/18] wip --- text/0000-symbol-name-mangling-v2.md | 499 +++++++++++++++++++++++++++ 1 file changed, 499 insertions(+) create mode 100644 text/0000-symbol-name-mangling-v2.md diff --git a/text/0000-symbol-name-mangling-v2.md b/text/0000-symbol-name-mangling-v2.md new file mode 100644 index 00000000000..a1db3e51de8 --- /dev/null +++ b/text/0000-symbol-name-mangling-v2.md @@ -0,0 +1,499 @@ +- Feature Name: symbol_name_mangling_v2 +- Start Date: 2018-10-01 +- RFC PR: (leave this empty) +- Rust Issue: (leave this empty) + +# Summary +[summary]: #summary + +This RFC proposes a new mangling scheme that describes what the symbol names for everything generated by the Rust compiler look like. This new scheme has a number of advantages over the existing one which has grown over time without a clear direction. The new scheme is consistent, does not depend on compiler internals, and the information it stores in symbol names can be decoded again which provides an improved experience for users of external tools that work with Rust symbol names. The new scheme is based on the name mangling scheme from the [Itanium C++ ABI][itanium-mangling]. + +# Motivation +[motivation]: #motivation + +Due to its ad-hoc nature, the compiler's current name mangling scheme has a +number of drawbacks: + +- It depends on compiler internals and its results cannot be replicated by another compiler implementation or external tool. +- Information about generic parameters and other things is lost in the mangling process. One cannot extract the type arguments of a monomorphized function from its symbol name. +- The current scheme is inconsistent: most paths use Itanium style encoding, but some of them don't. +- The symbol names it generates can contain `.` characters which is not generally supported on all platforms. 
\[[1][gas]\]\[[2][lld-windows-bug]\] \[[3][thin-lto-bug]\] + +[gas]: https://sourceware.org/binutils/docs/as/Symbol-Names.html#Symbol-Names +[lld-windows-bug]: https://github.com/rust-lang/rust/issues/54190 +[thin-lto-bug]: https://github.com/rust-lang/rust/issues/53912 + +The proposed scheme solves these problems: + +- It is defined in terms of the language, not in terms of compiler data-structures that can change at any given point in time. +- It encodes information about generic parameters in a reversible way. +- It has a consistent definition that does not rely on pretty-printing certain language constructs. +- It generates symbols that only consist of the characters `A-Z`, `a-z`, `0-9`, `_`, and `$`. + +This should make it easier for third party tools to work with Rust binaries. + +# Guide-level explanation +[guide-level-explanation]: #guide-level-explanation + +The following section will lay out the requirements for a name mangling scheme and then introduce the actual scheme through a series of ever more complex examples. + +## Requirements for a Symbol Mangling Scheme + +A symbol mangling scheme has a few goals, one of them essential, the rest of them desirable. The essential one is: + +- The scheme must provide an unambiguous string encoding for everything that can end up in a binary's symbol table. + +"Unambiguous" means that no two distinct compiler-generated entities (that is, mostly object code for functions) must be mapped to the same symbol name. This disambiguation is the main purpose of the hash-suffix in the current, legacy mangling scheme. The scheme proposed here, on the other hand, achieves it in a way that allows to also satisfy a number of additional desirable properties of a mangling scheme: + + - A mangled symbol should be *decodable* to some degree. That is, it is desirable to be able to tell which exact concrete instance of e.g. a polymorphic function a given symbol identifies. 
This is true for external tools, backtraces, or just people only having the binary representation of some piece of code available to them. With the current scheme, this kind of information gets lost in the magical hash-suffix. + + - It should be possible to predict the symbol name for a given source-level construct. For example, given the definition `fn foo() { ... }`, the scheme should allow to construct, by hand, the symbol names for e.g. `foo` or `foo, ...) -> !>()`. Since the current scheme generates its hash from the values of various compiler internal data structures, not even an alternative compiler implementation could predicate the symbol name even for simple cases. + + - A mangling should be platform-independent. This is mainly achieved by restricting the character set to `A-Z`, `a-z`, `0-9`, `_`, and `$`. All other characters might have special meaning in some context (e.g. `.` for MSVC `DEF` files) or are simply not supported (e.g. Unicode). + + - The scheme should be efficient, meaning that the symbols it produces are not unnecessarily long (because that takes up space in object files and means more work for compiler and linker) and that generating a symbol should not be too computationally expensive. + +Note that a source-level definition can contain components that will not show up in symbol names, like lifetimes (as in `fn foo<'a>()`). It is an explicit non-goal of this RFC to define a mangling for cases like the above. One might want to cover them "for completeness" but they are not actually needed. + + +## The Mangling Scheme by Example + +This section will develop an overview of the mangling scheme by walking through a number of examples. We'll start with the simplest case -- and see how that already involves things that might be surprising. + +### Free-standing Functions and Statics + +A free-standing function is fully identified via its absolute path. 
For example, the following function + +```rust +mod foo { + fn bar() {} +} +``` + +has the path `foo::bar` and `N3foo3barE` is a mangling of that path that complies to the character set we are restricted to. Why this format with numbers embedded in it? It is the encoding that the [Itanium C++ ABI][itanium-mangling] name mangling scheme uses for "nested names" (i.e. paths). The scheme proposed here will also use this format. + +However, the symbol name above does not unambiguously identify the function in every context. It is perfectly valid for another crate to also define `mod foo { fn bar() {} }` somewhere. So in order to avoid conflicts in such cases, fully qualified names always include the crate name and disambiguator, as in `N15mycrate_4a3b56d3foo3barE` (the crate disambiguator is used to disambiguate different versions of the same crate. It is an existing concept and not introduced by this RFC). + +There is one more possible ambiguity that we have to take care of: Rust has two distinct namespaces: the type and the value namespace. This leads to a path of the form `crate_id::foo::bar` not uniquely identifying the item `bar` because the following snippet is legal Rust code: + +```rust +fn foo() { + fn bar() {} +} + +mod foo { + fn bar() {} +} +``` + +The function `foo` lives in the value namespaces while the module `foo` lives in the type namespace. They don't interfere. In order to make the symbol names for the two distinct `bar` functions unique, we thus add a suffix to name components in the value namespace, so case one would get the symbol name `N15mycrate_4a3b56d3fooF3barFE` and case two get the name `N15mycrate_4a3b56d3foo3barFE` (notice the difference: `3fooF` vs `3foo`). + +As opposed to C++ and other languages that support function overloading, we don't need to include the argument types in the symbol name. Rust does not allow two functions of the same name but different arguments. 
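The encoding described so far can be sketched in a few lines of Rust. This is only an illustration under stated assumptions, not the compiler's implementation: `Namespace`, `mangle_component`, and `mangle_path` are hypothetical names, and the crate name and disambiguator are assumed to be joined by `_` into a single leading component, as the `15mycrate_4a3b56d` in the examples suggests.

```rust
/// Namespace of a path component, following the type/value distinction
/// drawn in the text. (Hypothetical helper type, not a compiler type.)
#[derive(Clone, Copy, PartialEq)]
enum Namespace {
    Type,  // e.g. modules
    Value, // e.g. functions
}

/// Encodes one component as `<byte length><identifier>`, followed by the
/// `F` suffix if the component lives in the value namespace.
fn mangle_component(ident: &str, ns: Namespace) -> String {
    let mut out = format!("{}{}", ident.len(), ident);
    if ns == Namespace::Value {
        out.push('F');
    }
    out
}

/// Builds the nested name `N <crate id> <components...> E`.
/// Assumption: the crate id is `<crate name>_<disambiguator>`.
fn mangle_path(
    crate_name: &str,
    disambiguator: &str,
    path: &[(&str, Namespace)],
) -> String {
    let crate_id = format!("{}_{}", crate_name, disambiguator);
    let mut out = String::from("N");
    out.push_str(&mangle_component(&crate_id, Namespace::Type));
    for &(ident, ns) in path {
        out.push_str(&mangle_component(ident, ns));
    }
    out.push('E');
    out
}
```

Under these assumptions, `mangle_path("mycrate", "4a3b56d", &[("foo", Namespace::Value), ("bar", Namespace::Value)])` yields `N15mycrate_4a3b56d3fooF3barFE`, and marking `foo` as `Namespace::Type` yields `N15mycrate_4a3b56d3foo3barFE`. Because every identifier carries its length, no terminator character is needed between components.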
+ +The final symbol name for the function would also include the prefix `_R` that is common to all symbol names generated by this scheme: + +``` + _RN15mycrate_4a3b56d3foo3barFE + + <><--------------------------> + | | +prefix fully qualified name + + +``` + +### Generic Functions + +Each monomorphization of a generic function has its own symbol name. The monomorphizations are disambiguated by the list of concrete generic arguments. These arguments are listed as a suffix, starting with `I`, after the name they belong to. So the instance + +```rust +std::mem::align_of::<f64> +``` + +would be mangled to + +``` +_RN12std_a1b2c3d43mem8align_ofFIdEE + ^^^ + ||| + start of argument list ---+|+--- end of argument list + | + f64 +``` + +where `I` starts the list of arguments, `d` designates `f64` and `E` ends the argument list. As we can see, we need to be able to represent all kinds of types that can be part of such an argument list. (In the future we might also need to represent *values* when const generics get added to the language.) These kinds of types are: + + - basic types (`char`, `()`, `str`, `!`, `i8`, `i16`, ...) + - reference and pointer types, shared and `mut` + - tuples + - arrays, with and without fixed size (e.g. `[u8]`, `[u8; 17]`, or as part of a slice type `&[char]`) + - structs, enums, closures, and other named types, possibly with their own set of type arguments + - function types such as `fn(&i32) -> u16` + +Basic types are all encoded via a single lower-case letter, like in the Itanium scheme. Named types are encoded as their fully qualified name (plus arguments), as is done for function symbols. Composites like references, tuples, and function types all follow a simple grammar given in the reference-level explanation below.
Here are some examples manglings to get a general feel of what they look like: + + - `std::mem::align_of::`: `_RN12std_a1b2c3d43mem8align_ofFIjEE` + - `std::mem::align_of::<&char>`: `_RN12std_a1b2c3d43mem8align_ofFIRcEE` + - `std::mem::align_of::`: `_RN12std_a1b2c3d43mem8align_ofFIN12std_a1b2c3d43mem12DiscriminantEEE` + - `std::mem::align_of::<&mut (&str,())>`: `_RN12std_a1b2c3d43mem8align_ofFIWTRrvEEE` + +There's one more thing we have to take into account for generic functions: The compiler may produce "crate-local" copies of a monomorphization. That is, if there is a function `foo` which gets used as `foo` in two different crates, the compiler (depending on the optimization level) might generate two distinct functions at the LLVM IR level, each with it's own symbol. In order to support this without running into conflicts, symbol names for monomorphizations must include the id of the crate they are instantiated for. This scheme does this by appending an `$` suffix to the symbol. So for example the mangling for `std::mem::align_of::` would actually look like this: + +``` +_RN12std_a1b2c3d43mem8align_ofFIjEE$foo_a1b2c3d4 (for crate "foo/a1b2c3d4") +_RN12std_a1b2c3d43mem8align_ofFIjEE$bar_11223344 (for crate "bar/11223344") +``` + +### Closures and Closure Environments + +The scheme needs to be able to generate symbol names for the function containing the code of a closure and it needs to be able to refer to the type of a closure if it occurs as a type argument. As closures don't have a name, we need to generate one. The scheme takes a simple approach here: Each closure gets assigned an index (unique within the item defining it) and from that we generate a name of the form `c$`. The `$` makes sure that the name cannot clash with user-defined names. 
The full name of a closure is then constructed like for any other named item: + +```rust +mod foo { + fn bar(x: u32) { + let a = |x| { x + 1 }; // ~ c$0 + let b = |x| { x + 2 }; // ~ c$1 + + a(b(x)) + } +} + +``` + +In the above example we have two closures, the one assigned to `a` and the one assigned to `b`. The first one would get the local name `c$0` and the second one the name `c$1`. Their full names would then be `N15mycrate_4a3b56d3foo3barF3c$0FE` and `N15mycrate_4a3b56d3foo3barF3c$1FE` respectively. The type of their environment would be the same, except for not having the `F` suffix to their local name. + +### Inherent Methods + +Inherent methods (that is, methods that are not part of a trait implementation) are represented by a symbol of the form: + +``` +_RM [] [] +``` + +The `M` designates the symbol as an inherent method. The self-type is encoded like any other type argument and already contains the concrete type arguments of the `impl` defining the method. The method name is unique among all inherent methods for the given type, so we don't need to further qualify it. The method can have type arguments of its own. These are encoded like other argument lists as `I + E`. If the method is generic in any way, it will also need the instantiating crate suffix, like any other generic item. 
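As a rough illustration of how these pieces combine, the following sketch assembles an inherent-method symbol from an already mangled self-type. It is a sketch under assumptions, not the reference implementation: `length_prefixed` and `mangle_inherent_method` are hypothetical names, the method's own generic argument list is left out, and the layout (prefix `_RM`, self-type, length-prefixed method name, optional `$`-separated instantiating crate) is inferred from the example symbols in this section of the draft.

```rust
/// Encodes an identifier as `<byte length><identifier>`.
fn length_prefixed(ident: &str) -> String {
    format!("{}{}", ident.len(), ident)
}

/// Assembles `_RM`, the mangled self-type, the method name, and (only for
/// generic instances) the instantiating-crate suffix introduced by `$`.
/// `mangled_self_type` must already be mangled, e.g. `N...E`.
fn mangle_inherent_method(
    mangled_self_type: &str,
    method: &str,
    instantiating_crate: Option<&str>,
) -> String {
    let mut out = String::from("_RM");
    out.push_str(mangled_self_type);
    out.push_str(&length_prefixed(method));
    if let Some(krate) = instantiating_crate {
        out.push('$');
        out.push_str(krate);
    }
    out
}
```

Calling `mangle_inherent_method("N15mycrate_4a3b56d3foo3BarE", "panic_please", None)` produces `_RMN15mycrate_4a3b56d3foo3BarE12panic_please`. Passing `Some("downstream_crate_x_abcd1234")` appends `$downstream_crate_x_abcd1234`, which is how a method on a generic self-type would be distinguished per instantiating crate in this draft.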
+ +Here's an example for a non-generic method: + +```rust +mod foo { + struct Bar; + + impl Bar { + pub fn panic_please() { panic!() } + } +} +``` + +The resulting symbol name looks like: + +``` +_RMN15mycrate_4a3b56d3foo3BarE12panic_please + + <-------------------------><------------> + self-type method name +``` + +A method with a generic self-type is a bit longer, since it also contains the instantiating-crate-suffix: + +```rust +mod foo { + struct Bar; + + impl Bar { + pub fn panic_please() { panic!() } + } +} +``` + +The symbol for `foo::Bar::panic_please` would look like this: + +``` +_RMN15mycrate_4a3b56d3foo3BarIcEE12panic_please$downstream_crate_x_abcd1234 + <----------------------------><------------><--------------------------> + self-type method name instantiating crate +``` + + +### Trait Methods + +Trait methods are similar to inherent methods, but in addition to the self-type the symbol name must also contain the trait being implemented: + +``` +_RX [] [] +``` + +The `X` signifies that this is a trait method. The trait being implemented is encoded `N + [I+ E] E`, like a named type. Here is a complex example with generics in all the places: + +```rust +mod foo { + trait Foo { + fn id(x: T) -> T; + } +} + +mod bar { + struct Bar; +} + +mod baz { + impl Foo for Bar { + fn id(x: T) -> T { x } + } +} +``` + +The mangling for ` as Foo>::id::` would be: + +``` +_RXN15mycrate_4a3b56d3foo3FooIiEEN15mycrate_4a3b56d3bar3BarIcEE2idIjE$downstream_crate_x_abcd1234 + <----------------------------><----------------------------><-><-><--------------------------> + trait self-type method instantitating crate +``` + +One thing that's interesting here is that `baz`, the module the impl is situated in, does not show up anywhere in the mangled name. + +### Items Within Specialized Trait Impls + +In Rust one can define items within generic items, e.g. 
functions or impls, like in the following example: + +```rust +fn foo(a: T) -> (u32, T) { + static mut X: u32 = 0; + + unsafe { + X += 1; + (X, a) + } +} +``` + +The `X` here (or any other such nested definition) does not inherit the generic context. `X` is non-generic, and a function defined in its place would be too. Consequently, when giving the path to something defined within a generic item, one does not specify the generic arguments because they add no information. The fully qualified name of `X` is thus `my_crate/a1b2c3d4::foo::X` and its symbol name: + +``` +_RN15mycrate_4a3b56d3fooF1XFE +``` + +However, there is at least one case where the type arguments *do* matter for a defintion like this, and that is when impl specialization is used. Consider the following piece of code: + +``` +trait Foo { + fn foo() -> T; +} + +struct Bar(T); + +impl Foo for Bar { + default fn foo() -> T { + static MSG: &str = "sry, no can do"; + panic!("{}", MSG) + } +} + +impl Foo for Bar { + fn foo() -> T { + static MSG: &str = "it's a go!"; + println!("{}", MSG); + T::default() + } +} + +``` + +Notice that both `MSG` statics have the path `::foo::MSG` if you just leave off the type arguments. However, we also don't have any concrete types to substitute the arguments for. 
Therefore, we have to encode the type parameters and their bounds for cases like this so that the symbol name will be a mangling of something like ` as Foo>::foo::MSG where T: Clone`: + +``` +_RXI1TIN12std_a1b2c3d47default7DefaultEEEN15mycrate_4a3b56d3FooI1TEEN15mycrate_4a3b56d3BarI1TEE3foo3MSG + + -------------------------------------- "where clause" + -- T + ---------------------------------- bounds to T + -------------------------------- std::default::Default + --------------------------- Foo + --------------------------- Bar + foo ---- + MSG ---- + +``` + + +### Compiler-generated Items (Drop-Glue, Shims, etc) + +The compiler generates a number of things that can end up needing an entry in the symbol table: + + - Drop-glue is what recursively calls `Drop::drop()` for components of composite type. Generating symbol names for it is straightforward. They are of the form `_RG` where `` is the usual mangling as used for generic arguments. + + - Various "shims", that is, compiler-generated implementations of built-in traits like `Fn`, `FnMut`, `FnOnce`, or `Clone`, or for dynamic dispatch via trait objects. These are similar in structure to drop glue. Their precise mangling is specified in the reference-level explanation below. + +### Unicode Identifiers + +Rust allows unicode identifiers but our character set is restricted to ASCII alphanumerics, `_`, and `$`. In order to transcode the former to the latter, we use the same approach as Swift, which is: encode all identifiers via [Punycode][punycode], a standardized and efficient encoding that keeps encoded strings in a rather human-readable format. So for example, the string + +```rust +"Gödel, Escher, Bach" +``` + +is encoded as + +```rust +"Gdel, Escher, Bach-d3b" +``` + +which, as opposed to something like _Base64_, still gives a pretty good idea of what the original string looked like. + +Each component of a name, i.e. 
anything that starts with the number of bytes to read in the examples above, is encoded individually. Components encoded this way also start with the number of bytes to read, but that number is prefixed with a `0`. As an example, the function: + +```rust +mod gödel { + mod escher { + fn bach() {} + } +} +``` + +would be mangled as: + +``` +_RN15mycrate_4a3b56d08gdel_5qa6escher4bachFE` + <--------> + unicode component +``` + +### Compression/Substitution + +The length of symbol names has an influence on how much work compiler, linker, and loader have to perform. The shorter the names, the better. At the same time, Rust's generics can lead to rather long names (which are often not visible in the code because of type inference and `impl Trait`). For example, the return type of the following function: + +```rust +fn quux(s: Vec) -> impl Iterator { + s.into_iter() + .map(|x| x+1) + .filter(|&x| x > 10) + .zip(0..) + .chain(iter::once((0, 0))) +} +``` + +is + +```rust +std::iter::Chain< + std::iter::Zip< + std::iter::Filter< + std::iter::Map< + std::vec::IntoIter, + [closure@src/main.rs:16:11: 16:18]>, + [closure@src/main.rs:17:14: 17:25]>, + std::ops::RangeFrom>, + std::iter::Once<(u32, usize)>> +``` + +It would make for a symbol name if this types is used (maybe repeatedly) as a generic argument somewhere. C++ has the same problem with its templates; which is why the Itanium mangling introduces the concept of compression. If a component of a definition occurs more than once, it will not be repeated and instead be emitted as a substitution marker that allows to reconstruct which component it refers to. The scheme proposed here will use the same approach. 
+ +The exact scheme will be described in detail in the reference level explanation below but it roughly works as follows: As a mangled symbol name is being built or parsed, we build up a dictionary of "substitutions", that is we keep track of things a subsequent occurrence of which could be replaced by a substitution marker. The substitution marker is then the lookup key into this dictionary. The things that are eligible for substitution are (1) all prefixes of qualified names (including the entire name itself) and (2) all types except for basic types. If a substitutable item is already present in the dictionary it does not generate a new key. Here's an example in order to illustrate the concept: + +``` + + std::iter::Chain, std::vec::IntoIter>> +$0: --- +$1: --------- +$2: ---------------- +$3: -------------- +$4: -------- +$5: ------------------ +$6: ----------------------- +$7: ---------------------------------------------------------------- +$8: ---------------------------------------------------------------------------------- +``` + +The indices on the left are the dictionary keys. The prefixes `std`, `std::iter`, and `std::iter::Chain` all get added to the dictionary because we have not seen them before. After that we encounter `std` again. We've already seen it, so we don't add anything to the dictionary. The same goes for when we encounter `std::iter` the second time. Next we encounter `std::iter::Zip`, which we have not seen before, so it's added to the dictionary. Next we encounter `std` again (already seen, no insertion), then `std::vec` and `std::vec::IntoIter` which both generate a new entry. Next we see `std::vec::IntoIter`, the first full _type_. It generates an entry too. The second type parameter is the same as the first. No part of it introduces a new entry. After the next `>` we have completely processed `std::iter::Zip, std::vec::IntoIter>`, which adds another type entry. 
Finally, the full `std::iter::Chain, std::vec::IntoIter>>` adds another entry. + +Using the dictionary above, we can compress to: + +``` +std::iter::Chain<$1::Zip<$0::vec::IntoIter, $6>> +``` + +A couple of things to note: + + - The first occurrence of a dictionary entry is never substituted. We don't store the dictionary anywhere and need to be able to reconstruct it from the compressed version. + - Longer substitutions are preferred to shorter ones. `std::iter::Chain<$1::Zip<$0::vec::IntoIter, $4::IntoIter>>` would also decompress to the original version but the compiler is supposed to always pick the longest substitution available. + +The mangled version of a substitution marker is `S _` (and `S_` for key `0`) like it in Itanium mangling. So the above definition would be mangled to: + +``` +_RN12std_a1b2c3d44iter5ChainINS0_3ZipINS_3vec8IntoIterIjEES5_EEE +``` + +The uncompressed version would be: +``` +_RN12std_a1b2c3d44iter5ChainIN12std_a1b2c3d44iter3ZipIN12std_a1b2c3d43vec8IntoIterIjEEN12std_a1b2c3d43vec8IntoIterIjEEEEE +``` + +# Reference-level explanation +[reference-level-explanation]: #reference-level-explanation + +This is the technical portion of the RFC. Explain the design in sufficient detail that: + +- Its interaction with other features is clear. +- It is reasonably clear how the feature would be implemented. +- Corner cases are dissected by example. + +The section should return to the examples given in the previous section, and explain more fully how the detailed proposal makes those examples work. + +# Drawbacks +[drawbacks]: #drawbacks + +Why should we *not* do this? + +- The scheme is rather complex, especially due to compression (albeit not more complex than prior art) +- The current/legacy scheme based on symbol-hashes is flexible in that hashes can be changed at will. That is, the unstable part of the current scheme mangling is nicely contained and does not keep breaking external tools. 
The danger of breakage is greater with the scheme proposed here because it exposes more information. + + +# Rationale and alternatives +[rationale-and-alternatives]: #rationale-and-alternatives + +The alternatives considered are: + + - Keeping the current scheme. It does meet the minimum requirements after all. It also has pretty big downsides. + - Keeping the current scheme but cleaning it up by making the non-hash part more consistent and more expressive. Keep the hash part as a safe guard against symbol conflicts and the rest as something just for demangling. The downside of this is that the hash would still not be predictable, and symbols would get rather long if they should contain more human-readable information about generic arguments. + - Define a standardized pretty-printing format for things that end up as symbols, and then encode that via Punycode in order to meet the character set restrictions. This would be rather simple. Symbol names would remain somewhat human-readable (but not very, because all separators would be stripped out). But without some kind of additional compression, symbol names would become rather long. + - Use the scheme from the previous bullet point but apply the compression scheme described above. We could do this but it wouldn't really be less complex than the Itanium inspired scheme proposed above. + +The Itanium mangling (and by extension the scheme proposed here) could be considered somewhat arcane. But it is well-known from C++ and provides a good trade-off between readability, complexity, and length of generated symbols. + +# Prior art +[prior-art]: #prior-art + +The mangling scheme described here is an adaptation of the [Itanium C++ ABI][itanium-mangling] scheme, +which is the scheme used by the GCC toolchain (and clang when it's not compiling for MSVC). In fact, +the scheme proposed here tries to stay as close as possible to Itanium mangling and only deviates +where something does not make sense for Rust. 
+ +One notable improvement the proposed scheme makes upon Itanium mangling is explicit handling of +unicode identifiers. The idea of using [Punycode][punycode] for this is taken from the +[Swift][swift-gh] programming language's [mangling scheme][swift-mangling] (which is also based on +Itanium mangling). + + +[punycode]: https://tools.ietf.org/html/rfc3492 +[itanium-mangling]: http://refspecs.linuxbase.org/cxxabi-1.86.html#mangling +[swift-gh]: https://github.com/apple/swift +[swift-mangling]: https://github.com/apple/swift/blob/master/docs/ABI/Mangling.rst#identifiers + + +# Unresolved questions +[unresolved-questions]: #unresolved-questions + +- Should we introduce a `St` substitution for the `::std::` to the compression scheme (like Itanium does). This would preclude mixing symbols from different versions of the standard library into a single binary (we'd not have a crate disambiguator for libstd). It's unclear whether that can occur in practice. +- Similar to the above, common items, like built-in bounds, could get predefined abbreviations. +- Is the compression scheme unambiguous? That is, is it always clear which substitutions the compiler should choose? (a reference implementation of the algorithm will solve this) +- Is the scheme for disambiguating specialized impls sound? +- Should symbols include information that might help during debugging/analyzing a program but that is not strictly necessary for avoiding name conflicts? Examples of such information would be names and types of function parameters or the ABI of functions. +- Should named items (everything of the form `N...E`) *not* start with `N` but instead with something that gives a hint of what it is? E.g. `F` for functions, `S` for statics, `C` for closures, etc? This is not needed for disambiguation but it would add more information to the symbol name without really increasing the complexity of the scheme or the length of names. 
(Although it makes compression a bit less straightforward to describe.) + +# Appendix - Interesting Examples + +TODO + - specializing impls + - impl Trait + - closure environment as a type parameter + - various examples of compression From 61045094034e783b8a2af750e56fea14ee925ab0 Mon Sep 17 00:00:00 2001 From: Michael Woerister Date: Wed, 21 Nov 2018 18:12:15 +0100 Subject: [PATCH 02/18] First update after reference implementation is done. --- text/0000-symbol-name-mangling-v2.md | 327 +++++++++++++++------------ 1 file changed, 183 insertions(+), 144 deletions(-) diff --git a/text/0000-symbol-name-mangling-v2.md b/text/0000-symbol-name-mangling-v2.md index a1db3e51de8..63106421153 100644 --- a/text/0000-symbol-name-mangling-v2.md +++ b/text/0000-symbol-name-mangling-v2.md @@ -8,6 +8,8 @@ This RFC proposes a new mangling scheme that describes what the symbol names for everything generated by the Rust compiler look like. This new scheme has a number of advantages over the existing one which has grown over time without a clear direction. The new scheme is consistent, does not depend on compiler internals, and the information it stores in symbol names can be decoded again which provides an improved experience for users of external tools that work with Rust symbol names. The new scheme is based on the name mangling scheme from the [Itanium C++ ABI][itanium-mangling]. +Note that, at this point, the new mangling scheme would not be part of the language specification or the specification of a stable ABI for Rust code. In the future it could be part of both and it is designed to be stable and extensible but for the time being it would still be an implementation detail of the Rust compiler. + # Motivation [motivation]: #motivation @@ -28,7 +30,7 @@ The proposed scheme solves these problems: - It is defined in terms of the language, not in terms of compiler data-structures that can change at any given point in time. 
- It encodes information about generic parameters in a reversible way. - It has a consistent definition that does not rely on pretty-printing certain language constructs. -- It generates symbols that only consist of the characters `A-Z`, `a-z`, `0-9`, `_`, and `$`. +- It generates symbols that only consist of the characters `A-Z`, `a-z`, `0-9`, and `_`. This should make it easier for third party tools to work with Rust binaries. @@ -49,11 +51,16 @@ A symbol mangling scheme has a few goals, one of them essential, the rest of the - It should be possible to predict the symbol name for a given source-level construct. For example, given the definition `fn foo() { ... }`, the scheme should allow to construct, by hand, the symbol names for e.g. `foo` or `foo, ...) -> !>()`. Since the current scheme generates its hash from the values of various compiler internal data structures, not even an alternative compiler implementation could predicate the symbol name even for simple cases. - - A mangling should be platform-independent. This is mainly achieved by restricting the character set to `A-Z`, `a-z`, `0-9`, `_`, and `$`. All other characters might have special meaning in some context (e.g. `.` for MSVC `DEF` files) or are simply not supported (e.g. Unicode). + - A mangling should be platform-independent. This is mainly achieved by restricting the character set to `A-Z`, `a-z`, `0-9`, `_`. All other characters might have special meaning in some context (e.g. `.` for MSVC `DEF` files) or are simply not supported (e.g. Unicode). - The scheme should be efficient, meaning that the symbols it produces are not unnecessarily long (because that takes up space in object files and means more work for compiler and linker) and that generating a symbol should not be too computationally expensive. -Note that a source-level definition can contain components that will not show up in symbol names, like lifetimes (as in `fn foo<'a>()`). 
It is an explicit non-goal of this RFC to define a mangling for cases like the above. One might want to cover them "for completeness" but they are not actually needed. +The RFC also has a couple of non-goals: + + - Source-level definitions can contain components that will not show up in symbol names, like lifetimes (as in `fn foo<'a>()`). This RFC does not define a mangling for cases like the above. One might want to cover them "for completeness" but they are not actually needed. + + - The mangling scheme does not try to be compatible with an existing C++ mangling scheme. While it might sound tempting to encode Rust symbols with an existing scheme, it is the author's opinion that the actual benefits are small (C++ tools would not demangle to Rust syntax, demanglings would be hard to read) and at the same time supporting a Rust-specific scheme in existing tools seems quite feasible (many tools like GDB, LLDB, binutils, and valgrind already have specialized code paths for Rust symbols). + ## The Mangling Scheme by Example @@ -70,11 +77,11 @@ mod foo { } ``` -has the path `foo::bar` and `N3foo3barE` is a mangling of that path that complies to the character set we are restricted to. Why this format with numbers embedded in it? It is the encoding that the [Itanium C++ ABI][itanium-mangling] name mangling scheme uses for "nested names" (i.e. paths). The scheme proposed here will also use this format. +has the path `foo::bar` and `N3foo3barE` is a mangling of that path that complies to the character set we are restricted to. Why this format with numbers embedded in it? It is the encoding that the [Itanium C++ ABI][itanium-mangling] name mangling scheme uses for "nested names" (i.e. paths). The scheme proposed here will also use this format because it does not need termination tokens for identifiers (which are hard to come by with our limited character set). However, the symbol name above does not unambiguously identify the function in every context. 
It is perfectly valid for another crate to also define `mod foo { fn bar() {} }` somewhere. So in order to avoid conflicts in such cases, fully qualified names always include the crate name and disambiguator, as in `N15mycrate_4a3b56d3foo3barE` (the crate disambiguator is used to disambiguate different versions of the same crate. It is an existing concept and not introduced by this RFC). -There is one more possible ambiguity that we have to take care of: Rust has two distinct namespaces: the type and the value namespace. This leads to a path of the form `crate_id::foo::bar` not uniquely identifying the item `bar` because the following snippet is legal Rust code: +There is another possible ambiguity that we have to take care of. Rust has two distinct namespaces: the type and the value namespace. This leads to a path of the form `crate_id::foo::bar` not uniquely identifying the item `bar` because the following snippet is legal Rust code: ```rust fn foo() { @@ -86,18 +93,29 @@ mod foo { } ``` -The function `foo` lives in the value namespaces while the module `foo` lives in the type namespace. They don't interfere. In order to make the symbol names for the two distinct `bar` functions unique, we thus add a suffix to name components in the value namespace, so case one would get the symbol name `N15mycrate_4a3b56d3fooF3barFE` and case two get the name `N15mycrate_4a3b56d3foo3barFE` (notice the difference: `3fooF` vs `3foo`). +The function `foo` lives in the value namespace while the module `foo` lives in the type namespace. They don't interfere. In order to make the symbol names for the two distinct `bar` functions unique, we thus add a suffix to name components in the value namespace, so case one would get the symbol name `N15mycrate_4a3b56d3fooV3barVE` and case two gets the name `N15mycrate_4a3b56d3foo3barVE` (notice the difference: `3fooV` vs `3foo`). + +There is one final case of name ambiguity that we have to take care of. 
Because of macro hygiene, multiple items with the same name can appear in the same context. The compiler internally disambiguates such names by augmenting them with a numeric index. For example, the first occurrence of the name `foo` within its parent is actually treated as `foo'0`, the second occurrence would be `foo'1`, the next `foo'2`, and so on. The mangling scheme will adopt this setup by appending a disambiguation suffix to each identifier with a non-zero index. So if macro expansion would result in the following code: + +```rust +mod foo { + fn bar() {} + // The second `bar` function was introduced by macro expansion. + fn bar*() {} +} +``` +Then we would encode the two function symbols as `N15mycrate_4a3b56d3foo3barVE` and `N15mycrate_4a3b56d3foo3barVs_E` respectively (note the `s_` suffix). The details on the shape of this suffix are provided in the reference-level description. -As opposed to C++ and other languages that support function overloading, we don't need to include the argument types in the symbol name. Rust does not allow two functions of the same name but different arguments. +As opposed to C++ and other languages that support function overloading, we don't need to include function parameter types in the symbol name. Rust does not allow two functions of the same name but different arguments. 
The final symbol name for the function would also include the prefix `_R` that is common to all symbol names generated by this scheme: ``` - _RN15mycrate_4a3b56d3foo3barFE` + _RN15mycrate_4a3b56d3foo3barVE` <><--------------------------> | | -prefix fully qualified name +prefix absolute path ``` @@ -113,7 +131,7 @@ std::mem::align_of:: would be mangled to ``` -_RN12std_a1b2c3d43mem8align_ofFIdEE +_RN12std_a1b2c3d43mem8align_ofVIdEE ^^^ ||| start of argument list ---+|+--- end of argument list @@ -130,29 +148,29 @@ where `I` starts the list of arguments, `d` designates `f64` and `E` ends the ar - structs, enums, closures, and other named types, possibly with their own set of type arguments - function types such as `fn(&i32) -> u16` -Basic types are all encoded via a single lower-case letter, like in the Itanium scheme. Named types are encoded as their fully qualified name (plus arguments) like is done for function symbols. Composites like references, tuples, and function types all follow simple grammar given in the reference-level explanation below. Here are some examples manglings to get a general feel of what they look like: +Basic types are all encoded via a single lower-case letter, like in the Itanium scheme. Named types are encoded as their fully qualified name (plus arguments) like is done for function symbols. Composites like references, tuples, and function types all follow simple grammar given in the reference-level explanation below. 
Here are some example manglings to get a general feel of what they look like: - - `std::mem::align_of::`: `_RN12std_a1b2c3d43mem8align_ofFIjEE` - - `std::mem::align_of::<&char>`: `_RN12std_a1b2c3d43mem8align_ofFIRcEE` - - `std::mem::align_of::`: `_RN12std_a1b2c3d43mem8align_ofFIN12std_a1b2c3d43mem12DiscriminantEEE` - - `std::mem::align_of::<&mut (&str,())>`: `_RN12std_a1b2c3d43mem8align_ofFIWTRrvEEE` + - `std::mem::align_of::`: `_RN12std_a1b2c3d43mem8align_ofVIjEE` + - `std::mem::align_of::<&char>`: `_RN12std_a1b2c3d43mem8align_ofVIRcEE` + - `std::mem::align_of::`: `_RN12std_a1b2c3d43mem8align_ofVIN12std_a1b2c3d43mem12DiscriminantEEE` + - `std::mem::align_of::<&mut (&str,())>`: `_RN12std_a1b2c3d43mem8align_ofVIWTRrvEEE` -There's one more thing we have to take into account for generic functions: The compiler may produce "crate-local" copies of a monomorphization. That is, if there is a function `foo` which gets used as `foo` in two different crates, the compiler (depending on the optimization level) might generate two distinct functions at the LLVM IR level, each with it's own symbol. In order to support this without running into conflicts, symbol names for monomorphizations must include the id of the crate they are instantiated for. This scheme does this by appending an `$` suffix to the symbol. So for example the mangling for `std::mem::align_of::` would actually look like this: +There's one more thing we have to take into account for generic functions: The compiler may produce "crate-local" copies of a monomorphization. That is, if there is a function `foo` which gets used as `foo` in two different crates, the compiler (depending on the optimization level) might generate two distinct functions at the LLVM IR level, each with its own symbol. In order to support this without running into conflicts, symbol names for monomorphizations must include the id of the crate they are instantiated for. This scheme does this by appending an `` suffix to the symbol. 
So for example the mangling for `std::mem::align_of::` would actually look like this: ``` -_RN12std_a1b2c3d43mem8align_ofFIjEE$foo_a1b2c3d4 (for crate "foo/a1b2c3d4") -_RN12std_a1b2c3d43mem8align_ofFIjEE$bar_11223344 (for crate "bar/11223344") +_RN12std_a1b2c3d43mem8align_ofVIjEE12foo_a1b2c3d4 (for crate "foo/a1b2c3d4") +_RN12std_a1b2c3d43mem8align_ofVIjEE12bar_11223344 (for crate "bar/11223344") ``` ### Closures and Closure Environments -The scheme needs to be able to generate symbol names for the function containing the code of a closure and it needs to be able to refer to the type of a closure if it occurs as a type argument. As closures don't have a name, we need to generate one. The scheme takes a simple approach here: Each closure gets assigned an index (unique within the item defining it) and from that we generate a name of the form `c$`. The `$` makes sure that the name cannot clash with user-defined names. The full name of a closure is then constructed like for any other named item: +The scheme needs to be able to generate symbol names for the function containing the code of a closure and it needs to be able to refer to the type of a closure if it occurs as a type argument. As closures don't have a name, we need to generate one. The scheme proposes to use the namespace and disambiguation mechanisms already introduced above for this purpose. Closures get their own "namespace" (i.e. they are neither in the type nor the value namespace), and each closure has an empty name with a disambiguation index (like for macro hygiene) identifying them within their parent. The full name of a closure is then constructed like for any other named item: ```rust mod foo { fn bar(x: u32) { - let a = |x| { x + 1 }; // ~ c$0 - let b = |x| { x + 2 }; // ~ c$1 + let a = |x| { x + 1 }; // ~ 0C + let b = |x| { x + 2 }; // ~ 0Cs_ a(b(x)) } @@ -160,97 +178,43 @@ mod foo { ``` -In the above example we have two closures, the one assigned to `a` and the one assigned to `b`. 
The first one would get the local name `c$0` and the second one the name `c$1`. Their full names would then be `N15mycrate_4a3b56d3foo3barF3c$0FE` and `N15mycrate_4a3b56d3foo3barF3c$1FE` respectively. The type of their environment would be the same, except for not having the `F` suffix to their local name. - -### Inherent Methods - -Inherent methods (that is, methods that are not part of a trait implementation) are represented by a symbol of the form: - -``` -_RM [] [] -``` - -The `M` designates the symbol as an inherent method. The self-type is encoded like any other type argument and already contains the concrete type arguments of the `impl` defining the method. The method name is unique among all inherent methods for the given type, so we don't need to further qualify it. The method can have type arguments of its own. These are encoded like other argument lists as `I + E`. If the method is generic in any way, it will also need the instantiating crate suffix, like any other generic item. - -Here's an example for a non-generic method: - -```rust -mod foo { - struct Bar; - - impl Bar { - pub fn panic_please() { panic!() } - } -} -``` - -The resulting symbol name looks like: - -``` -_RMN15mycrate_4a3b56d3foo3BarE12panic_please - - <-------------------------><------------> - self-type method name -``` +In the above example we have two closures, the one assigned to `a` and the one assigned to `b`. The first one would get the local name `0C` and the second one the name `0Cs_`. The `0` signifies the length of their (empty) name. The `C` is the namespace tag, analogous to the `V` tag for the value namespace. The `s_` for the second closure is the disambiguation index (index `0` is, again, encoded by not appending a suffix). Their full names would then be `N15mycrate_4a3b56d3foo3barV0CE` and `N15mycrate_4a3b56d3foo3barV0Cs_E` respectively. 
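The path examples given so far (length-prefixed components, the `V` and `C` namespace tags, and the `s_` suffix) can be reproduced with a small sketch. This is an illustration only, not the compiler's implementation: the `Namespace` and `Component` types and the `mangle_path` function are hypothetical names, the crate-id is treated here as an ordinary untagged component, and only disambiguation indices 0 and 1 are handled, since the encoding of larger indices is left to the reference-level description.

```rust
/// Namespace of a path component (hypothetical helper type).
#[derive(Clone, Copy)]
enum Namespace {
    Type,    // no tag
    Value,   // tagged with `V`
    Closure, // tagged with `C`
}

/// One component of an absolute path (hypothetical helper type).
#[derive(Clone, Copy)]
struct Component<'a> {
    name: &'a str,        // empty for closures
    namespace: Namespace,
    index: u32,           // disambiguation index; 0 means "no suffix"
}

/// Builds `N <components> E` for ASCII-only names.
/// Returns `None` for indices >= 2, which this sketch does not cover.
fn mangle_path(components: &[Component<'_>]) -> Option<String> {
    let mut out = String::from("N");
    for c in components {
        // Length prefix (in bytes), then the identifier itself.
        out.push_str(&c.name.len().to_string());
        out.push_str(c.name);
        // Namespace tag: nothing for types, `V` for values, `C` for closures.
        match c.namespace {
            Namespace::Type => {}
            Namespace::Value => out.push('V'),
            Namespace::Closure => out.push('C'),
        }
        // Disambiguation suffix: index 0 has none, index 1 is `s_`.
        match c.index {
            0 => {}
            1 => out.push_str("s_"),
            _ => return None,
        }
    }
    out.push('E');
    Some(out)
}
```

Feeding it the components `mycrate_4a3b56d`, `foo`, `bar` (value namespace), and an unnamed closure with index 1 yields the second closure name from the example above.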
-A method with a generic self-type is a bit longer, since it also contains the instantiating-crate-suffix: +### Methods -```rust -mod foo { - struct Bar; +Methods are nested within `impl` or `trait` items. As such it would be possible to construct their symbol names as paths like `my_crate::foo::{{impl}}::some_method` where `{{impl}}` somehow identifies the `impl` in question. Since `impl`s don't have names, we'd have to use an indexing scheme like the one used for closures (and indeed, this is what the compiler does internally). Adding in generic arguments too, this would lead to symbol names looking like `my_crate::foo::impl'17::::some_method`. - impl Bar { - pub fn panic_please() { panic!() } - } -} -``` +However, in the opinion of the author these symbols are very hard to map back to the method they represent. Consider a module containing dozens of types, each with multiple `impl` blocks generated via `#[derive(...)]`. In order to find out which method a symbol belongs to, one would have to count the number of handwritten and macro generated `impl` blocks in the module, and hope that one correctly guessed the number of `impl` blocks introduced by the given derive-macro (each macro invocation can introduce `0..n` such blocks). The name of the method might give a hint, but there are still likely to be dozens of methods named `clone`, `hash`, `eq`, et cetera. -The symbol for `foo::Bar::panic_please` would look like this: +The RFC therefore proposes to keep symbol names close to how methods are represented in error messages, that is: -``` -_RMN15mycrate_4a3b56d3foo3BarIcEE12panic_please$downstream_crate_x_abcd1234 - <----------------------------><------------><--------------------------> - self-type method name instantiating crate -``` +- `Foo::some_method` for inherent methods, and +- ` as SomeTrait>::some_method` for trait methods. +This can be achieved by extending the definition of absolute paths that we have used so far. 
Instead of the path root always being a crate-id, we now also allow a path to start with a single type (i.e. the self-type of an inherent method) or with a pair of self-type and the trait being implemented. The kind of root is indicated by the first character of the `N` starting the path: -### Trait Methods +- a decimal digit signifies a path with a crate-id root (since crate-ids always start with a digit), +- an `M` signifies a path with a single type as its root, and +- an `X` signifies a path with a self-type/trait pair as its root. -Trait methods are similar to inherent methods, but in addition to the self-type the symbol name must also contain the trait being implemented: +Thus, this extended form of paths would have the following syntax: ``` -_RX [] [] -``` - -The `X` signifies that this is a trait method. The trait being implemented is encoded `N + [I+ E] E`, like a named type. Here is a complex example with generics in all the places: - -```rust -mod foo { - trait Foo { - fn id(x: T) -> T; - } -} + := N * [I E] E -mod bar { - struct Bar; -} - -mod baz { - impl Foo for Bar { - fn id(x: T) -> T { x } - } -} + := + | M + | X ``` -The mangling for ` as Foo>::id::` would be: +Here are some examples for complete symbol names: ``` -_RXN15mycrate_4a3b56d3foo3FooIiEEN15mycrate_4a3b56d3bar3BarIcEE2idIjE$downstream_crate_x_abcd1234 - <----------------------------><----------------------------><-><-><--------------------------> - trait self-type method instantitating crate +::foo => _RNXmN12mycrate_abcd3FooE3fooVE +mycrate::Foo::foo => _RNMN12mycrate_abcd3FooImEE3fooVE + as mycrate::Bar>::foo => _RNXN12mycrate_abcd3FooImEEN12mycrate_abcd3BarIyEE3fooVE ``` -One thing that's interesting here is that `baz`, the module the impl is situated in, does not show up anywhere in the mangled name. 
### Items Within Specialized Trait Impls @@ -273,7 +237,7 @@ The `X` here (or any other such nested definition) does not inherit the generic _RN15mycrate_4a3b56d3fooF1XFE ``` -However, there is at least one case where the type arguments *do* matter for a defintion like this, and that is when impl specialization is used. Consider the following piece of code: +However, there is at least one case where the type arguments *do* matter for a definition like this, and that is when trait specialization is used. Consider the following piece of code: ``` trait Foo { @@ -299,48 +263,25 @@ impl Foo for Bar { ``` -Notice that both `MSG` statics have the path `::foo::MSG` if you just leave off the type arguments. However, we also don't have any concrete types to substitute the arguments for. Therefore, we have to encode the type parameters and their bounds for cases like this so that the symbol name will be a mangling of something like ` as Foo>::foo::MSG where T: Clone`: - -``` -_RXI1TIN12std_a1b2c3d47default7DefaultEEEN15mycrate_4a3b56d3FooI1TEEN15mycrate_4a3b56d3BarI1TEE3foo3MSG - - -------------------------------------- "where clause" - -- T - ---------------------------------- bounds to T - -------------------------------- std::default::Default - --------------------------- Foo - --------------------------- Bar - foo ---- - MSG ---- - -``` - - -### Compiler-generated Items (Drop-Glue, Shims, etc) - -The compiler generates a number of things that can end up needing an entry in the symbol table: - - - Drop-glue is what recursively calls `Drop::drop()` for components of composite type. Generating symbol names for it is straightforward. They are of the form `_RG` where `` is the usual mangling as used for generic arguments. - - - Various "shims", that is, compiler-generated implementations of built-in traits like `Fn`, `FnMut`, `FnOnce`, or `Clone`, or for dynamic dispatch via trait objects. These are similar in structure to drop glue. 
Their precise mangling is specified in the reference-level explanation below. +Notice that both `MSG` statics have the path `::foo::MSG` if you just leave off the type arguments. However, we also don't have any concrete types to substitute the arguments for. Therefore, we have to disambiguate the `impls`. Since trait specialization is an unstable feature of Rust and the details are in flux, this RFC does not try to provide a mangling based on the `where` clauses of the specialized `impls`. Instead it proposes a scheme that re-uses the introduced numeric disambiguator form already used for macro hygiene and closures. Thus, conflicting `impls` would be disambiguated via an implementation defined suffix, as in `'1::foo::MSG` and `'2::foo::MSG`. This encoding introduces minimal additional syntax and can be replaced with something more human-readable once the definition of trait specialization is final. ### Unicode Identifiers -Rust allows unicode identifiers but our character set is restricted to ASCII alphanumerics, `_`, and `$`. In order to transcode the former to the latter, we use the same approach as Swift, which is: encode all identifiers via [Punycode][punycode], a standardized and efficient encoding that keeps encoded strings in a rather human-readable format. So for example, the string +Rust allows Unicode identifiers but our character set is restricted to ASCII alphanumerics, and `_`. In order to transcode the former to the latter, we use the same approach as Swift, which is: encode all non-ascii identifiers via [Punycode][punycode], a standardized and efficient encoding that keeps encoded strings in a rather human-readable format. So for example, the string -```rust +``` "Gödel, Escher, Bach" ``` is encoded as -```rust +``` "Gdel, Escher, Bach-d3b" ``` which, as opposed to something like _Base64_, still gives a pretty good idea of what the original string looked like. -Each component of a name, i.e. 
anything that starts with the number of bytes to read in the examples above, is encoded individually. Components encoded this way also start with the number of bytes to read, but that number is prefixed with a `0`. As an example, the function: +Each component of a name, i.e. anything that starts with the number of bytes to read in the examples above, is encoded individually. Components encoded this way are augmented with a `u` suffix so that demanglers know that the identifier needs further decoding. As an example, the function: ```rust mod gödel { @@ -353,9 +294,9 @@ mod gödel { would be mangled as: ``` -_RN15mycrate_4a3b56d08gdel_5qa6escher4bachFE` +_RN15mycrate_4a3b56d8gdel_5qau6escher4bachVE` <--------> - unicode component + Unicode component ``` ### Compression/Substitution @@ -386,9 +327,9 @@ std::iter::Chain< std::iter::Once<(u32, usize)>> ``` -It would make for a symbol name if this types is used (maybe repeatedly) as a generic argument somewhere. C++ has the same problem with its templates; which is why the Itanium mangling introduces the concept of compression. If a component of a definition occurs more than once, it will not be repeated and instead be emitted as a substitution marker that allows to reconstruct which component it refers to. The scheme proposed here will use the same approach. +It would make for a long symbol name if this type is used (maybe repeatedly) as a generic argument somewhere. C++ has the same problem with its templates, which is why the Itanium mangling introduces the concept of compression. If a component of a definition occurs more than once, it will not be repeated and instead be emitted as a substitution marker that makes it possible to reconstruct which component it refers to. The scheme proposed here will use the same approach. 
-The exact scheme will be described in detail in the reference level explanation below but it roughly works as follows: As a mangled symbol name is being built or parsed, we build up a dictionary of "substitutions", that is we keep track of things a subsequent occurrence of which could be replaced by a substitution marker. The substitution marker is then the lookup key into this dictionary. The things that are eligible for substitution are (1) all prefixes of qualified names (including the entire name itself) and (2) all types except for basic types. If a substitutable item is already present in the dictionary it does not generate a new key. Here's an example in order to illustrate the concept: +The exact scheme will be described in detail in the reference level explanation below but it roughly works as follows: As a mangled symbol name is being built or parsed, we build up a dictionary of "substitutions", that is we keep track of things a subsequent occurrence of which could be replaced by a substitution marker. The substitution marker is then the lookup key into this dictionary. The things that are eligible for substitution are (1) all prefixes of absolute paths (including the entire path itself) and (2) all types except for basic types. If a substitutable item is already present in the dictionary it does not generate a new key. Here's an example in order to illustrate the concept: ``` @@ -417,7 +358,7 @@ A couple of things to note: - The first occurrence of a dictionary entry is never substituted. We don't store the dictionary anywhere and need to be able to reconstruct it from the compressed version. - Longer substitutions are preferred to shorter ones. `std::iter::Chain<$1::Zip<$0::vec::IntoIter, $4::IntoIter>>` would also decompress to the original version but the compiler is supposed to always pick the longest substitution available. -The mangled version of a substitution marker is `S _` (and `S_` for key `0`) like it in Itanium mangling. 
So the above definition would be mangled to: +The mangled version of a substitution marker is `S _` (and `S_` for key `0`) like in the Itanium mangling. So the above definition would be mangled to: ``` _RN12std_a1b2c3d44iter5ChainINS0_3ZipINS_3vec8IntoIterIjEES5_EEE @@ -431,21 +372,124 @@ _RN12std_a1b2c3d44iter5ChainIN12std_a1b2c3d44iter3ZipIN12std_a1b2c3d43vec8IntoIt # Reference-level explanation [reference-level-explanation]: #reference-level-explanation -This is the technical portion of the RFC. Explain the design in sufficient detail that: - -- Its interaction with other features is clear. -- It is reasonably clear how the feature would be implemented. -- Corner cases are dissected by example. +The reference-level explanation consists of three parts: + +1. A specification of the syntax mangled names conform to. +2. A specification of the compression scheme. +3. A mapping of Rust entities to the mangling syntax. + +For implementing a demangler, only the first two sections are needed, that is, a +demangler only needs to understand syntax and compression of names, but it does +not have to care how the compiler generates mangled names. + + +## Syntax Of Mangled Names + +The syntax of mangled names is given in extended Backus-Naur form: + + - Non-terminals are within angle brackets (as in ``) + - Terminals are within quotes (as in `"_R"`), + - Optional parts are in brackets (as in `[]`), + - Repetition (zero or more times) is signified by curly braces (as in `{ }`) + + +``` +// The specifies the encoding version. + := "_R" [] [] + + := "N" [] "E" + | + + := + | + | + + := + | "M" + | "X" [] + +// The is the length of the identifier in bytes. +// must not start with a decimal digit. +// If the "u" is present then is Punycode-encoded. + := ["u"] ["V"|"C"] [] + + := + | // named type + | "A" [] // [T; N] + | "T" {} "E" // (T1, T2, T3, ...) 
+ | "R" // &T + | "Q" // &mut T + | "P" // *const T + | "O" // *mut T + | "G" "E" // generic parameter name + | + | + + := "a" // i8 + | "b" // bool + | "c" // char + | "d" // f64 + | "e" // str + | "f" // f32 + | "h" // u8 + | "i" // isize + | "j" // usize + | "l" // i32 + | "m" // u32 + | "n" // i128 + | "o" // u128 + | "s" // i16 + | "t" // u16 + | "u" // () + | "v" // ... + | "x" // i64 + | "y" // u64 + | "z" // ! + +// If the "U" is present then the function is `unsafe`. +// If the "J" is present then it is followed by the return type of the function. + := "F" ["U"] [] {} ["J" ] "E" + + := "K" ( + "d" | // Cdecl + "s" | // Stdcall + "f" | // Fastcall + "v" | // Vectorcall + "t" | // Thiscall + "a" | // Aapcs + "w" | // Win64 + "x" | // SysV64 + "k" | // PtxKernel + "m" | // Msp430Interrupt + "i" | // X86Interrupt + "g" | // AmdGpuKernel + "c" | // C + "x" | // System + "r" | // RustCall + "j" | // RustInstrinsic + "p" | // PlatformInstrinsic + "u" // Unadjusted + ) + + := "s" [] "_" + + := "I" {} "E" + + := "S" [] "_" + +// We use here, so that we don't have to add a special for +// compression. In practice, only crate-id is expected. + := +``` -The section should return to the examples given in the previous section, and explain more fully how the detailed proposal makes those examples work. # Drawbacks [drawbacks]: #drawbacks Why should we *not* do this? -- The scheme is rather complex, especially due to compression (albeit not more complex than prior art) -- The current/legacy scheme based on symbol-hashes is flexible in that hashes can be changed at will. That is, the unstable part of the current scheme mangling is nicely contained and does not keep breaking external tools. The danger of breakage is greater with the scheme proposed here because it exposes more information. 
+- The scheme is rather complex, especially due to compression (albeit less complex than prior art) +- The current/legacy scheme based on symbol-hashes is flexible in that hashes can be changed at will. That is, the unstable part of the current mangling scheme is nicely contained and does not keep breaking external tools. The danger of breakage is greater with the scheme proposed here because it exposes more information. # Rationale and alternatives @@ -454,9 +498,10 @@ Why should we *not* do this? The alternatives considered are: - Keeping the current scheme. It does meet the minimum requirements after all. It also has pretty big downsides. - - Keeping the current scheme but cleaning it up by making the non-hash part more consistent and more expressive. Keep the hash part as a safe guard against symbol conflicts and the rest as something just for demangling. The downside of this is that the hash would still not be predictable, and symbols would get rather long if they should contain more human-readable information about generic arguments. + - Keeping the current scheme but cleaning it up by making the non-hash part more consistent and more expressive. Keep the hash part as a safeguard against symbol conflicts and the rest as something just for demangling. The downside of this is that the hash would still not be predictable, and symbols would get rather long if they should contain more human-readable information about generic arguments. - Define a standardized pretty-printing format for things that end up as symbols, and then encode that via Punycode in order to meet the character set restrictions. This would be rather simple. Symbol names would remain somewhat human-readable (but not very, because all separators would be stripped out). But without some kind of additional compression, symbol names would become rather long. - Use the scheme from the previous bullet point but apply the compression scheme described above. 
We could do this but it wouldn't really be less complex than the Itanium inspired scheme proposed above. + - Define a standardized pretty-printing format for things that end up as symbols, compress with zstd (specially trained for Rust symbols) and encode the result as base63. This is rather simple but loses all human-readability. It's unclear how well this would compress. It would pull the zstd specification into the mangling scheme specification, as well as the pre-training dictionary. The Itanium mangling (and by extension the scheme proposed here) could be considered somewhat arcane. But it is well-known from C++ and provides a good trade-off between readability, complexity, and length of generated symbols. @@ -483,12 +528,6 @@ Itanium mangling). # Unresolved questions [unresolved-questions]: #unresolved-questions -- Should we introduce a `St` substitution for the `::std::` to the compression scheme (like Itanium does). This would preclude mixing symbols from different versions of the standard library into a single binary (we'd not have a crate disambiguator for libstd). It's unclear whether that can occur in practice. -- Similar to the above, common items, like built-in bounds, could get predefined abbreviations. -- Is the compression scheme unambiguous? That is, is it always clear which substitutions the compiler should choose? (a reference implementation of the algorithm will solve this) -- Is the scheme for disambiguating specialized impls sound? -- Should symbols include information that might help during debugging/analyzing a program but that is not strictly necessary for avoiding name conflicts? Examples of such information would be names and types of function parameters or the ABI of functions. -- Should named items (everything of the form `N...E`) *not* start with `N` but instead with something that gives a hint of what it is? E.g. `F` for functions, `S` for statics, `C` for closures, etc? 
This is not needed for disambiguation but it would add more information to the symbol name without really increasing the complexity of the scheme or the length of names. (Although it makes compression a bit less straightforward to describe.) # Appendix - Interesting Examples From 0d632ca45add0afcee74ec2d0d15c7db68393123 Mon Sep 17 00:00:00 2001 From: Michael Woerister Date: Thu, 22 Nov 2018 17:50:06 +0100 Subject: [PATCH 03/18] Add sections on Punycode identifiers, compression, and suggested demangling format. --- text/0000-symbol-name-mangling-v2.md | 74 +++++++++++++++++++++++++++- 1 file changed, 73 insertions(+), 1 deletion(-) diff --git a/text/0000-symbol-name-mangling-v2.md b/text/0000-symbol-name-mangling-v2.md index 63106421153..3c27d802e14 100644 --- a/text/0000-symbol-name-mangling-v2.md +++ b/text/0000-symbol-name-mangling-v2.md @@ -482,6 +482,63 @@ The syntax of mangled names is given in extended Backus-Naur form: := ``` +### Punycode Identifiers + +Punycode generates strings of the form `([[:ascii:]]+-)?[[:alnum:]]+`. This is problematic for two reasons: + +- Generated strings can contain a `-` character, which is not in the supported character set. +- Generated strings can start with a digit, which makes them clash with the byte-count prefix of the `` production. + +For these reasons, vanilla Punycode strings are further encoded during mangling: + +- The `-` character is simply replaced by a `_` character. +- The part of the Punycode string that encodes the non-ASCII characters is a base-36 number, using `[a-z0-9]` as its "digits". We want to get rid of the decimal digits in there, so we simply remap `0-9` to `A-J`. 
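The two post-processing steps above can be sketched in Rust as follows. This is a minimal illustration under stated assumptions: the input is taken to be an already valid Punycode string produced elsewhere (the standard library has no Punycode encoder), the function name `postprocess_punycode` is hypothetical, and the last `-` in the input, if there is one, is assumed to be the Punycode delimiter.

```rust
/// Post-processes an already Punycode-encoded identifier (hypothetical helper):
/// the delimiter `-` becomes `_`, and decimal digits in the part that encodes
/// the non-ASCII characters are remapped from `0-9` to `A-J`.
fn postprocess_punycode(puny: &str) -> String {
    // Split at the last `-`; without a delimiter the whole string is the
    // encoded (base-36) part.
    let (basic, encoded) = match puny.rfind('-') {
        Some(pos) => (Some(&puny[..pos]), &puny[pos + 1..]),
        None => (None, puny),
    };

    let mut out = String::with_capacity(puny.len());
    if let Some(basic) = basic {
        // The ASCII part is copied unchanged, followed by the replaced delimiter.
        out.push_str(basic);
        out.push('_');
    }
    for c in encoded.chars() {
        match c {
            // '0' -> 'A', '1' -> 'B', ..., '9' -> 'J'
            '0'..='9' => out.push((b'A' + (c as u8 - b'0')) as char),
            _ => out.push(c),
        }
    }
    out
}
```

Because digits are only remapped after the delimiter, any digits in the ASCII part of an identifier stay as they are.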
+ +Here are some examples: + +| Original | Punycode | Punycode + Encoding | +|-----------------|-----------------|---------------------| +| føø | f-5gaa | f_Fgaa | +| α_ω | _-ylb7e | __ylbHe | +| 铁锈 | n84amf | nIEamf | +| 🤦 | fq9h | fqJh | +| ρυστ | 2xaedc | Cxaedc | + +With this post-processing in place the Punycode strings can be treated like regular identifiers and need no further special handling. + + +## Compression + +The compression algorithm is defined in terms of the AST: Starting at the root, recursively substitute each child node with its compressed version. A node is compressed by replacing it with a `` node from the dictionary (which the dictionary will contain if an *equivalent* node has already been encountered) or, if the dictionary doesn't contain a matching substitution, by recursively applying compression to all child nodes and then adding the current node to the dictionary. + +Things to note: + +- Child nodes have to be compressed in the same order in which they lexically occur in the mangled name. Processing order matters because it defines which substitution indices are allocated for which node. + +- Nodes are "equivalent" if they result in the *same demangling*. Usually that means that equivalence can be tested by just comparing the sub-tree that the nodes are roots of. However, there are some *additional* equivalences that have to be considered when doing a dictionary lookup: + + - A `` node is equivalent to its `` child node if its `` child node is empty. + + - A `` node of the form `M ` is equivalent to its `` child node. + + - A `` node with a single `` child is equivalent to this child node. + +All productions that have a `` on their right-hand side are added to the substitution dictionary: ``, ``, and ``. The only exceptions are `` nodes that are a ``. Those are not added to the dictionary. Also, if there is a node `X` and there already is an equivalent node `Y` in the dictionary, `X` is not added either. 
For example, we don't add `` nodes with empty `` to the dictionary because it always already contains the `` child node equivalent to its parent ``. + + +TODO: add pseudo code implementation? + +## Decompression + + +### Note on Efficient Demangling + + +## Mapping Rust Items to Mangled Names + + + + # Drawbacks [drawbacks]: #drawbacks @@ -528,8 +585,23 @@ Itanium mangling). # Unresolved questions [unresolved-questions]: #unresolved-questions +# Appendix A - Suggested Demangling + +This RFC suggests that names are demangling to a form that matches Rust syntax as it is used in source code and compiler error messages: + +- Path components should be separated by `::`. + +- If the path root is a `` it should be printed as the crate name. If the context requires it for correctness, the crate disambiguator should be printed too, as in, for example, `std[a0b1c2d3]::collections::HashMap`. In this case `a0b1c2d3` would be the disambiguator. Usually, the disambiguator can be omitted for better readability. + +- If the path root is a trait impl, it should be printed as ``, like the compiler does in error messages. + +- The list of generic arguments should be demangled as ``. + +- Identifiers and trait impl path roots can have a numeric disambiguator (the `` production). The syntactic version of the numeric disambiguator maps to a numeric index. If the disambiguator is not present, this index is 0. If it is of the form `s_` then the index is 1. If it is of the form `s_` then the index is ` + 2`. The suggested demangling of a disambiguator is `'`. However, for better readability, these disambiguators should usually be omitted in the demangling altogether. Disambiguators with index zero can always emitted. + The exception here are closures. Since these do not have a name, the disambiguator is the only thing identifying them. The suggested demangling for closures is thus `{closure}'`. 
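The mapping from a disambiguator to its numeric index described above can be sketched as follows. The function name `disambiguator_index` is hypothetical, and the digit alphabet is an assumption: the sketch takes the radix as a parameter (for example 16 for hexadecimal digits) and relies on the standard library's `u64::from_str_radix`, which only supports radixes from 2 to 36.

```rust
/// Hypothetical helper: maps an optional disambiguator to its index.
/// `radix` must be in the range 2..=36; the alphabet is an assumption.
fn disambiguator_index(disambiguator: Option<&str>, radix: u32) -> Option<u64> {
    let text = match disambiguator {
        // No disambiguator present: the index is 0.
        None => return Some(0),
        Some(text) => text,
    };
    // A disambiguator has the shape "s" [digits] "_".
    let digits = text.strip_prefix('s')?.strip_suffix('_')?;
    if digits.is_empty() {
        // The bare form "s_" maps to index 1.
        return Some(1);
    }
    // Reject anything that is not a digit in the assumed radix
    // (from_str_radix alone would also accept a leading '+').
    if !digits.chars().all(|c| c.is_digit(radix)) {
        return None;
    }
    // "s<digits>_" maps to <digits> + 2.
    u64::from_str_radix(digits, radix).ok()?.checked_add(2)
}
```

A demangler following the suggestion above would print this index only where it is needed, for instance for closures.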
+ -# Appendix - Interesting Examples +# Appendix B - Interesting Examples TODO - specializing impls From bad4c9091c6717ac4562961019f16f0d824a2847 Mon Sep 17 00:00:00 2001 From: Michael Woerister Date: Mon, 26 Nov 2018 12:19:43 +0100 Subject: [PATCH 04/18] Update symbol syntax. --- text/0000-symbol-name-mangling-v2.md | 102 +++++++++++++-------------- 1 file changed, 51 insertions(+), 51 deletions(-) diff --git a/text/0000-symbol-name-mangling-v2.md b/text/0000-symbol-name-mangling-v2.md index 3c27d802e14..5e8a0449831 100644 --- a/text/0000-symbol-name-mangling-v2.md +++ b/text/0000-symbol-name-mangling-v2.md @@ -391,66 +391,66 @@ The syntax of mangled names is given in extended Backus-Naur form: - Terminals are within quotes (as in `"_R"`), - Optional parts are in brackets (as in `[]`), - Repetition (zero or more times) is signified by curly braces (as in `{ }`) + - Comments are marked with `//`. +Mangled names conform to the following grammar: ``` // The specifies the encoding version. - := "_R" [] [] + = "_R" [] [] - := "N" [] "E" - | + = "N" [] "E" + | - := - | - | - - := - | "M" - | "X" [] + = + | "M" + | "X" [] + | + | // The is the length of the identifier in bytes. // is must not start with a decimal digit. // If the "u" is present then is Punycode-encoded. - := ["u"] ["V"|"C"] [] - - := - | // named type - | "A" [] // [T; N] - | "T" {} "E" // (T1, T2, T3, ...) - | "R" // &T - | "Q" // &mut T - | "P" // *const T - | "O" // *mut T - | "G" "E" // generic parameter name - | - | - - := "a" // i8 - | "b" // bool - | "c" // char - | "d" // f64 - | "e" // str - | "f" // f32 - | "h" // u8 - | "i" // isize - | "j" // usize - | "l" // i32 - | "m" // u32 - | "n" // i128 - | "o" // u128 - | "s" // i16 - | "t" // u16 - | "u" // () - | "v" // ... - | "x" // i64 - | "y" // u64 - | "z" // ! + = ["u"] ["V"|"C"] [] + + = + | // named type + | "A" [] // [T; N] + | "T" {} "E" // (T1, T2, T3, ...) 
+ | "R" // &T + | "Q" // &mut T + | "P" // *const T + | "O" // *mut T + | "G" "E" // generic parameter name + | + | + + = "a" // i8 + | "b" // bool + | "c" // char + | "d" // f64 + | "e" // str + | "f" // f32 + | "h" // u8 + | "i" // isize + | "j" // usize + | "l" // i32 + | "m" // u32 + | "n" // i128 + | "o" // u128 + | "s" // i16 + | "t" // u16 + | "u" // () + | "v" // ... + | "x" // i64 + | "y" // u64 + | "z" // ! // If the "U" is present then the function is `unsafe`. // If the "J" is present then it is followed by the return type of the function. := "F" ["U"] [] {} ["J" ] "E" - := "K" ( + = "K" ( "d" | // Cdecl "s" | // Stdcall "f" | // Fastcall @@ -469,16 +469,16 @@ The syntax of mangled names is given in extended Backus-Naur form: "j" | // RustInstrinsic "p" | // PlatformInstrinsic "u" // Unadjusted - ) + ) - := "s" [] "_" + = "s" [] "_" - := "I" {} "E" + = "I" {} "E" - := "S" [] "_" + = "S" [] "_" -// We use here, so that we don't have to add a special for -// compression. In practice, only crate-id is expected. +// We use here, so that we don't have to add a special rule for +// compression. In practice, only is expected. := ``` From 40037cfec2cef27ff9aba59be317dd2af15c488f Mon Sep 17 00:00:00 2001 From: Michael Woerister Date: Mon, 26 Nov 2018 12:20:07 +0100 Subject: [PATCH 05/18] Update description of compression algorithm. --- text/0000-symbol-name-mangling-v2.md | 72 ++++++++++++++++++++++++---- 1 file changed, 63 insertions(+), 9 deletions(-) diff --git a/text/0000-symbol-name-mangling-v2.md b/text/0000-symbol-name-mangling-v2.md index 5e8a0449831..fa5f5ade646 100644 --- a/text/0000-symbol-name-mangling-v2.md +++ b/text/0000-symbol-name-mangling-v2.md @@ -509,24 +509,78 @@ With this post-processing in place the Punycode strings can be treated like regu ## Compression -The compression algorithm is defined in terms of the AST: Starting at the root, recursively substitute each child node with its compressed version. 
A node is compressed by replacing it with a `` node from the dictionary (which the dictionary will contain if an *equivalent* node has already been encountered) or, if the dictionary doesn't contain a matching substitution, recursively apply compression to all child nodes and then add the current node to the dictionary. +From a high-level perspective symbol name compression works by replacing parts of the mangled name that have already been seen with a substitution marker identifying the already seen part. Which parts are eligible for substitution is defined via the AST of the name (as described in the previous section). Let's define some terms first: -Things to note: +- Two AST nodes are *equivalent* if they contain the same information. In general this means that two nodes are equivalent if the sub-trees they are the root of are equal. However, there is another condition that can make two nodes equivalent. If a node `N` has a single child node `C` and `N` does not itself add any new information, then `N` and `C` are equivalent too. The exhaustive list of these special cases is: + + - `` nodes without a `` child. These are equivalent to their `` child node. + + - `` nodes with a single `` child. These are equivalent to their child node. + + - `` nodes with a single `` child. These too are equivalent to their child node. + + Equivalence is transitive, so given, for example, an AST of the form + + ``` + + | + v + + | + v + + ``` + + then the `` node is equivalent to the `` node. + + - A *substitutable* AST node is any node with a `` on the right-hand side of the production. Thus the exhaustive list of substitutable node types is: ``, ``, and ``. There is one exception to this rule: nodes that are *equivalent* to a `` node, are not *substitutable*. + + - The "substitution dictionary" is a mapping from *substitutable* AST nodes to integer indices. + +Given these definitions, compression is defined as follows. + + - Initialize the substitution dictionary to be empty. 
+ - Traverse and modify the AST as follows: + - When encountering a substitutable node `N` there are two cases + 1. If the substitution dictionary already contains an *equivalent* node, replace the current node `N` with a `` that encodes the substitution index taken from the dictionary. + 2. Else, continue traversing through the child nodes of the current node. After the child nodes have been traversed, and if the dictionary does not yet contain an *equivalent* node, then allocate the next unused substitution index and add it to the substitution dictionary with `N` as its key. + +The following gives an example of substitution index assignment and node replacements for `foo::Bar::quux` (with `quux` being an inherent method of `foo::Bar`). `#n` designates that the substitution index `n` was assigned to the given node and `:= #n` designates that it is replaced with a ``: + + +``` + + | + #3 + / \ + #2 + / \ | + := #1 + / | + + | | + + | / \ + #1 + / \ / + #0 + | + +``` + +Some interesting things to note in this example: + + - There are substitutable nodes that are not replaced, nor added to the dictionary. This falls out of the equivalence rule. The node marked with `#1` is equivalent to its three immediate ancestors, so no dictionary entries are generated for those. + + - The `` node marked with `:= #1` is replaced by `#1`, which is not a `` but a (equivalent) ``. This is OK and prescribed by the algorithm. The definition of equivalence ensures that there is only one valid way to construct a `` node from a `` node. -- Child nodes have to be compressed in the same order in which they lexically occur in the mangled name. Processing order matters because it defines which substitution indices are allocated for which node. -- Nodes are "equivalent" if they result in the *same demangling*. Usually that means that equivalence can be tested by just comparing the sub-tree that the nodes are roots of. 
However, there are some *additional* equivalences that have to be considered when doing a dictionary lookup: - - A `` node is equivalent to its `` child node if its `` child node is empty. - - A `` node of the from `M ` is equivalent to its `` child node. - - A `` node with a single `` child is equivalent to this child node. -All productions that have a `` on their right-hand side are added to the substitution dictionary: ``, ``, and ``. The only exception are `` nodes that are a ``. Those are not added to the dictionary. Also, if there is a node `X` and there already is an equivalent node `Y` in the dictionary, `X` is not added either. For example, we don't add `` nodes with empty `` to the dictionary because it always already contains the `` child node equivalent to its parent ``. -TODO: add pseudo code implementation? ## Decompression From 836fc864526687d18c1996919d2f0602bbdc0413 Mon Sep 17 00:00:00 2001 From: Michael Woerister Date: Mon, 26 Nov 2018 16:14:26 +0100 Subject: [PATCH 06/18] Switch to from hex to base-62 as the non-decimal encoding. --- text/0000-symbol-name-mangling-v2.md | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/text/0000-symbol-name-mangling-v2.md b/text/0000-symbol-name-mangling-v2.md index fa5f5ade646..d63c7cc91d2 100644 --- a/text/0000-symbol-name-mangling-v2.md +++ b/text/0000-symbol-name-mangling-v2.md @@ -162,6 +162,7 @@ _RN12std_a1b2c3d43mem8align_ofVIjEE12foo_a1b2c3d4 (for crate "foo/a1b2c3d4") _RN12std_a1b2c3d43mem8align_ofVIjEE12bar_11223344 (for crate "bar/11223344") ``` + ### Closures and Closure Environments The scheme needs to be able to generate symbol names for the function containing the code of a closure and it needs to be able to refer to the type of a closure if it occurs as a type argument. As closures don't have a name, we need to generate one. The scheme proposes to use the namespace and disambiguation mechanisms already introduced above for this purpose. 
Closures get their own "namespace" (i.e. they are neither in the type nor the value namespace), and each closure has an empty name with a disambiguation index (like for macro hygiene) identifying them within their parent. The full name of a closure is then constructed like for any other named item: @@ -180,6 +181,7 @@ mod foo { In the above example we have two closures, the one assigned to `a` and the one assigned to `b`. The first one would get the local name `0C` and the second one the name `0Cs_`. The `0` signifies then length of their (empty) name. The `C` is the namespace tag, analogous to the `V` tag for the value namespace. The `s_` for the second closure is the disambiguation index (index `0` is, again, encoded by not appending a suffix). Their full names would then be `N15mycrate_4a3b56d3foo3barV0CE` and `N15mycrate_4a3b56d3foo3barV0Cs_E` respectively. + ### Methods Methods are nested within `impl` or `trait` items. As such it would be possible construct their symbol names as paths like `my_crate::foo::{{impl}}::some_method` where `{{impl}}` somehow identifies the the `impl` in question. Since `impl`s don't have names, we'd have to use an indexing scheme like the one used for closures (and indeed, this is what the compiler does internally). Adding in generic arguments to, this would lead to symbol names looking like `my_crate::foo::impl'17::::some_method`. @@ -265,6 +267,7 @@ impl Foo for Bar { Notice that both `MSG` statics have the path `::foo::MSG` if you just leave off the type arguments. However, we also don't have any concrete types to substitute the arguments for. Therefore, we have to disambiguate the `impls`. Since trait specialization is an unstable feature of Rust and the details are in flux, this RFC does not try to provide a mangling based on the `where` clauses of the specialized `impls`. Instead it proposes a scheme that re-uses the introduced numeric disambiguator form already used for macro hygiene and closures. 
Thus, conflicting `impls` would be disambiguated via an implementation defined suffix, as in `'1::foo::MSG` and `'2::foo::MSG`. This encoding introduces minimal additional syntax and can be replaced with something more human-readable once the definition of trait specialization is final. + ### Unicode Identifiers Rust allows Unicode identifiers but our character set is restricted to ASCII alphanumerics, and `_`. In order to transcode the former to the latter, we use the same approach as Swift, which is: encode all non-ascii identifiers via [Punycode][punycode], a standardized and efficient encoding that keeps encoded strings in a rather human-readable format. So for example, the string @@ -378,9 +381,7 @@ The reference-level explanation consists of three parts: 2. A specification of the compression scheme. 3. A mapping of Rust entities to the mangling syntax. -For implementing a demangler, only the first to sections are needed, that is, a -demangler only needs to understand syntax and compression of names, but it does -not have to care how the compiler generates mangled names. +For implementing a demangler, only the first two sections are of interest, that is, a demangler only needs to understand syntax and compression of names, but it does not have to care about how the compiler generates mangled names. ## Syntax Of Mangled Names @@ -471,11 +472,11 @@ Mangled names conform to the following grammar: "u" // Unadjusted ) - = "s" [] "_" + = "s" [] "_" = "I" {} "E" - = "S" [] "_" + = "S" [] "_" // We use here, so that we don't have to add a special rule for // compression. In practice, only is expected. From 61c0317f8e4c4e2d46600204955dcefcfd727b00 Mon Sep 17 00:00:00 2001 From: Michael Woerister Date: Mon, 26 Nov 2018 16:15:14 +0100 Subject: [PATCH 07/18] Add decompression and Rust entity to AST mapping. 
--- text/0000-symbol-name-mangling-v2.md | 47 ++++++++++++++++++++++++---- 1 file changed, 41 insertions(+), 6 deletions(-) diff --git a/text/0000-symbol-name-mangling-v2.md b/text/0000-symbol-name-mangling-v2.md index d63c7cc91d2..2cd03d3831a 100644 --- a/text/0000-symbol-name-mangling-v2.md +++ b/text/0000-symbol-name-mangling-v2.md @@ -543,8 +543,8 @@ Given these definitions, compression is defined as follows. - Initialize the substitution dictionary to be empty. - Traverse and modify the AST as follows: - When encountering a substitutable node `N` there are two cases - 1. If the substitution dictionary already contains an *equivalent* node, replace the current node `N` with a `` that encodes the substitution index taken from the dictionary. - 2. Else, continue traversing through the child nodes of the current node. After the child nodes have been traversed, and if the dictionary does not yet contain an *equivalent* node, then allocate the next unused substitution index and add it to the substitution dictionary with `N` as its key. + 1. If the substitution dictionary already contains an *equivalent* node, replace the children of `N` with a `` that encodes the substitution index taken from the dictionary. + 2. Else, continue traversing through the child nodes of `N`. After the child nodes have been traversed and if the dictionary does not yet contain an *equivalent* node, then allocate the next unused substitution index and add it to the substitution dictionary with `N` as its key. The following gives an example of substitution index assignment and node replacements for `foo::Bar::quux` (with `quux` being an inherent method of `foo::Bar`). `#n` designates that the substitution index `n` was assigned to the given node and `:= #n` designates that it is replaced with a ``: @@ -573,26 +573,61 @@ Some interesting things to note in this example: - There are substitutable nodes that are not replaced, nor added to the dictionary. This falls out of the equivalence rule. 
The node marked with `#1` is equivalent to its three immediate ancestors, so no dictionary entries are generated for those. - - The `` node marked with `:= #1` is replaced by `#1`, which is not a `` but a (equivalent) ``. This is OK and prescribed by the algorithm. The definition of equivalence ensures that there is only one valid way to construct a `` node from a `` node. + - The `` node marked with `:= #1` is replaced by `#1`, which is not a `` but an (equivalent) ``. This is OK and prescribed by the algorithm. The definition of equivalence ensures that there is only one valid way to construct a `` node from a `` node. +## Decompression + +Decompression works analogously to compression: + - Initialize the substitution dictionary to be empty. + - Traverse and modify the AST as follows: + - When encountering a substitutable node `N` there are two cases + 1. If the node has a single `` child, extract the substitution index from it and replace the node with the corresponding entry from the substitution dictionary. + 2. Else, continue traversing the child nodes of the current node. After the child nodes have been traversed, and if the dictionary does not yet contain an *equivalent* node, then allocate the next unused substitution index and add it to the substitution dictionary with `N` as its key. +This is what the example from above looks like for decompression: +``` + + | + #3 + / \ + #2 + / \ | + := #1 + / | + + | + + | + #1 + / \ + #0 + | + +``` +### A Note On Implementing Efficient Demangling +The mangling syntax is constructed in a way that allows for implementing an efficient demangler: + - Mangled names contain information in the same order as unmangled names are expected to contain it. Therefore, a demangler can directly generate its output while parsing the mangled form. There is no need to explicitly instantiate the AST in memory. -## Decompression + - The same is true for decompression. 
The demangler can keep a simple array that maps substitution indices to ranges in the already generated output. When it encounters a `` in need of expansion, it can just look up the corresponding range and do a simple `memcpy`. +Parsing, decompression, and demangling can thus be done in a single pass over the mangled name without the need to do dynamic allocation except for the dictionary array. -### Note on Efficient Demangling -## Mapping Rust Items to Mangled Names +## Mapping Rust Language Entities to Symbol Names +This RFC suggests the following mapping of Rust entities to mangled names: +- Free-standing named functions and types shall be represented by an `` production. +- Absolute paths should be rooted at the inner-most entity that can act as a path root. Roots can be crate-ids, types (for entities with an inherent impl in their path), and trait impls (for entities with trait impls in their path). +- The compiler is free to choose disambiguation indices for identifiers and trait impls that need disambiguation. The disambiguation index `0` is represented by omitting the `` production (which should be the common case). Disambiguation indices do not need to be densely packed. In particular, the compiler can use arbitrary hashes to disambiguate items (which is useful for supporting specializing trait impls). # Drawbacks From faeca2626af3264f185c6d5d03110941c89114af Mon Sep 17 00:00:00 2001 From: Michael Woerister Date: Mon, 26 Nov 2018 16:15:42 +0100 Subject: [PATCH 08/18] Clean up and Appendix B. --- text/0000-symbol-name-mangling-v2.md | 109 +++++++++++++++++++++++---- 1 file changed, 96 insertions(+), 13 deletions(-) diff --git a/text/0000-symbol-name-mangling-v2.md b/text/0000-symbol-name-mangling-v2.md index 2cd03d3831a..62fbf445096 100644 --- a/text/0000-symbol-name-mangling-v2.md +++ b/text/0000-symbol-name-mangling-v2.md @@ -644,11 +644,15 @@ Why should we *not* do this? The alternatives considered are: - - Keeping the current scheme. 
It does meet the minimum requirements after all. It also has pretty big downsides. - - Keeping the current scheme but cleaning it up by making the non-hash part more consistent and more expressive. Keep the hash part as a safeguard against symbol conflicts and the rest as something just for demangling. The downside of this is that the hash would still not be predictable, and symbols would get rather long if they should contain more human-readable information about generic arguments. - - Define a standardized pretty-printing format for things that end up as symbols, and then encode that via Punycode in order to meet the character set restrictions. This would be rather simple. Symbol names would remain somewhat human-readable (but not very, because all separators would be stripped out). But without some kind of additional compression, symbol names would become rather long. - - Use the scheme from the previous bullet point but apply the compression scheme described above. We could do this but it wouldn't really be less complex than the Itanium inspired scheme proposed above. - - Define a standardized pretty-printing format for things that end up as symbols, compress with zstd (specially trained for Rust symbols) and encode the result as base63. This is rather simple but loses all human-readability. It's unclear how well this would compress. It would pull the zstd specification into the mangling scheme specification, as well as the pre-training dictionary. + 1. Keeping the current scheme. It does meet the minimum requirements after all. However, the general consensus seems to be that it is confusing and leads to situations where people are unpleasantly surprised when they come across (demangled) symbol names in backtraces or profilers. + + 2. Keeping the current scheme but cleaning it up by making the non-hash part more consistent and more expressive. Keep the hash part as a safeguard against symbol conflicts and the rest as something just for demangling. 
The downside of this is that the hash would still not be predictable, and symbols would get rather long if they should contain more human-readable information about generic arguments. + + 3. Define a standardized pretty-printing format for things that end up as symbols, and then encode that via Punycode in order to meet the character set restrictions. This would be rather simple. Symbol names would remain somewhat human-readable (but not very, because all separators would be stripped out). But without some kind of additional compression, symbol names would become rather long. + + 4. Use the scheme from the previous bullet point but apply the compression scheme described above. We could do this but it wouldn't really be less complex than the Itanium-inspired scheme proposed above. + + 5. Define a standardized pretty-printing format for things that end up as symbols, compress with zstd (specially trained for Rust symbols) and encode the result as base63. This is rather simple but loses all human-readability. It's unclear how well this would compress. It would pull the zstd specification into the mangling scheme specification, as well as the pre-trained dictionary. The Itanium mangling (and by extension the scheme proposed here) could be considered somewhat arcane. But it is well-known from C++ and provides a good trade-off between readability, complexity, and length of generated symbols. @@ -677,7 +681,7 @@ Itanium mangling). # Appendix A - Suggested Demangling -This RFC suggests that names are demangling to a form that matches Rust syntax as it is used in source code and compiler error messages: +This RFC suggests that names are demangled to a form that matches Rust syntax as it is used in source code, compiler error messages and `rustdoc`: - Path components should be separated by `::`. @@ -687,14 +691,93 @@ This RFC suggests that names are demangling to a form that matches Rust syntax a - The list of generic arguments should be demangled as ``. 
-- Identifiers and trait impl path roots can have a numeric disambiguator (the `` production). The syntactic version of the numeric disambiguator maps to a numeric index. If the disambiguator is not present, this index is 0. If it is of the form `s_` then the index is 1. If it is of the form `s_` then the index is ` + 2`. The suggested demangling of a disambiguator is `'`. However, for better readability, these disambiguators should usually be omitted in the demangling altogether. Disambiguators with index zero can always emitted. - The exception here are closures. Since these do not have a name, the disambiguator is the only thing identifying them. The suggested demangling for closures is thus `{closure}'`. +- Identifiers and trait impl path roots can have a numeric disambiguator (the `` production). The syntactic version of the numeric disambiguator maps to a numeric index. If the disambiguator is not present, this index is 0. If it is of the form `s_` then the index is 1. If it is of the form `s_` then the index is ` + 2`. The suggested demangling of a disambiguator is `[]`. However, for better readability, these disambiguators should usually be omitted in the demangling altogether. Disambiguators with index zero can always emitted. + + The exception here are closures. Since these do not have a name, the disambiguator is the only thing identifying them. The suggested demangling for closures is thus `{closure}[]`. +- In a lossless demangling, identifiers from the value namespace should be marked with a `'` suffix in order to avoid conflicts with identifiers from the type namespace. In a user-facing demangling, where such conflicts are acceptable, the suffix can be omitted. # Appendix B - Interesting Examples -TODO - - specializing impls - - impl Trait - - closure environment as a type parameter - - various examples of compression +We assume that all examples are defined in a crate named `mycrate[xxx]`. 
+ + +### Free-standing Item + +```rust +mod foo { + mod bar { + fn baz() {} + } +} +``` +- unmangled: `mycrate::foo::bar::baz` +- mangled: `_RN3foo3bar3bazVE` + + +### Item Defined In Inherent Method + +```rust +struct Foo(T); + +impl Foo { + pub fn bar(_: U) { + static QUUX: u32 = 0; + // ... + } +} +``` +- unmangled: `mycrate::Foo::bar::QUUX` +- mangled: `_RNNM11mycrate_xxx3FooE3barV4QUUXVE` + + +### Item Defined In Trait Method + +```rust +struct Foo(T); + +impl Clone for Foo { + fn clone(_: U) { + static QUUX: u32 = 0; + // ... + } +} +``` +- unmangled: `::clone::QUUX` +- mangled: `_RNXN11mycrate_xxx3FooEN7std_yyy5clone5CloneE5cloneV4QUUXVE` + + +### Item Defined In Specializing Trait Impl +```rust +struct Foo(T); + +impl Clone for Foo { + default fn clone(_: U) { + static QUUX: u32 = 0; + // ... + } +} +``` +- unmangled: `[1234]::clone::QUUX` +- mangled: `_RNXN11mycrate_xxx3FooEN7std_yyy5clone5CloneEsjU_5cloneV4QUUXVE` + + +### Item Defined In Initializer Of A Static +```rust +pub static QUUX: u32 = { + static FOO: u32 = 1; + FOO + FOO +}; +``` +- unmangled: `mycrate::QUUX::FOO` +- mangled: `_RN11mycrate_xxx4QUUXV3FOOVE` + + +### Compressed Prefix Constructed From Prefix That Contains Substitution Itself +- unmangled: `std[xxx]::foo` +- mangled: `_RN7std_xxx3fooFINS_3barFENS1_3bazFEEE` + + +### Progressive type compression +- unmangled: `std[xxx]::foo<(std[xxx]::Bar,std[xxx]::Bar),(std[xxx]::Bar,std[xxx]::Bar)>` +- mangled: `_RN7std_xxx3fooITNS_3BarES1_ES2_EE` From 0fc3d010afe2d612e8b28ef52e2d0f188639bfda Mon Sep 17 00:00:00 2001 From: Michael Woerister Date: Tue, 27 Nov 2018 14:17:16 +0100 Subject: [PATCH 09/18] Corrections after some proof-reading. 
--- text/0000-symbol-name-mangling-v2.md | 124 +++++++++++++++------------ 1 file changed, 70 insertions(+), 54 deletions(-) diff --git a/text/0000-symbol-name-mangling-v2.md b/text/0000-symbol-name-mangling-v2.md index 62fbf445096..07458cbacc0 100644 --- a/text/0000-symbol-name-mangling-v2.md +++ b/text/0000-symbol-name-mangling-v2.md @@ -1,14 +1,14 @@ - Feature Name: symbol_name_mangling_v2 -- Start Date: 2018-10-01 +- Start Date: 2018-11-27 - RFC PR: (leave this empty) - Rust Issue: (leave this empty) # Summary [summary]: #summary -This RFC proposes a new mangling scheme that describes what the symbol names for everything generated by the Rust compiler look like. This new scheme has a number of advantages over the existing one which has grown over time without a clear direction. The new scheme is consistent, does not depend on compiler internals, and the information it stores in symbol names can be decoded again which provides an improved experience for users of external tools that work with Rust symbol names. The new scheme is based on the name mangling scheme from the [Itanium C++ ABI][itanium-mangling]. +This RFC proposes a new mangling scheme that describes what the symbol names generated by the Rust compiler. This new scheme has a number of advantages over the existing one which has grown over time without a clear direction. The new scheme is consistent, does not depend on compiler internals, and the information it stores in symbol names can be decoded again which provides an improved experience for users of external tools that work with Rust symbol names. The new scheme is based on the name mangling scheme from the [Itanium C++ ABI][itanium-mangling]. -Note that, at this point, the new mangling scheme would not be part of the language specification or the specification of a stable ABI for Rust code. 
In the future it could be part of both and it is designed to be stable and extensible but for the time being it would still be an implementation detail of the Rust compiler. +Note that, at this point, the new mangling scheme would not be part of the language specification or the specification of a stable Rust ABI. In the future it could be part of both and it is designed to be stable and extensible but for the time being it would still be an implementation detail of the Rust compiler. # Motivation [motivation]: #motivation @@ -18,7 +18,7 @@ number of drawbacks: - It depends on compiler internals and its results cannot be replicated by another compiler implementation or external tool. - Information about generic parameters and other things is lost in the mangling process. One cannot extract the type arguments of a monomorphized function from its symbol name. -- The current scheme is inconsistent: most paths use Itanium style encoding, but some of them don't. +- The current scheme is inconsistent: most paths use Itanium style encoding, but some don't. - The symbol names it generates can contain `.` characters which is not generally supported on all platforms. \[[1][gas]\]\[[2][lld-windows-bug]\] \[[3][thin-lto-bug]\] [gas]: https://sourceware.org/binutils/docs/as/Symbol-Names.html#Symbol-Names @@ -49,15 +49,15 @@ A symbol mangling scheme has a few goals, one of them essential, the rest of the - A mangled symbol should be *decodable* to some degree. That is, it is desirable to be able to tell which exact concrete instance of e.g. a polymorphic function a given symbol identifies. This is true for external tools, backtraces, or just people only having the binary representation of some piece of code available to them. With the current scheme, this kind of information gets lost in the magical hash-suffix. - - It should be possible to predict the symbol name for a given source-level construct. For example, given the definition `fn foo() { ... 
}`, the scheme should allow to construct, by hand, the symbol names for e.g. `foo` or `foo, ...) -> !>()`. Since the current scheme generates its hash from the values of various compiler internal data structures, not even an alternative compiler implementation could predicate the symbol name even for simple cases. + - It should be possible to predict the symbol name for a given source-level construct. For example, given the definition `fn foo() { ... }`, the scheme should allow to construct, by hand, the symbol names for e.g. `foo` or `foo, ...) -> !>()`. Since the current scheme generates its hash from the values of various compiler internal data structures, not even an alternative compiler implementation could predicate the symbol name, even for simple cases. - - A mangling should be platform-independent. This is mainly achieved by restricting the character set to `A-Z`, `a-z`, `0-9`, `_`. All other characters might have special meaning in some context (e.g. `.` for MSVC `DEF` files) or are simply not supported (e.g. Unicode). + - A mangling scheme should be platform-independent. This is mainly achieved by restricting the character set to `A-Z`, `a-z`, `0-9`, `_`. All other characters might have special meaning in some context (e.g. `.` for MSVC `DEF` files) or are simply not supported (e.g. Unicode). - - The scheme should be efficient, meaning that the symbols it produces are not unnecessarily long (because that takes up space in object files and means more work for compiler and linker) and that generating a symbol should not be too computationally expensive. + - The scheme should be efficient, meaning that the symbols it produces are not unnecessarily long (because that takes up space in object files and means more work for the compiler and the linker). In addition, generating or demangling a symbol name should not be too computationally expensive. 
The RFC also has a couple of non-goals: - - Source-level definitions can contain components that will not show up in symbol names, like lifetimes (as in `fn foo<'a>()`). This RFC does not define a mangling for cases like the above. One might want to cover them "for completeness" but they are not actually needed. + - Source-level definitions can contain components that will not show up in symbol names, like lifetimes (as in `fn foo<'a>()`). This RFC does not define a mangling for cases like these. One might want to cover them "for completeness" but they are not actually needed. - The mangling scheme does not try to be compatible with an existing C++ mangling scheme. While it might sound tempting to encode Rust symbols with an existing scheme, it is the author's opinion that the actual benefits are small (C++ tools would not demangle to Rust syntax, demanglings would be hard to read) and at the same time supporting a Rust-specific scheme in existing tools seems quite feasible (many tools like GDB, LLDB, binutils, and valgrind already have specialized code paths for Rust symbols). @@ -65,7 +65,7 @@ The RFC also has a couple of non-goals: ## The Mangling Scheme by Example -This section will develop an overview of the mangling scheme by walking through a number of examples. We'll start with the simplest case -- and see how that already involves things that might be surprising. +This section will develop an overview of the mangling scheme by walking through a number of examples. We'll start with the simplest case -- and will see how that already involves things that might be surprising. ### Free-standing Functions and Statics @@ -79,7 +79,7 @@ mod foo { has the path `foo::bar` and `N3foo3barE` is a mangling of that path that complies to the character set we are restricted to. Why this format with numbers embedded in it? It is the encoding that the [Itanium C++ ABI][itanium-mangling] name mangling scheme uses for "nested names" (i.e. paths). 
The scheme proposed here will also use this format because it does not need termination tokens for identifiers (which are hard to come by with our limited character set). -However, the symbol name above does not unambiguously identify the function in every context. It is perfectly valid for another crate to also define `mod foo { fn bar() {} }` somewhere. So in order to avoid conflicts in such cases, fully qualified names always include the crate name and disambiguator, as in `N15mycrate_4a3b56d3foo3barE` (the crate disambiguator is used to disambiguate different versions of the same crate. It is an existing concept and not introduced by this RFC). +However, the symbol name above does not unambiguously identify the function in every context. It is perfectly valid for another crate to also define `mod foo { fn bar() {} }` somewhere. So in order to avoid conflicts in such cases, the absolute path must always include the crate name and disambiguator, as in `N15mycrate_4a3b56d3foo3barE` (the crate disambiguator is used to disambiguate different versions of the same crate. It is an existing concept and not introduced by this RFC). There is another possible ambiguity that we have to take care of. Rust has two distinct namespaces: the type and the value namespace. This leads to a path of the form `crate_id::foo::bar` not uniquely identifying the item `bar` because the following snippet is legal Rust code: @@ -95,23 +95,23 @@ mod foo { The function `foo` lives in the value namespaces while the module `foo` lives in the type namespace. They don't interfere. In order to make the symbol names for the two distinct `bar` functions unique, we thus add a suffix to name components in the value namespace, so case one would get the symbol name `N15mycrate_4a3b56d3fooV3barVE` and case two get the name `N15mycrate_4a3b56d3foo3barVE` (notice the difference: `3fooV` vs `3foo`). -There is on final case of name ambiguity that we have to take care of. 
Because of macro hygiene multiple items with the same name can appear in the same context. The compiler internally disambiguates such names by augmenting them with a numeric index. For example, the first occurrence of the name `foo` within its parent is actually treated as `foo'0`, the second occurrence would be `foo'1`, the next `foo'2`, and so one. The mangling scheme will adopt this setup by appending a disambiguation suffix to each identifier with a non-zero index. So if macro expansion would result in the following code: +There is one final case of name ambiguity that we have to take care of. Because of macro hygiene, multiple items with the same name can appear in the same context. The compiler internally disambiguates such names by augmenting them with a numeric index. For example, the first occurrence of the name `foo` within its parent is actually treated as `foo'0`, the second occurrence would be `foo'1`, the next `foo'2`, and so on. The mangling scheme will adopt this setup by appending a disambiguation suffix to each identifier with a non-zero index. So if macro expansion would result in the following code: ```rust mod foo { - fn bar() {} - // The second `bar` function was introduce by macro expansion. - fn bar*() {} + fn bar'0() {} + // The second `bar` function was introduced by macro expansion. + fn bar'1() {} } ``` -Then we would encode the two functions symbols as `N15mycrate_4a3b56d3foo3barVE` and `N15mycrate_4a3b56d3foo3barVs_E` respectively (note the `s_` suffix). The details on the shape of this suffix are provided in the reference-level description. +Then we would encode the two functions' symbols as `N15mycrate_4a3b56d3foo3barVE` and `N15mycrate_4a3b56d3foo3barVs_E` respectively (note the `s_` suffix in the second case). Details on the shape of this suffix are provided in the reference-level description. 
As opposed to C++ and other languages that support function overloading, we don't need to include function parameter types in the symbol name. Rust does not allow two functions of the same name but different arguments. The final symbol name for the function would also include the prefix `_R` that is common to all symbol names generated by this scheme: ``` - _RN15mycrate_4a3b56d3foo3barVE` + _RN15mycrate_4a3b56d3foo3barVE <><--------------------------> | | @@ -122,7 +122,7 @@ prefix absolute path ### Generic Functions -Each monomorphization of a generic function has its own symbol name. The monomorphizations are disambiguated by the list of concrete generic arguments. These arguments are listed as suffix, starting with `I`, after the name they belong to. So the instance +Each monomorphization of a generic function has its own symbol name. The monomorphizations are disambiguated by the list of concrete generic arguments. These arguments are listed as a suffix, starting with `I`, after the name they belong to. So the instance ```rust std::mem::align_of:: @@ -139,7 +139,7 @@ _RN12std_a1b2c3d43mem8align_ofVIdEE f64 ``` -where `I` starts the list of arguments, `d` designates `f64` and `E` ends the argument list. As we can see, we need to be able to represent all kinds of types that can be part of such an argument list. (In the future we might also need to represent *values* when const generics get added to the language.) These kinds of types are: +where `I` starts the list of arguments, `d` designates `f64` and `E` ends the argument list. As we can see, we need to be able to represent all kinds of types that can be part of such an argument list. (In the future, when const generics get added to the language, we might also need to represent *values*) These kinds of types are: - basic types (`char`, `()`, `str`, `!`, `i8`, `i16`, ...) 
- reference and pointers types, shared and `mut` @@ -148,7 +148,7 @@ where `I` starts the list of arguments, `d` designates `f64` and `E` ends the ar - structs, enums, closures, and other named types, possibly with their own set of type arguments - function types such as `fn(&i32) -> u16` -Basic types are all encoded via a single lower-case letter, like in the Itanium scheme. Named types are encoded as their fully qualified name (plus arguments) like is done for function symbols. Composites like references, tuples, and function types all follow simple grammar given in the reference-level explanation below. Here are some example manglings to get a general feel of what they look like: +Basic types are all encoded via a single lower-case letter, like in the Itanium scheme. Named types are encoded as their absolute path (including arguments) like is done for function symbols. Composites like references, tuples, and function types all follow a simple grammar given in the reference-level explanation below. Here are some example manglings to get a general feel of what they look like: - `std::mem::align_of::`: `_RN12std_a1b2c3d43mem8align_ofVIjEE` - `std::mem::align_of::<&char>`: `_RN12std_a1b2c3d43mem8align_ofVIRcEE` @@ -158,8 +158,8 @@ Basic types are all encoded via a single lower-case letter, like in the Itanium There's one more thing we have to take into account for generic functions: The compiler may produce "crate-local" copies of a monomorphization. That is, if there is a function `foo` which gets used as `foo` in two different crates, the compiler (depending on the optimization level) might generate two distinct functions at the LLVM IR level, each with it's own symbol. In order to support this without running into conflicts, symbol names for monomorphizations must include the id of the crate they are instantiated for. This scheme does this by appending an `` suffix to the symbol. 
So for example the mangling for `std::mem::align_of::` would actually look like this: ``` -_RN12std_a1b2c3d43mem8align_ofVIjEE12foo_a1b2c3d4 (for crate "foo/a1b2c3d4") -_RN12std_a1b2c3d43mem8align_ofVIjEE12bar_11223344 (for crate "bar/11223344") +_RN12std_a1b2c3d43mem8align_ofVIjEE12foo_a1b2c3d4 (for crate "foo[a1b2c3d4]) +_RN12std_a1b2c3d43mem8align_ofVIjEE12bar_11223344 (for crate "bar[11223344]) ``` @@ -170,8 +170,8 @@ The scheme needs to be able to generate symbol names for the function containing ```rust mod foo { fn bar(x: u32) { - let a = |x| { x + 1 }; // ~ 0C - let b = |x| { x + 2 }; // ~ 0Cs_ + let a = |x| { x + 1 }; // local name: 0C + let b = |x| { x + 2 }; // local name: 0Cs_ a(b(x)) } @@ -179,14 +179,14 @@ mod foo { ``` -In the above example we have two closures, the one assigned to `a` and the one assigned to `b`. The first one would get the local name `0C` and the second one the name `0Cs_`. The `0` signifies then length of their (empty) name. The `C` is the namespace tag, analogous to the `V` tag for the value namespace. The `s_` for the second closure is the disambiguation index (index `0` is, again, encoded by not appending a suffix). Their full names would then be `N15mycrate_4a3b56d3foo3barV0CE` and `N15mycrate_4a3b56d3foo3barV0Cs_E` respectively. +In the above example we have two closures, the one assigned to `a` and the one assigned to `b`. The first one would get the local name `0C` and the second one the name `0Cs_`. The `0` signifies the length of their (empty) name. The `C` is the namespace tag, analogous to the `V` tag for the value namespace. The `s_` for the second closure is the disambiguation index (index `0` is, again, encoded by not appending a suffix). Their full names would then be `N15mycrate_4a3b56d3foo3barV0CE` and `N15mycrate_4a3b56d3foo3barV0Cs_E` respectively. ### Methods -Methods are nested within `impl` or `trait` items. 
As such it would be possible construct their symbol names as paths like `my_crate::foo::{{impl}}::some_method` where `{{impl}}` somehow identifies the the `impl` in question. Since `impl`s don't have names, we'd have to use an indexing scheme like the one used for closures (and indeed, this is what the compiler does internally). Adding in generic arguments to, this would lead to symbol names looking like `my_crate::foo::impl'17::::some_method`. +Methods are nested within `impl` or `trait` items. As such it would be possible to construct their symbol names as paths like `my_crate::foo::{{impl}}::some_method` where `{{impl}}` somehow identifies the `impl` in question. Since `impl`s don't have names, we'd have to use an indexing scheme like the one used for closures (and indeed, this is what the compiler does internally). Adding in generic arguments too, this would lead to symbol names looking like `my_crate::foo::impl'17::::some_method`. -However, in the opinion of the author these symbols are very hard to map back to the method they represent. Consider a module containing dozens of types, each with multiple `impl` blocks generated via `#[derive(...)]`. In order to find out what which method a symbol belongs to, one would have to count the number of handwritten and macro generated `impl` blocks in the module, and hope that one correctly guessed the number of `impl` blocks introduced by the given derive-macro (each macro invocation can introduce `0..n` such blocks). The name of the method might give a hint, but there are still likely to be dozens of methods named `clone`, `hash`, `eq`, et cetera. +However, in the opinion of the author these symbols are very hard to map back to the method they represent. Consider a module containing dozens of types, each with multiple `impl` blocks generated via `#[derive(...)]`. 
In order to find out which method a symbol maps to, one would have to count the number of handwritten and macro generated `impl` blocks in the module, and hope that one correctly guessed the number of `impl` blocks introduced by the given derive-macro (each macro invocation can introduce `0..n` such blocks). The name of the method might give a hint, but there are still likely to be dozens of methods named `clone`, `hash`, `eq`, et cetera. The RFC therefore proposes to keep symbol names close to how methods are represented in error messages, that is: @@ -199,7 +199,7 @@ This can be achieved by extending the definition of absolute paths that we have - an `M` signifies a path with a single type as its root, and - an `X` signifies a path with a self-type/trait pair as its root. -Thus, this extend form of paths would have the following syntax: +Thus, this extended form of paths would have the following syntax: ``` := N * [I E] E @@ -233,7 +233,7 @@ fn foo(a: T) -> (u32, T) { } ``` -The `X` here (or any other such nested definition) does not inherit the generic context. `X` is non-generic, and a function defined in its place would be too. Consequently, when giving the path to something defined within a generic item, one does not specify the generic arguments because they add no information. The fully qualified name of `X` is thus `my_crate/a1b2c3d4::foo::X` and its symbol name: +The `X` here (or any other such nested definition) does not inherit the generic context. `X` is non-generic, and a function defined in its place would be too. Consequently, when giving the path to something defined within a generic item, one does not specify the generic arguments because they add no information. 
The fully qualified name of `X` is thus `my_crate[a1b2c3d4]::foo::X` and its symbol name: ``` _RN15mycrate_4a3b56d3fooF1XFE @@ -241,7 +241,7 @@ _RN15mycrate_4a3b56d3fooF1XFE However, there is at least one case where the type arguments *do* matter for a definition like this, and that is when trait specialization is used. Consider the following piece of code: -``` +```rust trait Foo { fn foo() -> T; } @@ -265,7 +265,7 @@ impl Foo for Bar { ``` -Notice that both `MSG` statics have the path `::foo::MSG` if you just leave off the type arguments. However, we also don't have any concrete types to substitute the arguments for. Therefore, we have to disambiguate the `impls`. Since trait specialization is an unstable feature of Rust and the details are in flux, this RFC does not try to provide a mangling based on the `where` clauses of the specialized `impls`. Instead it proposes a scheme that re-uses the introduced numeric disambiguator form already used for macro hygiene and closures. Thus, conflicting `impls` would be disambiguated via an implementation defined suffix, as in `'1::foo::MSG` and `'2::foo::MSG`. This encoding introduces minimal additional syntax and can be replaced with something more human-readable once the definition of trait specialization is final. +Notice that, if one just omits the type arguments, both `MSG` statics have the path `::foo::MSG`. However, we cannot disambiguate by adding type arguments, because we don't have any concrete types to substitute the arguments for. Therefore, we have to disambiguate the `impls`. Since trait specialization is an unstable feature of Rust and the details are in flux, this RFC does not try to provide a mangling based on the `where` clauses of the specialized `impls`. Instead it proposes to re-use the "numeric disambiguator" form already used for macro hygiene and closures. Thus, conflicting `impls` would be disambiguated via an implementation defined suffix, as in `'1::foo::MSG` and `'2::foo::MSG`. 
This encoding introduces minimal additional syntax and can be replaced with something more human-readable once the definition of trait specialization is final. ### Unicode Identifiers @@ -297,7 +297,7 @@ mod gödel { would be mangled as: ``` -_RN15mycrate_4a3b56d8gdel_5qau6escher4bachVE` +_RN15mycrate_4a3b56d8gdel_Fqau6escher4bachVE` <--------> Unicode component ``` @@ -410,8 +410,9 @@ Mangled names conform to the following grammar: | // The is the length of the identifier in bytes. -// is must not start with a decimal digit. +// is the identifier itself and must not start with a decimal digit. // If the "u" is present then is Punycode-encoded. +// "V" and "C" are the tags for value and closure namespaces respectively. = ["u"] ["V"|"C"] [] = @@ -510,7 +511,7 @@ With this post-processing in place the Punycode strings can be treated like regu ## Compression -From a high-level perspective symbol name compression works by replacing parts of the mangled name that have already been seen with a substitution marker identifying the already seen part. Which parts are eligible for substitution is defined via the AST of the name (as described in the previous section). Let's define some terms first: +From a high-level perspective symbol name compression works by substituting parts of the mangled name that have already been seen for a back-reference. Which parts are eligible for substitution is defined via the AST of the name (as described in the previous section). Before going into the actual algorithm, let's define some terms: - Two AST nodes are *equivalent* if they contain the same information. In general this means that two nodes are equivalent if the sub-trees they are the root of are equal. However, there is another condition that can make two nodes equivalent. If a node `N` has a single child node `C` and `N` does not itself add any new information, then `N` and `C` are equivalent too. 
The exhaustive list of these special cases is: @@ -578,13 +579,13 @@ Some interesting things to note in this example: ## Decompression -Decompression works analogously to compression: +Decompression works analogously to compression, only this time, the substitution dictionary maps substitution indices to nodes instead of the other way round: - Initialize the substitution dictionary to be empty. - Traverse and modify the AST as follows: - When encountering a substitutable node `N` there are two cases 1. If the node has a single `` child, extract the substitution index from it and replace the node with the corresponding entry from the substitution dictionary. - 2. Else, continue traversing the child nodes of the current node. After the child nodes have been traversed, and if the dictionary does not yet contain an *equivalent* node, then allocate the next unused substitution index and add it to the substitution dictionary with `N` as its key. + 2. Else, continue traversing the child nodes of the current node. After the child nodes have been traversed, and if the dictionary does not yet contain an *equivalent* node, then allocate the next unused substitution index and add it to the substitution dictionary with `N` as its value. This is what the example from above looks like for decompression: @@ -608,15 +609,15 @@ This is what the example from above looks like for decompression: ``` -### A Note On Implementing Efficient Demangling +### A Note On Implementing Efficient Demanglers -The mangling syntax is constructed in a way that allows for implementing an efficient demangler: +The mangling syntax is constructed in a way that allows for implementing efficient demanglers: - Mangled names contain information in the same order as unmangled names are expected to contain it. Therefore, a demangler can directly generate its output while parsing the mangled form. There is no need to explicitly instantiate the AST in memory. - The same is true for decompression. 
The demangler can keep a simple array that maps substitution indices to ranges in the already generated output. When it encounters a `` in need of expansion, it can just look up corresponding range and do a simple `memcpy`. -Parsing, decompression, and demangling can thus be done in a single pass over the mangled name without the need to do dynamic allocation except for dictionary array. +Parsing, decompression, and demangling can thus be done in a single pass over the mangled name without the need to do dynamic allocation except for the dictionary array. ## Mapping Rust Language Entities to Symbol Names @@ -627,7 +628,9 @@ This RFC suggests the following mapping of Rust entities to mangled names: - Absolute paths should be rooted at the inner-most entity that can act as a path root. Roots can be crate-ids, types (for entities with an inherent impl in their path), and trait impls (for entities with trait impls in their path). -- The compiler is free to choose disambiguation indices for identifiers and trait impls that need disambiguation. The disambiguation index `0` is represented by omitting the `` production (which should be the common case). Disambiguation indices do not need to be densely packed. In particular the compiler can use arbitrary hashes to disambiguate items (which is useful for supporting specializing trait impls). +- The disambiguation index for an identifier in the type, value, and closure namespaces is determined by counting the number of occurrences of that identifier within its parent context (i.e. the fully macro-expanded AST). The disambiguation index `0` is represented by omitting the `` production (which should be the common case). + +- The compiler is free to choose disambiguation indices for specializing trait impls. Disambiguation indices do not need to be densely packed. In particular the compiler can use arbitrary hashes to disambiguate specializing trait impls. 
# Drawbacks @@ -635,7 +638,7 @@ This RFC suggests the following mapping of Rust entities to mangled names: Why should we *not* do this? -- The scheme is rather complex, especially due to compression (albeit less complex than prior art) +- The scheme is complex, especially due to compression---albeit less complex than prior art and probably not more complex than the current scheme, if we were to describe that formally. - The current/legacy scheme based on symbol-hashes is flexible in that hashes can be changed at will. That is, the unstable part of the current mangling scheme is nicely contained and does not keep breaking external tools. The danger of breakage is greater with the scheme proposed here because it exposes more information. @@ -644,7 +647,7 @@ Why should we *not* do this? The alternatives considered are: - 1. Keeping the current scheme. It does meet the minimum requirements after all. However, the general consensus seems to be that it confusing and leads to situations where people are unpleasantly surprised when they come across (demangled) symbol names in backtraces or profilers. + 1. Keeping the current scheme. It does meet the minimum requirements after all. However, the general consensus seems to be that it leads to situations where people are unpleasantly surprised when they come across (demangled) symbol names in backtraces or profilers. 2. Keeping the current scheme but cleaning it up by making the non-hash part more consistent and more expressive. Keep the hash part as a safeguard against symbol conflicts and the rest as something just for demangling. The downside of this is that the hash would still not be predictable, and symbols would get rather long if they should contain more human-readable information about generic arguments. 
@@ -660,9 +663,7 @@ The Itanium mangling (and by extension the scheme proposed here) could be consid [prior-art]: #prior-art The mangling scheme described here is an adaptation of the [Itanium C++ ABI][itanium-mangling] scheme, -which is the scheme used by the GCC toolchain (and clang when it's not compiling for MSVC). In fact, -the scheme proposed here tries to stay as close as possible to Itanium mangling and only deviates -where something does not make sense for Rust. +which is the scheme used by the GCC toolchain (and clang when it's not compiling for MSVC). One notable improvement the proposed scheme makes upon Itanium mangling is explicit handling of unicode identifiers. The idea of using [Punycode][punycode] for this is taken from the @@ -679,23 +680,38 @@ Itanium mangling). # Unresolved questions [unresolved-questions]: #unresolved-questions +### Punycode vs UTF-8 +During the pre-RFC phase, it has been suggested that Unicode identifiers should be encoded as UTF-8 instead of Punycode on platforms that allow it. GCC, Clang, and MSVC seem to do this. The author of the RFC has a hard time making up their mind about this issue. Here are some interesting points that might influence the final decision: + +- Using UTF-8 instead of Punycode would make mangled strings containing non-ASCII identifiers a bit more human-readable. For demangled strings, there would be no difference. + +- Punycode support is non-optional since some platforms only allow a very limited character set for symbol names. Thus, we would be using UTF-8 on some platforms and Punycode on others, making it harder to predict what a symbol name for a given item looks like. + +- Punycode encoding and decoding is more runtime effort for the mangler and demangler. + +- Once a demangler supports Punycode, it is not much effort to support both encodings. The `u` identifier suffix tells the demangler whether it's Punycode. Otherwise it can just assume UTF-8 which already subsumes ASCII. 
+ +### Re-use for crate disambiguator + +The RFC currently proposes to represent crate-ids as an `` of the form `_`. However, the `` production already supports disambiguation via its `` component. The crate disambiguator could be encoded into a disambiguation index. + # Appendix A - Suggested Demangling -This RFC suggests that names are demangling to a form that matches Rust syntax as it is used in source code, compiler error messages and `rustdoc`: +This RFC suggests that names are demangled to a form that matches Rust syntax as it is used in source code, compiler error messages and `rustdoc`: -- Path components should be separated by `::`. + - Path components should be separated by `::`. -- If the path root is a `` it should be printed as the crate name. If the context requires it for correctness, the crate disambiguator should be printed too, as in, for example, `std[a0b1c2d3]::collections::HashMap`. In this case `a0b1c2d3` would be the disambiguator. Usually, the disambiguator can be omitted for better readability. + - If the path root is a `` it should be printed as the crate name. If the context requires it for correctness, the crate disambiguator should be printed too, as in, for example, `std[a0b1c2d3]::collections::HashMap`. In this case `a0b1c2d3` would be the disambiguator. Usually, the disambiguator can be omitted for better readability. -- If the path root is a trait impl, it should be printed as ``, like the compiler does in error messages. + - If the path root is a trait impl, it should be printed as ``, like the compiler does in error messages. -- The list of generic arguments should be demangled as ``. + - The list of generic arguments should be demangled as ``. -- Identifiers and trait impl path roots can have a numeric disambiguator (the `` production). The syntactic version of the numeric disambiguator maps to a numeric index. If the disambiguator is not present, this index is 0. If it is of the form `s_` then the index is 1. 
If it is of the form `s_` then the index is ` + 2`. The suggested demangling of a disambiguator is `[]`. However, for better readability, these disambiguators should usually be omitted in the demangling altogether. Disambiguators with index zero can always emitted. + - Identifiers and trait impl path roots can have a numeric disambiguator (the `` production). The syntactic version of the numeric disambiguator maps to a numeric index. If the disambiguator is not present, this index is 0. If it is of the form `s_` then the index is 1. If it is of the form `s_` then the index is ` + 2`. The suggested demangling of a disambiguator is `[]`. However, for better readability, these disambiguators should usually be omitted in the demangling altogether. Disambiguators with index zero can always be omitted. The exception here are closures. Since these do not have a name, the disambiguator is the only thing identifying them. The suggested demangling for closures is thus `{closure}[]`. -- In a lossless demangling, identifiers from the value namespace should be marked with a `'` suffix in order to avoid conflicts with identifiers from the type namespace. In a user-facing demangling, where such conflicts are acceptable, the suffix can be omitted. + - In a lossless demangling, identifiers from the value namespace should be marked with a `'` suffix in order to avoid conflicts with identifiers from the type namespace. In a user-facing demangling, where such conflicts are acceptable, the suffix can be omitted. 
# Appendix B - Interesting Examples @@ -712,7 +728,7 @@ mod foo { } ``` - unmangled: `mycrate::foo::bar::baz` -- mangled: `_RN3foo3bar3bazVE` +- mangled: `_RN11mycrate_xxx3foo3bar3bazVE` ### Item Defined In Inherent Method @@ -728,7 +744,7 @@ impl Foo { } ``` - unmangled: `mycrate::Foo::bar::QUUX` -- mangled: `_RNNM11mycrate_xxx3FooE3barV4QUUXVE` +- mangled: `_RNMN11mycrate_xxx3FooE3barV4QUUXVE` ### Item Defined In Trait Method @@ -773,7 +789,7 @@ pub static QUUX: u32 = { - mangled: `_RN11mycrate_xxx4QUUXV3FOOVE` -### Compressed Prefix Constructed From Prefix That Contains Substitution Itself +### Compressed Prefix Constructed From Prefix That Contains A Substitution Itself - unmangled: `std[xxx]::foo` - mangled: `_RN7std_xxx3fooFINS_3barFENS1_3bazFEEE` From 2c157ce2a04fed31fea3bd40d33d17c52937f854 Mon Sep 17 00:00:00 2001 From: Michael Woerister Date: Thu, 29 Nov 2018 15:14:03 +0100 Subject: [PATCH 10/18] Fix typo. --- text/0000-symbol-name-mangling-v2.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/0000-symbol-name-mangling-v2.md b/text/0000-symbol-name-mangling-v2.md index 07458cbacc0..04e166fa53f 100644 --- a/text/0000-symbol-name-mangling-v2.md +++ b/text/0000-symbol-name-mangling-v2.md @@ -49,7 +49,7 @@ A symbol mangling scheme has a few goals, one of them essential, the rest of the - A mangled symbol should be *decodable* to some degree. That is, it is desirable to be able to tell which exact concrete instance of e.g. a polymorphic function a given symbol identifies. This is true for external tools, backtraces, or just people only having the binary representation of some piece of code available to them. With the current scheme, this kind of information gets lost in the magical hash-suffix. - - It should be possible to predict the symbol name for a given source-level construct. For example, given the definition `fn foo() { ... }`, the scheme should allow to construct, by hand, the symbol names for e.g. `foo` or `foo, ...) -> !>()`. 
Since the current scheme generates its hash from the values of various compiler internal data structures, not even an alternative compiler implementation could predicate the symbol name, even for simple cases. + - It should be possible to predict the symbol name for a given source-level construct. For example, given the definition `fn foo() { ... }`, the scheme should allow to construct, by hand, the symbol names for e.g. `foo` or `foo, ...) -> !>()`. Since the current scheme generates its hash from the values of various compiler internal data structures, not even an alternative compiler implementation could predict the symbol name, even for simple cases. - A mangling scheme should be platform-independent. This is mainly achieved by restricting the character set to `A-Z`, `a-z`, `0-9`, `_`. All other characters might have special meaning in some context (e.g. `.` for MSVC `DEF` files) or are simply not supported (e.g. Unicode). From 69fb61bccf7b0558811f49a0cf0c0bc03b514bb5 Mon Sep 17 00:00:00 2001 From: Michael Woerister Date: Thu, 29 Nov 2018 15:15:45 +0100 Subject: [PATCH 11/18] Clarify use of base-62 numbers in grammar specification. --- text/0000-symbol-name-mangling-v2.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/text/0000-symbol-name-mangling-v2.md b/text/0000-symbol-name-mangling-v2.md index 04e166fa53f..0693a44d959 100644 --- a/text/0000-symbol-name-mangling-v2.md +++ b/text/0000-symbol-name-mangling-v2.md @@ -473,11 +473,13 @@ Mangled names conform to the following grammar: "u" // Unadjusted ) - = "s" [] "_" +// uses 0-9-a-z-A-Z as digits, i.e. 'a' is decimal 10 and +// 'Z' is decimal 61. + = "s" [] "_" = "I" {} "E" - = "S" [] "_" + = "S" [] "_" // We use here, so that we don't have to add a special rule for // compression. In practice, only is expected. 
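The disambiguator rule this patch clarifies can be sketched in a few lines of Rust. This is only an illustration of the mapping described in the text (no disambiguator for index 0, `s_` for index 1, and `s`, a base-62 number, `_` for index `n + 2`), using the digit order `0-9`, `a-z`, `A-Z` from the grammar comment. The names `BASE_62_DIGITS`, `base_62` and `disambiguator` are hypothetical and not taken from the compiler, and writing multi-digit numbers most significant digit first is an assumption.

```rust
/// Digits in the order given by the grammar comment: 0-9, then a-z, then A-Z,
/// so that 'a' is decimal 10 and 'Z' is decimal 61.
const BASE_62_DIGITS: &[u8; 62] =
    b"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

/// Encodes `n` as a base-62 number.
/// Assumption: the most significant digit is written first.
fn base_62(mut n: u64) -> String {
    let mut digits = Vec::new();
    loop {
        digits.push(BASE_62_DIGITS[(n % 62) as usize]);
        n /= 62;
        if n == 0 {
            break;
        }
    }
    digits.reverse();
    String::from_utf8(digits).expect("base-62 digits are ASCII")
}

/// Maps a numeric disambiguator index to its syntactic form
/// (hypothetical helper, following the rule stated in the text).
fn disambiguator(index: u64) -> String {
    match index {
        0 => String::new(),                   // index 0: nothing is emitted
        1 => "s_".to_string(),                // index 1: bare `s_`
        n => format!("s{}_", base_62(n - 2)), // index n: `s<n - 2 in base 62>_`
    }
}

fn main() {
    for index in [0u64, 1, 2, 12, 63, 64] {
        println!("{:>2} -> {:?}", index, disambiguator(index));
    }
}
```

Under these assumptions the second `bar` from the macro-hygiene example (index 1) gets `s_`, and an index of 12 would be written `sa_`.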
From 68782e0c3a415ff23714fb8358849ca71f3f6af5 Mon Sep 17 00:00:00 2001 From: Michael Woerister Date: Fri, 22 Feb 2019 11:32:40 +0100 Subject: [PATCH 12/18] Adapt RFC to new scheme as proposed by eddyb. --- text/0000-symbol-name-mangling-v2.md | 1095 +++++++++++++++++--------- 1 file changed, 720 insertions(+), 375 deletions(-) diff --git a/text/0000-symbol-name-mangling-v2.md b/text/0000-symbol-name-mangling-v2.md index 0693a44d959..faba73376a7 100644 --- a/text/0000-symbol-name-mangling-v2.md +++ b/text/0000-symbol-name-mangling-v2.md @@ -6,20 +6,39 @@ # Summary [summary]: #summary -This RFC proposes a new mangling scheme that describes what the symbol names generated by the Rust compiler. This new scheme has a number of advantages over the existing one which has grown over time without a clear direction. The new scheme is consistent, does not depend on compiler internals, and the information it stores in symbol names can be decoded again which provides an improved experience for users of external tools that work with Rust symbol names. The new scheme is based on the name mangling scheme from the [Itanium C++ ABI][itanium-mangling]. - -Note that, at this point, the new mangling scheme would not be part of the language specification or the specification of a stable Rust ABI. In the future it could be part of both and it is designed to be stable and extensible but for the time being it would still be an implementation detail of the Rust compiler. +This RFC proposes a new mangling scheme that describes what the symbol +names generated by the Rust compiler look like. This new scheme has a number of +advantages over the existing one which has grown over time without a +clear direction. The new scheme is consistent, depends less on +compiler internals, and the information it stores in symbol names can +be decoded again which provides an improved experience for users of +external tools that work with Rust symbol names. 
+ +Note that, at this point, the new mangling scheme would not be part of +the language specification or the specification of a stable Rust ABI. +In the future it _could_ be part of both and it is designed to be +stable and extensible; but for the time being it would still be an +implementation detail of the Rust compiler. # Motivation [motivation]: #motivation -Due to its ad-hoc nature, the compiler's current name mangling scheme has a -number of drawbacks: +Due to its ad-hoc nature, the compiler's current name mangling scheme +has a number of drawbacks: + +- Information about generic parameters and other things is lost in the + mangling process. One cannot extract the type arguments of a + monomorphized function from its symbol name. + +- The current scheme is inconsistent: most paths use + [Itanium ABI][itanium-mangling] style encoding, but some don't. + +- The symbol names it generates can contain `.` characters which is + not generally supported on all platforms. \[[1][gas]\] + \[[2][lld-windows-bug]\] \[[3][thin-lto-bug]\] -- It depends on compiler internals and its results cannot be replicated by another compiler implementation or external tool. -- Information about generic parameters and other things is lost in the mangling process. One cannot extract the type arguments of a monomorphized function from its symbol name. -- The current scheme is inconsistent: most paths use Itanium style encoding, but some don't. -- The symbol names it generates can contain `.` characters which is not generally supported on all platforms. \[[1][gas]\]\[[2][lld-windows-bug]\] \[[3][thin-lto-bug]\] +- It depends on compiler internals and its results cannot be replicated + by another compiler implementation or external tool. 
[gas]: https://sourceware.org/binutils/docs/as/Symbol-Names.html#Symbol-Names [lld-windows-bug]: https://github.com/rust-lang/rust/issues/54190 @@ -27,49 +46,99 @@ number of drawbacks: The proposed scheme solves these problems: -- It is defined in terms of the language, not in terms of compiler data-structures that can change at any given point in time. - It encodes information about generic parameters in a reversible way. -- It has a consistent definition that does not rely on pretty-printing certain language constructs. -- It generates symbols that only consist of the characters `A-Z`, `a-z`, `0-9`, and `_`. +- It has a consistent definition that does not rely on pretty-printing + certain language constructs. +- It generates symbols that only consist of the characters `A-Z`, `a-z`, + `0-9`, and `_`. +- While the proposed scheme still contains things that are implementation + defined it has a clearer path towards full name predictability in future. -This should make it easier for third party tools to work with Rust binaries. +These properties should make it easier for third party tools to work +with Rust binaries. # Guide-level explanation [guide-level-explanation]: #guide-level-explanation -The following section will lay out the requirements for a name mangling scheme and then introduce the actual scheme through a series of ever more complex examples. +The following section will lay out the requirements for a name mangling +scheme and then introduce the actual scheme through a series of ever +more complex examples. ## Requirements for a Symbol Mangling Scheme -A symbol mangling scheme has a few goals, one of them essential, the rest of them desirable. The essential one is: - -- The scheme must provide an unambiguous string encoding for everything that can end up in a binary's symbol table. - -"Unambiguous" means that no two distinct compiler-generated entities (that is, mostly object code for functions) must be mapped to the same symbol name. 
This disambiguation is the main purpose of the hash-suffix in the current, legacy mangling scheme. The scheme proposed here, on the other hand, achieves it in a way that allows to also satisfy a number of additional desirable properties of a mangling scheme: - - - A mangled symbol should be *decodable* to some degree. That is, it is desirable to be able to tell which exact concrete instance of e.g. a polymorphic function a given symbol identifies. This is true for external tools, backtraces, or just people only having the binary representation of some piece of code available to them. With the current scheme, this kind of information gets lost in the magical hash-suffix. - - - It should be possible to predict the symbol name for a given source-level construct. For example, given the definition `fn foo() { ... }`, the scheme should allow to construct, by hand, the symbol names for e.g. `foo` or `foo, ...) -> !>()`. Since the current scheme generates its hash from the values of various compiler internal data structures, not even an alternative compiler implementation could predict the symbol name, even for simple cases. - - - A mangling scheme should be platform-independent. This is mainly achieved by restricting the character set to `A-Z`, `a-z`, `0-9`, `_`. All other characters might have special meaning in some context (e.g. `.` for MSVC `DEF` files) or are simply not supported (e.g. Unicode). - - - The scheme should be efficient, meaning that the symbols it produces are not unnecessarily long (because that takes up space in object files and means more work for the compiler and the linker). In addition, generating or demangling a symbol name should not be too computationally expensive. +A symbol mangling scheme has a few goals, one of them essential, +the rest of them desirable. The essential one is: + +- The scheme must provide an unambiguous string encoding for + everything that can end up in a binary's symbol table. 
+ +"Unambiguous" means that no two distinct compiler-generated entities +(that is, mostly object code for functions) must be mapped to the same +symbol name. This disambiguation is the main purpose of the hash-suffix +in the current, legacy mangling scheme. The scheme proposed here, on +the other hand, achieves it in a way that allows to also satisfy a +number of additional desirable properties of a mangling scheme: + + - A mangled symbol should be *decodable* to some degree. That is, it + is desirable to be able to tell which exact concrete instance of e.g. + a polymorphic function a given symbol identifies. This is true for + external tools, backtraces, or just people only having the binary + representation of some piece of code available to them. With the + current scheme, this kind of information gets lost in the magical + hash-suffix. + + - A mangling scheme should be platform-independent. This is mainly + achieved by restricting the character set to `A-Z`, `a-z`, `0-9`, + `_`. All other characters might have special meaning in some + context (e.g. `.` for MSVC `DEF` files) or are simply not + supported (e.g. Unicode). + + - The scheme should be efficient, meaning that the symbols it + produces are not unnecessarily long (because that takes up space + in object files and means more work for the compiler and the linker). + In addition, generating or demangling a symbol name should not be + too computationally expensive. + + - When used as part of a stable ABI, it should be possible to predict + the symbol name for a given source-level construct. For example, + given the definition `fn foo() { ... }`, the scheme should allow + to construct, by hand, the symbol names for e.g. `foo` or + `foo, ...) -> !>()`. + Since the current scheme generates its hash from the values of + various compiler internal data structures, an alternative compiler + implementation could not predict the symbol name, even for + simple cases. 
Note that the scheme proposed here does not fulfill + this requirement either (yet) as some things are still left to + the compiler implementation. The RFC also has a couple of non-goals: - - Source-level definitions can contain components that will not show up in symbol names, like lifetimes (as in `fn foo<'a>()`). This RFC does not define a mangling for cases like these. One might want to cover them "for completeness" but they are not actually needed. - - - The mangling scheme does not try to be compatible with an existing C++ mangling scheme. While it might sound tempting to encode Rust symbols with an existing scheme, it is the author's opinion that the actual benefits are small (C++ tools would not demangle to Rust syntax, demanglings would be hard to read) and at the same time supporting a Rust-specific scheme in existing tools seems quite feasible (many tools like GDB, LLDB, binutils, and valgrind already have specialized code paths for Rust symbols). + - The mangling scheme does not try to be compatible with an existing + (e.g. C++) mangling scheme. While it might sound tempting to encode Rust + symbols with an existing scheme, it is the author's opinion that + the actual benefits are small (C++ tools would not demangle to Rust + syntax, demanglings would be hard to read) and at the same time + supporting a Rust-specific scheme in existing tools seems quite + feasible (many tools like GDB, LLDB, binutils, and valgrind already + have specialized code paths for Rust symbols). + - The RFC does not try to define a standardized _demangled_ form for + symbol names. It defines the mangled form and makes sure it can be + demangled in an efficient manner but different demanglers still + have some degree of freedom regarding how symbol names are presented + to the user. ## The Mangling Scheme by Example -This section will develop an overview of the mangling scheme by walking through a number of examples. 
We'll start with the simplest case -- and will see how that already involves things that might be surprising. +This section will develop an overview of the mangling scheme by walking +through a number of examples. We'll start with the simplest case -- and +will see how that already involves things that might be surprising. ### Free-standing Functions and Statics -A free-standing function is fully identified via its absolute path. For example, the following function +A free-standing function is fully identified via its absolute path. +For example, the following function ```rust mod foo { @@ -77,11 +146,31 @@ mod foo { } ``` -has the path `foo::bar` and `N3foo3barE` is a mangling of that path that complies to the character set we are restricted to. Why this format with numbers embedded in it? It is the encoding that the [Itanium C++ ABI][itanium-mangling] name mangling scheme uses for "nested names" (i.e. paths). The scheme proposed here will also use this format because it does not need termination tokens for identifiers (which are hard to come by with our limited character set). - -However, the symbol name above does not unambiguously identify the function in every context. It is perfectly valid for another crate to also define `mod foo { fn bar() {} }` somewhere. So in order to avoid conflicts in such cases, the absolute path must always include the crate name and disambiguator, as in `N15mycrate_4a3b56d3foo3barE` (the crate disambiguator is used to disambiguate different versions of the same crate. It is an existing concept and not introduced by this RFC). - -There is another possible ambiguity that we have to take care of. Rust has two distinct namespaces: the type and the value namespace. 
This leads to a path of the form `crate_id::foo::bar` not uniquely identifying the item `bar` because the following snippet is legal Rust code: +has the path `foo::bar` and `NN3foo3bar` is a possible mangling of that path +that complies to the character set we are restricted to. Why this format with +numbers embedded in it? It is a run-length encoding, similar to what the +[Itanium C++ ABI][itanium-mangling] name mangling scheme uses for +identifiers. The scheme proposed here will also use this +format because it does not need termination tokens for identifiers +(which are hard to come by with our limited character set). + +Note that each component in the path (i.e. `foo` and `bar`) also has an +accompanying _start-tag_ (here `N`) at the beginning. This start-tag is +needed in order for the syntax to be able to represent complex, nested +structures as we will see later. + +The symbol name above, unfortunately, does not unambiguously identify the +function in every context. It is perfectly valid for another crate +to also define `mod foo { fn bar() {} }` somewhere. So in order to +avoid conflicts in such cases, the absolute path must always include +the crate-id, as in `NNC7mycrate3foo3bar`. The crate-id has a `C` +start-tag. + +There is another possible ambiguity that we have to take care of. +Rust has two distinct namespaces: the type and the value namespace. +This leads to a path of the form `crate_id::foo::bar` not uniquely +identifying the item `bar` because the following snippet is legal +Rust code: ```rust fn foo() { @@ -93,9 +182,22 @@ mod foo { } ``` -The function `foo` lives in the value namespaces while the module `foo` lives in the type namespace. They don't interfere. 
In order to make the symbol names for the two distinct `bar` functions unique, we thus add a suffix to name components in the value namespace, so case one would get the symbol name `N15mycrate_4a3b56d3fooV3barVE` and case two get the name `N15mycrate_4a3b56d3foo3barVE` (notice the difference: `3fooV` vs `3foo`). - -There is on final case of name ambiguity that we have to take care of. Because of macro hygiene, multiple items with the same name can appear in the same context. The compiler internally disambiguates such names by augmenting them with a numeric index. For example, the first occurrence of the name `foo` within its parent is actually treated as `foo'0`, the second occurrence would be `foo'1`, the next `foo'2`, and so one. The mangling scheme will adopt this setup by appending a disambiguation suffix to each identifier with a non-zero index. So if macro expansion would result in the following code: +The function `foo` lives in the value namespace while the module `foo` +lives in the type namespace. They don't interfere. In order to make the +symbol names for the two distinct `bar` functions unique, we thus add a +namespace identifier to the start-tag of components where necessary, as in +`NvNvC7mycrate3foo3bar` for the first case and `NvNtC7mycrate3foo3bar` +for the second case (notice the difference: `NvNv...` vs `NvNt...`). + +There is one final case of name ambiguity that we have to take care of. +Because of macro hygiene, multiple items with the same name can appear in +the same context. The compiler internally disambiguates such names by +augmenting them with a numeric index. For example, the first occurrence +of the name `foo` within its parent is actually treated as `foo'0`, the +second occurrence would be `foo'1`, the next `foo'2`, and so on. The +mangling scheme will adopt this setup by prepending a disambiguation +prefix to each identifier with a non-zero index.
So if macro expansion +would result in the following code: ```rust mod foo { @@ -104,25 +206,46 @@ mod foo { fn bar'1() {} } ``` -Then we would encode the two functions symbols as `N15mycrate_4a3b56d3foo3barVE` and `N15mycrate_4a3b56d3foo3barVs_E` respectively (note the `s_` suffix in the second case). Details on the shape of this suffix are provided in the reference-level description. -As opposed to C++ and other languages that support function overloading, we don't need to include function parameter types in the symbol name. Rust does not allow two functions of the same name but different arguments. +then we would encode the two function symbols as `NvNtC7mycrate3foo3bar` +and `NvNtC7mycrate3foos_3bar` respectively (note the `s_` prefix in the +second case). A very similar disambiguation is needed for avoiding +conflicts between crates of the same name but different versions. The +same syntactic prefix is thus used for crate-id where we encode the +crate disambiguator as in `NvNtCs1234_7mycrate3foo3bar`. Details on +the shape of this prefix are provided in the reference-level description. -The final symbol name for the function would also include the prefix `_R` that is common to all symbol names generated by this scheme: - -``` - _RN15mycrate_4a3b56d3foo3barVE - - <><--------------------------> - | | -prefix absolute path +As opposed to C++ and other languages that support function overloading, +we don't need to include function parameter types in the symbol name. +Rust does not allow two functions of the same name but different arguments.
+The final symbol name for the function would also include the prefix +`_R` that is common to all symbol names generated by this scheme: +``` + _RNvNtCs1234_7mycrate3foo3bar + <>^^^^^<----><------><--><--> + |||||| | | | | + |||||| | | | +--- "bar" identifier + |||||| | | +------- "foo" identifier + |||||| | +------------- "mycrate" identifier + |||||| +-------------------- disambiguator for "mycrate" + |||||+------------------------ start-tag for "mycrate" + ||||+------------------------- namespace tag for "foo" + |||+-------------------------- start-tag for "foo" + ||+--------------------------- namespace tag for "bar" + |+---------------------------- start-tag for "bar" + +----------------------------- common Rust symbol prefix ``` + ### Generic Functions -Each monomorphization of a generic function has its own symbol name. The monomorphizations are disambiguated by the list of concrete generic arguments. These arguments are listed as a suffix, starting with `I`, after the name they belong to. So the instance +Each monomorphization of a generic function has its own symbol name. +The monomorphizations are disambiguated by the list of concrete generic +arguments. These arguments are added to the symbol name by a pair of `I` +start-tag at the beginning and a list of the actual arguments at the end. +So the instance ```rust std::mem::align_of:: @@ -131,47 +254,77 @@ std::mem::align_of:: would be mangled to ``` -_RN12std_a1b2c3d43mem8align_ofVIdEE - ^^^ - ||| - start of argument list ---+|+--- end of argument list - | - f64 +_RINvNtC3std3mem8align_ofdE + ^ ^^ + | || + | |+--- end of argument list + | +--- f64 + +--- start-tag ``` -where `I` starts the list of arguments, `d` designates `f64` and `E` ends the argument list. As we can see, we need to be able to represent all kinds of types that can be part of such an argument list. 
(In the future, when const generics get added to the language, we might also need to represent *values*) These kinds of types are: +where `I` precedes the thing the arguments belong to, `d` designates `f64` +and `E` ends the argument list. As we can see, we need to be able to +represent all kinds of types that can be part of such an argument list. +(In the future, when const generics get added to the language, we will +also need to represent *values*.) These kinds of types are: - basic types (`char`, `()`, `str`, `!`, `i8`, `i16`, ...) - - reference and pointers types, shared and `mut` + - reference and pointer types, shared, `mut` and `const` - tuples - - arrays, with and without fixed size (e.g. `[u8]`, `[u8; 17]`, or as part of a slice type `&[char]`) - - structs, enums, closures, and other named types, possibly with their own set of type arguments + - arrays, with and without fixed size (e.g. `[u8]`, `[u8; 17]`) + - structs, enums, closures, and other named types, possibly with their + own set of type arguments - function types such as `fn(&i32) -> u16` - -Basic types are all encoded via a single lower-case letter, like in the Itanium scheme. Named types are encoded as their absolute path (including arguments) like is done for function symbols. Composites like references, tuples, and function types all follow a simple grammar given in the reference-level explanation below. Here are some example manglings to get a general feel of what they look like: - - - `std::mem::align_of::`: `_RN12std_a1b2c3d43mem8align_ofVIjEE` - - `std::mem::align_of::<&char>`: `_RN12std_a1b2c3d43mem8align_ofVIRcEE` - - `std::mem::align_of::`: `_RN12std_a1b2c3d43mem8align_ofVIN12std_a1b2c3d43mem12DiscriminantEEE` - - `std::mem::align_of::<&mut (&str,())>`: `_RN12std_a1b2c3d43mem8align_ofVIWTRrvEEE` - -There's one more thing we have to take into account for generic functions: The compiler may produce "crate-local" copies of a monomorphization.
That is, if there is a function `foo` which gets used as `foo` in two different crates, the compiler (depending on the optimization level) might generate two distinct functions at the LLVM IR level, each with it's own symbol. In order to support this without running into conflicts, symbol names for monomorphizations must include the id of the crate they are instantiated for. This scheme does this by appending an `` suffix to the symbol. So for example the mangling for `std::mem::align_of::` would actually look like this: + - `dyn` traits + +Basic types are all encoded via a single lower-case letter, like in the +Itanium scheme. Named types are encoded as their absolute path +(including arguments) as is done for function symbols. Composites like +references, tuples, and function types all follow a simple grammar given +in the reference-level explanation below. Here are some example manglings +to get a general feel of what they look like: + + - `std::mem::align_of::`: `_RINvNtC3std3mem8align_ofjE` + - `std::mem::align_of::<&char>`: `_RINvNtC3std3mem8align_ofRcE` + - `std::mem::align_of::`: + `_RINvNtC3std3mem8align_ofNtNtC3std3mem12DiscriminantE` + - `std::mem::align_of::<&mut (&str,())>`: `_RINvNtC3std3mem8align_ofQTReuEE` + +There's one more thing we have to take into account for generic functions: +The compiler may produce "crate-local" copies of a monomorphization. +That is, if there is a function `foo` which gets used as `foo` +in two different crates, the compiler (depending on the optimization level) +might generate two distinct functions at the LLVM IR level, each with its +own symbol. In order to support this without running into conflicts, symbol +names for monomorphizations must include the id of the crate they are +instantiated for. This scheme does this by appending an `` suffix +to the symbol.
So for example the mangling for `std::mem::align_of::` +would actually look like this: ``` -_RN12std_a1b2c3d43mem8align_ofVIjEE12foo_a1b2c3d4 (for crate "foo[a1b2c3d4]) -_RN12std_a1b2c3d43mem8align_ofVIjEE12bar_11223344 (for crate "bar[11223344]) +_RINvNtC3std3mem8align_ofjEC3foo (for crate "foo") +_RINvNtC3std3mem8align_ofjEC3bar (for crate "bar") ``` ### Closures and Closure Environments -The scheme needs to be able to generate symbol names for the function containing the code of a closure and it needs to be able to refer to the type of a closure if it occurs as a type argument. As closures don't have a name, we need to generate one. The scheme proposes to use the namespace and disambiguation mechanisms already introduced above for this purpose. Closures get their own "namespace" (i.e. they are neither in the type nor the value namespace), and each closure has an empty name with a disambiguation index (like for macro hygiene) identifying them within their parent. The full name of a closure is then constructed like for any other named item: +The scheme needs to be able to generate symbol names for the function +containing the code of a closure and it needs to be able to refer to +the type of a closure if it occurs as a type argument. As closures +don't have a name, we need to generate one. The scheme proposes to +use the namespace and disambiguation mechanisms already introduced +above for this purpose. Closures get their own "namespace" (i.e. +they are neither in the type nor the value namespace), and each closure +has an empty name with a disambiguation index (like for macro hygiene) +identifying them within their parent. 
The full name of a closure is +then constructed like for any other named item: ```rust mod foo { fn bar(x: u32) { - let a = |x| { x + 1 }; // local name: 0C - let b = |x| { x + 2 }; // local name: 0Cs_ + let a = |x| { x + 1 }; // local name: NC<...>0 + let b = |x| { x + 2 }; // local name: NC<...>s_0 a(b(x)) } @@ -179,98 +332,149 @@ mod foo { ``` -In the above example we have two closures, the one assigned to `a` and the one assigned to `b`. The first one would get the local name `0C` and the second one the name `0Cs_`. The `0` signifies the length of their (empty) name. The `C` is the namespace tag, analogous to the `V` tag for the value namespace. The `s_` for the second closure is the disambiguation index (index `0` is, again, encoded by not appending a suffix). Their full names would then be `N15mycrate_4a3b56d3foo3barV0CE` and `N15mycrate_4a3b56d3foo3barV0Cs_E` respectively. +In the above example we have two closures, the one assigned to `a` +and the one assigned to `b`. The first one would get the local name +`NC<...>0` and the second one the name `NC<...>s_0`. The `0` signifies +the length of their (empty) name. The `<...>` part is the path of the +parent. The `C` is the namespace tag, analogous to the `v` tag for +the value namespace. The `s_` for the second closure is the +disambiguation index (index `0` is, again, encoded by not prepending +a prefix). Their full names would then be `NCNvNtC7mycrate3foo3bar0` +and `NCNvNtC7mycrate3foo3bars_0` respectively. ### Methods -Methods are nested within `impl` or `trait` items. As such it would be possible to construct their symbol names as paths like `my_crate::foo::{{impl}}::some_method` where `{{impl}}` somehow identifies the the `impl` in question. Since `impl`s don't have names, we'd have to use an indexing scheme like the one used for closures (and indeed, this is what the compiler does internally). 
Adding in generic arguments to, this would lead to symbol names looking like `my_crate::foo::impl'17::::some_method`. - -However, in the opinion of the author these symbols are very hard to map back to the method they represent. Consider a module containing dozens of types, each with multiple `impl` blocks generated via `#[derive(...)]`. In order to find out which method a symbol maps to, one would have to count the number of handwritten and macro generated `impl` blocks in the module, and hope that one correctly guessed the number of `impl` blocks introduced by the given derive-macro (each macro invocation can introduce `0..n` such blocks). The name of the method might give a hint, but there are still likely to be dozens of methods named `clone`, `hash`, `eq`, et cetera. - -The RFC therefore proposes to keep symbol names close to how methods are represented in error messages, that is: - -- `Foo::some_method` for inherent methods, and +Methods are nested within `impl` or `trait` items. As such it would be +possible to construct their symbol names as paths like +`my_crate::foo::{{impl}}::some_method` where `{{impl}}` somehow identifies +the `impl` in question. Since `impl`s don't have names, we'd have to +use an indexing scheme like the one used for closures (and indeed, this is +what the compiler does internally). Adding in generic arguments too, this +would lead to symbol names looking like +`my_crate::foo::impl'17::::some_method`. + +However, in the opinion of the author these symbols are very hard to map +back to the method they represent. Consider a module containing dozens of +types, each with multiple `impl` blocks generated via `#[derive(...)]`.
+In order to find out which method a symbol maps to, one would have to count +the number of handwritten _and_ macro generated `impl` blocks in the module, +and hope that one correctly guessed the number of `impl` blocks introduced +by the given derive-macro (each macro invocation can introduce `0..n` such +blocks). The name of the method might give a hint, but there are still +likely to be dozens of methods named `clone`, `hash`, `eq`, et cetera. + +The RFC therefore proposes to keep symbol names close to how methods are +represented in error messages, that is: + +- `>::some_method` for inherent methods, and - ` as SomeTrait>::some_method` for trait methods. -This can be achieved by extending the definition of absolute paths that we have used so far. Instead of the path root always being a crate-id, we now also allow a path to start with a single type (i.e. the self-type of an inherent method) or with a pair of self-type and the trait being implemented. The kind of root is indicated by the first character of the `N` starting the path: - -- a decimal digit signifies a path with a crate-id root (since crate-ids always start with a digit), -- an `M` signifies a path with a single type as its root, and -- an `X` signifies a path with a self-type/trait pair as its root. +This can be achieved by extending the definition of paths that we have +used so far. Instead of the path root always being a crate-id, we now +also allow a path to start with an `impl` production that contains the +self-type and (for trait methods) the name of the trait being implemented. 
Thus, this extended form of paths would have the following syntax: ``` - := N * [I E] E - - := - | M - | X + = C // crate-id root + | M // inherent impl root + | X // trait impl root + | N // nested path + | I {} E // generic arguments ``` Here are some examples for complete symbol names: ``` -::foo => _RNXmN12mycrate_abcd3FooE3fooVE -mycrate::Foo::foo => _RNMN12mycrate_abcd3FooImEE3fooVE - as mycrate::Bar>::foo => _RNXN12mycrate_abcd3FooImEEN12mycrate_abcd3BarIyEE3fooVE +>::foo => _RNvMINtC7mycrate3FoomE3foo +::foo => _RNvXmNtC7mycrate3Foo3foo + as mycrate::Bar>::foo => _RNvXINtC7mycrate3FoomEINtC7mycrate3BaryE3foo ``` -### Items Within Specialized Trait Impls +### Items Within Generic Impls -In Rust one can define items within generic items, e.g. functions or impls, like in the following example: +In Rust one can define items within generic items, e.g. functions or +impls, like in the following example: ```rust -fn foo(a: T) -> (u32, T) { - static mut X: u32 = 0; +struct Foo(T); - unsafe { - X += 1; - (X, a) +impl From for Foo { + fn from(x: T) -> Self { + static MSG: &str = "..."; + panic!("{}", MSG) } } ``` -The `X` here (or any other such nested definition) does not inherit the generic context. `X` is non-generic, and a function defined in its place would be too. Consequently, when giving the path to something defined within a generic item, one does not specify the generic arguments because they add no information. The fully qualified name of `X` is thus `my_crate[a1b2c3d4]::foo::X` and its symbol name: +The `MSG` here (or any other such nested definition) does not inherit +the generic context from the `impl`. `MSG` is non-generic, and a +function defined in its place would be too. 
The fully qualified name +of `MSG`, according to our examples so far, is thus +` as std::convert::From<_>>::from::MSG` and its symbol name: ``` -_RN15mycrate_4a3b56d3fooF1XFE +_RNvNvXINtC7mycrate3FoopEINtNtC3std7convert4FrompE4from3MSG ``` -However, there is at least one case where the type arguments *do* matter for a definition like this, and that is when trait specialization is used. Consider the following piece of code: +However, with trait specialization, this symbol can be ambiguous. +Consider the following piece of code: ```rust -trait Foo { - fn foo() -> T; -} - -struct Bar(T); +struct Foo(T); -impl Foo for Bar { - default fn foo() -> T { - static MSG: &str = "sry, no can do"; +impl From for Foo { + default fn from(x: T) -> Self { + static MSG: &str = "..."; panic!("{}", MSG) } } -impl Foo for Bar { - fn foo() -> T { - static MSG: &str = "it's a go!"; - println!("{}", MSG); - T::default() +impl From for Foo { + fn from(x: T) -> Self { + static MSG: &str = "123"; + panic!("{}", MSG) } } +``` + +Notice that both `MSG` statics have the path ` as From<_>>::from::MSG`. +We somehow have to disambiguate the `impls`. We do so by adding the path of +the `impl` to the symbol name. ``` + = C // crate-id root + | M // inherent impl root + | X // trait impl root + | N // nested path + | I {} E // generic arguments -Notice that, if one just omits the type arguments, both `MSG` statics have the path `::foo::MSG`. However, we cannot disambiguate by adding type arguments, because we don't have any concrete types to substitute the arguments for. Therefore, we have to disambiguate the `impls`. Since trait specialization is an unstable feature of Rust and the details are in flux, this RFC does not try to provide a mangling based on the `where` clauses of the specialized `impls`. Instead it proposes to re-use the "numeric disambiguator" form already used for macro hygiene and closures. 
Thus, conflicting `impls` would be disambiguated via an implementation defined suffix, as in `'1::foo::MSG` and `'2::foo::MSG`. This encoding introduces minimal additional syntax and can be replaced with something more human-readable once the definition of trait specialization is final. + = [] +``` +The two symbol names would then look something like: + +``` +_RNvNvXs2_C7mycrateINtC7mycrate3FoopEINtNtC3std7convert4FrompE4from3MSG +_RNvNvXs3_C7mycrateINtC7mycrate3FoopEINtNtC3std7convert4FrompE4from3MSG + <----------><----------------><-----------------------> + impl-path self-type trait-name +``` + +Like other disambiguation information, this path would usually not actually +be shown by demanglers. ### Unicode Identifiers -Rust allows Unicode identifiers but our character set is restricted to ASCII alphanumerics, and `_`. In order to transcode the former to the latter, we use the same approach as Swift, which is: encode all non-ascii identifiers via [Punycode][punycode], a standardized and efficient encoding that keeps encoded strings in a rather human-readable format. So for example, the string +Rust allows Unicode identifiers but our character set is restricted +to ASCII alphanumerics, and `_`. In order to transcode the former to +the latter, we use the same approach as Swift, which is: encode all +non-ASCII identifiers via [Punycode][punycode], a standardized and +efficient encoding that keeps encoded strings in a rather +human-readable format. So for example, the string ``` "Gödel, Escher, Bach" @@ -282,9 +486,14 @@ is encoded as "Gdel, Escher, Bach-d3b" ``` -which, as opposed to something like _Base64_, still gives a pretty good idea of what the original string looked like. +which, as opposed to something like _Base64_, still gives a pretty +good idea of what the original string looked like. -Each component of a name, i.e. anything that starts with the number of bytes to read in the examples above, is encoded individually. 
Components encoded this way are augmented with a `u` suffix so that demanglers know that the identifier needs further decoding. As an example, the function: +Each component of a name, i.e. anything that starts with the number +of bytes to read in the examples above, is encoded individually. +Components encoded this way are augmented with a `u` prefix so that +demanglers know that the identifier needs further decoding. As an +example, the function: ```rust mod gödel { @@ -297,19 +506,23 @@ mod gödel { would be mangled as: ``` -_RN15mycrate_4a3b56d8gdel_Fqau6escher4bachVE` - <--------> - Unicode component +_RNvNtNtC7mycrateu8gdel_Fqa6escher4bach + <--------> + Unicode component ``` ### Compression/Substitution -The length of symbol names has an influence on how much work compiler, linker, and loader have to perform. The shorter the names, the better. At the same time, Rust's generics can lead to rather long names (which are often not visible in the code because of type inference and `impl Trait`). For example, the return type of the following function: +The length of symbol names has an influence on how much work the compiler, +linker, and loader have to perform. The shorter the names, the better. +At the same time, Rust's generics can lead to rather long names (which +are often not visible in the code because of type inference and +`impl Trait`). For example, the return type of the following function: ```rust -fn quux(s: Vec) -> impl Iterator { +fn quux(s: Vec) -> impl Iterator { s.into_iter() - .map(|x| x+1) + .map(|x| x + 1) .filter(|&x| x > 10) .zip(0..) .chain(iter::once((0, 0))) @@ -330,48 +543,56 @@ std::iter::Chain< std::iter::Once<(u32, usize)>> ``` -It would make for a long symbol name if this types is used (maybe repeatedly) as a generic argument somewhere. C++ has the same problem with its templates; which is why the Itanium mangling introduces the concept of compression. 
If a component of a definition occurs more than once, it will not be repeated and instead be emitted as a substitution marker that allows to reconstruct which component it refers to. The scheme proposed here will use the same approach. +It would make for a long symbol name if this type is used (maybe +repeatedly) as a generic argument somewhere. C++ has the same problem +with its templates; which is why the Itanium mangling introduces the +concept of compression. If a component of a definition occurs more than +once, it will not be repeated and instead be emitted as a substitution +marker that allows to reconstruct which component it refers to. The +scheme proposed here will use the same approach (but with a simpler +definition). -The exact scheme will be described in detail in the reference level explanation below but it roughly works as follows: As a mangled symbol name is being built or parsed, we build up a dictionary of "substitutions", that is we keep track of things a subsequent occurrence of which could be replaced by a substitution marker. The substitution marker is then the lookup key into this dictionary. The things that are eligible for substitution are (1) all prefixes of absolute paths (including the entire path itself) and (2) all types except for basic types. If a substitutable item is already present in the dictionary it does not generate a new key. Here's an example in order to illustrate the concept: +The exact scheme will be described in detail in the reference level +explanation below but it roughly works as follows: As a mangled symbol +name is being built, we remember the position of every substitutable item +in the output string, that is, we keep track of things a subsequent +occurrence of which could be replaced by a back reference. -``` +The things that are eligible for substitution are (1) all prefixes of +paths (including the entire path itself), (2) all types except for +basic types, and (3) instances of const data. 
- std::iter::Chain, std::vec::IntoIter>> -$0: --- -$1: --------- -$2: ---------------- -$3: -------------- -$4: -------- -$5: ------------------ -$6: ----------------------- -$7: ---------------------------------------------------------------- -$8: ---------------------------------------------------------------------------------- -``` - -The indices on the left are the dictionary keys. The prefixes `std`, `std::iter`, and `std::iter::Chain` all get added to the dictionary because we have not seen them before. After that we encounter `std` again. We've already seen it, so we don't add anything to the dictionary. The same goes for when we encounter `std::iter` the second time. Next we encounter `std::iter::Zip`, which we have not seen before, so it's added to the dictionary. Next we encounter `std` again (already seen, no insertion), then `std::vec` and `std::vec::IntoIter` which both generate a new entry. Next we see `std::vec::IntoIter`, the first full _type_. It generates an entry too. The second type parameter is the same as the first. No part of it introduces a new entry. After the next `>` we have completely processed `std::iter::Zip, std::vec::IntoIter>`, which adds another type entry. Finally, the full `std::iter::Chain, std::vec::IntoIter>>` adds another entry. - -Using the dictionary above, we can compress to: +Here's an example in order to illustrate the concept. The name ``` -std::iter::Chain<$1::Zip<$0::vec::IntoIter, $6>> +std::iter::Chain, std::vec::IntoIter>> ``` -A couple of things to note: - - - The first occurrence of a dictionary entry is never substituted. We don't store the dictionary anywhere and need to be able to reconstruct it from the compressed version. - - Longer substitutions are preferred to shorter ones. `std::iter::Chain<$1::Zip<$0::vec::IntoIter, $4::IntoIter>>` would also decompress to the original version but the compiler is supposed to always pick the longest substitution available. 
- -The mangled version of a substitution marker is `S _` (and `S_` for key `0`) like in the Itanium mangling. So the above definition would be mangled to: +is mangled to the following uncompressed string. The lines below show parts +of the mangled string that already occurred before and can thus be replaced +by a back reference. The number at the beginning of each span gives +the 0-based byte position of where it occurred the first time. ``` -_RN12std_a1b2c3d44iter5ChainINS0_3ZipINS_3vec8IntoIterIjEES5_EEE +0 10 20 30 40 50 60 70 80 90 +_RINtNtC3std4iter5ChainINtNtC3std4iter3ZipINtNtC3std3vec8IntoItermEINtNtC3std3vec8IntoItermEEE + 7---- 7---- 7---- + 5----------- 45--------- + 43-------------------- + 42----------------------- ``` -The uncompressed version would be: +The compiler is always supposed to use the longest replacement possible +in order to achieve the best compression. The compressed symbol looks +as follows: + ``` -_RN12std_a1b2c3d44iter5ChainIN12std_a1b2c3d44iter3ZipIN12std_a1b2c3d43vec8IntoIterIjEEN12std_a1b2c3d43vec8IntoIterIjEEEEE +_RINtNtC3std4iter5ChainINtB4_3ZipINtNtB6_3vec8IntoItermEBv_EE + ^^^ ^^^ ^^^ back references ``` +Back references have the form `B_`. + # Reference-level explanation [reference-level-explanation]: #reference-level-explanation @@ -381,7 +602,10 @@ The reference-level explanation consists of three parts: 2. A specification of the compression scheme. 3. A mapping of Rust entities to the mangling syntax. -For implementing a demangler, only the first two sections are of interest, that is, a demangler only needs to understand syntax and compression of names, but it does not have to care about how the compiler generates mangled names. +For implementing a demangler, only the first two sections are of +interest, that is, a demangler only needs to understand syntax and +compression of names, but it does not have to care about how the +compiler generates mangled names. 
## Syntax Of Mangled Names @@ -391,41 +615,77 @@ The syntax of mangled names is given in extended Backus-Naur form: - Non-terminals are within angle brackets (as in ``) - Terminals are within quotes (as in `"_R"`), - Optional parts are in brackets (as in `[]`), - - Repetition (zero or more times) is signified by curly braces (as in `{ }`) + - Repetition (zero or more times) is signified by curly braces (as in `{}`) - Comments are marked with `//`. Mangled names conform to the following grammar: ``` // The specifies the encoding version. - = "_R" [] [] - - = "N" [] "E" - | - - = - | "M" - | "X" [] - | - | + = "_R" [] [] + + = "C" // crate root + | "M" // (inherent impl) + | "X" // (trait impl) + | "Y" // (trait definition) + | "N" // ...::ident (nested path) + | "I" {} "E" // ... (generic args) + | + +// Path to an impl (without the Self type or the trait). +// The is the parent, while the distinguishes +// between impls in that same parent (e.g. multiple impls in a mod). +// This exists as a simple way of ensuring uniqueness, and demanglers +// don't need to show it (unless the location of the impl is desired). + = [] // The is the length of the identifier in bytes. // is the identifier itself and must not start with a decimal digit. // If the "u" is present then is Punycode-encoded. -// "V" and "C" are the tags for value and closure namespaces respectively. - = ["u"] ["V"|"C"] [] + = [] + = "s" + = ["u"] + +// Namespace of the identifier in a (nested) path. +// It's an a-zA-Z character, with a-z reserved for implementation-internal +// disambiguation categories (and demanglers should never show them), while +// A-Z are used for special namespaces (e.g. closures), which the demangler +// can show in a special way (e.g. `NC...` as `...::{closure}`), or just +// default to showing the uppercase character. 
+ = "C" // closure + | "S" // shim + | // other special namespaces + | // internal namespaces + + = + | + | "K" // forward-compat for const generics + +// An anonymous (numbered) lifetime, either erased or higher-ranked. +// Index 0 is always erased (can show as '_, if at all), while indices +// starting from 1 refer (as de Bruijn indices) to a higher-ranked +// lifetime bound by one of the enclosing s. + = "L" + +// Specify the number of higher-ranked (for<...>) lifetimes to bound. +// can then later refer to them, with lowest indices for +// innermost lifetimes, e.g. in `for<'a, 'b> fn(for<'c> fn(...))`, +// any s in ... (but not inside more binders) will observe +// the indices 1, 2, and 3 refer to 'c, 'b, and 'a, respectively. + = "G" = - | // named type - | "A" [] // [T; N] - | "T" {} "E" // (T1, T2, T3, ...) - | "R" // &T - | "Q" // &mut T - | "P" // *const T - | "O" // *mut T - | "G" "E" // generic parameter name - | - | + | // named type + | "A" // [T; N] + | "S" // [T] + | "T" {} "E" // (T1, T2, T3, ...) + | "R" [] // &T + | "Q" [] // &mut T + | "P" // *const T + | "O" // *mut T + | "F" // fn(...) -> ... + | "D" // dyn Trait + Send + 'a + | = "a" // i8 | "b" // bool @@ -447,56 +707,83 @@ Mangled names conform to the following grammar: | "x" // i64 | "y" // u64 | "z" // ! + | "p" // placeholder (e.g. for generic params), shown as _ // If the "U" is present then the function is `unsafe`. -// If the "J" is present then it is followed by the return type of the function. 
- := "F" ["U"] [] {} ["J" ] "E" - - = "K" ( - "d" | // Cdecl - "s" | // Stdcall - "f" | // Fastcall - "v" | // Vectorcall - "t" | // Thiscall - "a" | // Aapcs - "w" | // Win64 - "x" | // SysV64 - "k" | // PtxKernel - "m" | // Msp430Interrupt - "i" | // X86Interrupt - "g" | // AmdGpuKernel - "c" | // C - "x" | // System - "r" | // RustCall - "j" | // RustInstrinsic - "p" | // PlatformInstrinsic - "u" // Unadjusted - ) +// The return type is always present, but demanglers can +// choose to omit the ` -> ()` by special-casing "u". + := ["U"] ["K" ] {} "E" + + = "C" + | + + = {} "E" + = {} + = "p" + = + | "p" // placeholder (e.g. for polymorphic constants), shown as _: T + | + +// The encoding of a constant depends on its type, currently only +// unsigned integers (mainly usize, for arrays) are supported, and they +// use their value, in base 16 (0-9a-f), not their memory representation.. +// +// Note that while exposing target-specific data layout information, such +// as pointer size, endianness, etc. should be avoided as much as possible, +// it might become necessary to include raw bytes, even whole allocation +// subgraphs (that miri created), for const generics with non-trivial types. +// +// However, demanglers could just show the raw encoding without trying to +// turn it into expressions, unless they're part of e.g. a debugger, with +// more information about the target data layout and/or from debuginfo. + = {} "_" // uses 0-9-a-z-A-Z as digits, i.e. 'a' is decimal 10 and // 'Z' is decimal 61. - = "s" [] "_" - - = "I" {} "E" +// "_" with no digits indicates the value 0, while any other value is offset +// by 1, e.g. "0_" is 1, "Z_" is 62, "10_" is 63, etc. + = {<0-9a-zA-Z>} "_" - = "S" [] "_" + = "B" -// We use here, so that we don't have to add a special rule for -// compression. In practice, only is expected. - := +// We use here, so that we don't have to add a special rule for +// compression. In practice, only a crate root is expected. 
+ = ``` +### Namespace Tags + +Namespaces are identified by an implementation-defined single-character tag +(the `` production). Only closures (`C`) and shims (`S`) have a +specific character assigned to them so that demanglers can reliably +adjust their output accordingly. Other namespace tags have to be omitted +or shown verbatim during demangling. + +This is a concession to the compiler's current implementation. While the +language only knows two namespaces (the type and the value namespace), the +compiler uses many more in some important data structures and disambiguation +indices are assigned according to these internal data structures. So, in +order not to force the compiler to waste processing time on re-constructing +different disambiguation indices, the internal unspecified "namespaces" are +used. This may change in the future. + + ### Punycode Identifiers -Punycode generates strings of the form `([[:ascii:]]+-)?[[:alnum:]]+`. This is problematic for two reasons: +Punycode generates strings of the form `([[:ascii:]]+-)?[[:alnum:]]+`. +This is problematic for two reasons: -- Generated strings can contain a `-` character; which is not in the supported character set. -- Generated strings can start with a digit; which makes them clash with the byte-count prefix of the `` production. +- Generated strings can contain a `-` character, which is not in the + supported character set. +- Generated strings can start with a digit, which makes them clash + with the byte-count prefix of the `` production. For these reasons, vanilla Punycode strings are further encoded during mangling: - The `-` character is simply replaced by a `_` character. -- The part of the Punycode string that encodes the non-ASCII characters is a base-36 number, using `[a-z0-9]` as its "digits". We want to get rid of the decimal digits in there, so we simply remap `0-9` to `A-J`. 
+- The part of the Punycode string that encodes the non-ASCII characters + is a base-36 number, using `[a-z0-9]` as its "digits". We want to get + rid of the decimal digits in there, so we simply remap `0-9` to `A-J`. Here are some examples: @@ -508,131 +795,111 @@ Here are some examples: | 🤦 | fq9h | fqJh | | ρυστ | 2xaedc | Cxaedc | -With this post-processing in place the Punycode strings can be treated like regular identifiers and need no further special handling. +With this post-processing in place the Punycode strings can be treated +like regular identifiers and need no further special handling. ## Compression -From a high-level perspective symbol name compression works by substituting parts of the mangled name that have already been seen for a back-reference. Which parts are eligible for substitution is defined via the AST of the name (as described in the previous section). Before going into the actual algorithm, let's define some terms: - -- Two AST nodes are *equivalent* if they contain the same information. In general this means that two nodes are equivalent if the sub-trees they are the root of are equal. However, there is another condition that can make two nodes equivalent. If a node `N` has a single child node `C` and `N` does not itself add any new information, then `N` and `C` are equivalent too. The exhaustive list of these special cases is: - - - `` nodes without a `` child. These are equivalent to their `` child node. - - - `` nodes with a single `` child. These are equivalent to their child node. - - - `` nodes with a single `` child. These too are equivalent to their child node. - - Equivalence is transitive, so given, for example, an AST of the form - - ``` - - | - v - - | - v - - ``` - - then the `` node is equivalent to the `` node. - - - A *substitutable* AST node is any node with a `` on the right-hand side of the production. Thus the exhaustive list of substitutable node types is: ``, ``, and ``. 
There is one exception to this rule: nodes that are *equivalent* to a `` node, are not *substitutable*. - - - The "substitution dictionary" is a mapping from *substitutable* AST nodes to integer indices. - -Given these definitions, compression is defined as follows. - - - Initialize the substitution dictionary to be empty. - - Traverse and modify the AST as follows: - - When encountering a substitutable node `N` there are two cases - 1. If the substitution dictionary already contains an *equivalent* node, replace the children of `N` with a `` that encodes the substitution index taken from the dictionary. - 2. Else, continue traversing through the child nodes of `N`. After the child nodes have been traversed and if the dictionary does not yet contain an *equivalent* node, then allocate the next unused substitution index and add it to the substitution dictionary with `N` as its key. - -The following gives an example of substitution index assignment and node replacements for `foo::Bar::quux` (with `quux` being an inherent method of `foo::Bar`). `#n` designates that the substitution index `n` was assigned to the given node and `:= #n` designates that it is replaced with a ``: - +Symbol name compression works by substituting parts of the mangled +name that have already been seen for a back reference. 
Compression +is directly built into the mangling algorithm, as shown by the +following piece of pseudocode: +```rust +fn mangle(node, output_string, substitution_dictionary) { + if let Some(backref) = substitution_dictionary.get(node) { + // Emit the backref instead of the node's contents + mangle(backref, output_string) + } else { + // Remember where the current node starts in the output + let start_position = output_string.len() + + // Do the actual mangling, including recursive mangling of child nodes + + // Add the current node to the substitution dictionary + if node.is_substitutable() { + substitution_dictionary.insert(node, start_position) + } + } +} ``` - - | - #3 - / \ - #2 - / \ | - := #1 - / | - - | | - - | / \ - #1 - / \ / - #0 - | - -``` - -Some interesting things to note in this example: - - There are substitutable nodes that are not replaced, nor added to the dictionary. This falls out of the equivalence rule. The node marked with `#1` is equivalent to its three immediate ancestors, so no dictionary entries are generated for those. +This algorithm automatically chooses the best compression because +parent nodes (which are always larger) are visited before child +nodes. - - The `` node marked with `:= #1` is replaced by `#1`, which is not a `` but an (equivalent) ``. This is OK and prescribed by the algorithm. The definition of equivalence ensures that there is only one valid way to construct a `` node from a `` node. +Note that this kind of compression relies on the fact that all +substitutable AST nodes have a self-terminating mangled form, +that is, given the start position of the encoded node, the +grammar guarantees that it is always unambiguous where the +node ends. This is ensured by not allowing optional or +repeating elements at the end of substitutable productions. 
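
To make the pseudocode above concrete, here is a minimal runnable sketch of the same idea for a small subset of the grammar (crate roots, nested paths, generic arguments, basic types). The `Node` type and all helper names are hypothetical, identifiers are assumed to need neither a `u` prefix nor a disambiguator, and back references are assumed to be `B` followed by the byte position as a base-62 number. It is an illustration, not the compiler's implementation.

```rust
use std::collections::HashMap;

// Hypothetical toy AST; the real grammar has many more productions.
#[derive(Clone, PartialEq, Eq, Hash)]
enum Node {
    Crate(&'static str),                   // "C" ident
    Nested(char, Box<Node>, &'static str), // "N" namespace path ident
    Generic(Box<Node>, Vec<Node>),         // "I" path {arg} "E"
    Basic(char),                           // basic type tag, e.g. 'm' for u32
}

// Base-62 number as described in the grammar comments:
// "_" is 0, "0_" is 1, "Z_" is 62, "10_" is 63.
fn base62(n: usize) -> String {
    const DIGITS: &[u8] = b"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    let mut digits = Vec::new();
    if n > 0 {
        let mut rest = n - 1;
        loop {
            digits.push(DIGITS[rest % 62]);
            rest /= 62;
            if rest == 0 {
                break;
            }
        }
        digits.reverse();
    }
    digits.push(b'_');
    String::from_utf8(digits).unwrap()
}

// Identifiers are a decimal byte count followed by the bytes themselves.
fn push_ident(out: &mut String, name: &str) {
    out.push_str(&name.len().to_string());
    out.push_str(name);
}

fn mangle(node: &Node, out: &mut String, seen: &mut HashMap<Node, usize>) {
    // Look the whole node up first, so the largest possible unit is replaced.
    if let Some(&pos) = seen.get(node) {
        out.push('B');
        out.push_str(&base62(pos));
        return;
    }
    // Remember where this node starts in the (compressed) output.
    let start = out.len();
    match node {
        Node::Crate(name) => {
            out.push('C');
            push_ident(out, name);
        }
        Node::Nested(ns, parent, name) => {
            out.push('N');
            out.push(*ns);
            mangle(parent, out, seen);
            push_ident(out, name);
        }
        Node::Generic(path, args) => {
            out.push('I');
            mangle(path, out, seen);
            for arg in args {
                mangle(arg, out, seen);
            }
            out.push('E');
        }
        // Basic types are not substitutable, so they are never recorded.
        Node::Basic(tag) => {
            out.push(*tag);
            return;
        }
    }
    seen.insert(node.clone(), start);
}

fn mangle_symbol(root: &Node) -> String {
    let mut out = String::from("_R");
    let mut seen = HashMap::new();
    mangle(root, &mut out, &mut seen);
    out
}

// std::iter::Zip<std::vec::IntoIter<u32>, std::vec::IntoIter<u32>>
fn zip_of_into_iters() -> Node {
    let std_mod = |name: &'static str| Node::Nested('t', Box::new(Node::Crate("std")), name);
    let into_iter = Node::Generic(
        Box::new(Node::Nested('t', Box::new(std_mod("vec")), "IntoIter")),
        vec![Node::Basic('m')],
    );
    Node::Generic(
        Box::new(Node::Nested('t', Box::new(std_mod("iter")), "Zip")),
        vec![into_iter.clone(), into_iter],
    )
}
```

With this sketch, `mangle_symbol(&zip_of_into_iters())` yields `_RINtNtC3std4iter3ZipINtNtB6_3vec8IntoItermEBk_E`. The second `IntoIter` argument becomes a single back reference to the whole generic node at byte 21, because the dictionary lookup happens before descending into children.
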
-## Decompression +### Decompression -Decompression works analogously to compression, only this time, the substitution dictionary maps substitution indices to nodes instead of the other way round: +Decompression too is built directly into demangling/parsing. When a back +reference is encountered, we decode the referenced position and use a +temporary demangler/parser to do the decoding of the node's actual content: - - Initialize the substitution dictionary to be empty. - - Traverse and modify the AST as follows: - - When encountering a substitutable node `N` there are two cases - 1. If the node has a single `` child, extract the substitution index from it and replace the node with the corresponding entry from the substitution dictionary. - 2. Else, continue traversing the child nodes of the current node. After the child nodes have been traversed, and if the dictionary does not yet contain an *equivalent* node, then allocate the next unused substitution index and add it to the substitution dictionary with `N` as its value. - -This is what the example from above looks like for decompression: - -``` - - | - #3 - / \ - #2 - / \ | - := #1 - / | - - | - - | - #1 - / \ - #0 - | - +```rust +fn demangle_at(&mut pos, mangled, output_string) { + if is_backref(*pos, mangled) { + // Read the byte offset of the referenced node and + // advance `pos` past the backref. + let mut referenced_pos = decode(pos, mangled); + demangle_at(&mut referenced_pos, mangled, output_string) + } else { + // do regular demangling + } +} ``` +Using byte offsets as backref keys (as this RFC does) instead of post-order +traversal indices (as Itanium mangling does) has the advantage that the +demangler does not need to duplicate the mangler's substitution indexing logic, +something that can become quite complex (as demonstrated by the compression +scheme proposed in the initial version of this RFC). 
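
As a rough illustration of the demangling pseudocode above, the following sketch follows back references while parsing a small subset of the grammar (crate roots, nested paths, generic arguments, and three basic types). All function names, the `Option`-based error handling, and the `MAX_BACKREF_DEPTH` guard are assumptions of this sketch rather than part of the RFC. It also assumes that a back reference is `B` followed by a base-62 byte position counted from the start of the symbol, and that `m` stands for `u32`.

```rust
const MAX_BACKREF_DEPTH: usize = 32; // arbitrary guard against cyclic input

// Base-62 number ending in '_': "_" is 0, "0_" is 1, "Z_" is 62, "10_" is 63.
fn read_base62(s: &[u8], pos: &mut usize) -> Option<usize> {
    let (mut value, mut seen_digit) = (0usize, false);
    loop {
        let b = *s.get(*pos)?;
        *pos += 1;
        let digit = match b {
            b'_' => return if seen_digit { value.checked_add(1) } else { Some(0) },
            b'0'..=b'9' => b - b'0',
            b'a'..=b'z' => b - b'a' + 10,
            b'A'..=b'Z' => b - b'A' + 36,
            _ => return None,
        };
        value = value.checked_mul(62)?.checked_add(digit as usize)?;
        seen_digit = true;
    }
}

// Decimal byte count followed by that many bytes.
fn read_ident(s: &[u8], pos: &mut usize) -> Option<String> {
    let start = *pos;
    while s.get(*pos)?.is_ascii_digit() {
        *pos += 1;
    }
    let len = std::str::from_utf8(&s[start..*pos]).ok()?.parse::<usize>().ok()?;
    let end = (*pos).checked_add(len)?;
    let name = std::str::from_utf8(s.get(*pos..end)?).ok()?.to_string();
    *pos = end;
    Some(name)
}

fn demangle_arg(s: &[u8], pos: &mut usize, out: &mut String, depth: usize) -> Option<()> {
    let basic = match *s.get(*pos)? {
        b'b' => "bool",
        b'm' => "u32",
        b'y' => "u64",
        _ => return demangle_path(s, pos, out, depth),
    };
    *pos += 1;
    out.push_str(basic);
    Some(())
}

fn demangle_path(s: &[u8], pos: &mut usize, out: &mut String, depth: usize) -> Option<()> {
    let tag = *s.get(*pos)?;
    *pos += 1;
    match tag {
        b'B' => {
            // Decode the referenced position, then parse there with a
            // temporary cursor; `pos` already points past the backref.
            let backref_start = *pos - 1;
            let mut target = read_base62(s, pos)?;
            if target >= backref_start || depth >= MAX_BACKREF_DEPTH {
                return None;
            }
            demangle_path(s, &mut target, out, depth + 1)
        }
        b'C' => {
            out.push_str(&read_ident(s, pos)?);
            Some(())
        }
        b'N' => {
            s.get(*pos)?; // namespace tag, not shown in this sketch
            *pos += 1;
            demangle_path(s, pos, out, depth)?;
            out.push_str("::");
            out.push_str(&read_ident(s, pos)?);
            Some(())
        }
        b'I' => {
            demangle_path(s, pos, out, depth)?;
            out.push('<');
            let mut first = true;
            while *s.get(*pos)? != b'E' {
                if !first {
                    out.push_str(", ");
                }
                first = false;
                demangle_arg(s, pos, out, depth)?;
            }
            *pos += 1; // skip the closing "E"
            out.push('>');
            Some(())
        }
        _ => None,
    }
}

fn demangle(mangled: &str) -> Option<String> {
    let s = mangled.as_bytes();
    if !mangled.starts_with("_R") {
        return None;
    }
    let (mut pos, mut out) = (2, String::new());
    demangle_path(s, &mut pos, &mut out, 0)?;
    if pos == s.len() { Some(out) } else { None }
}
```

Because a back reference is just a position in the input, the demangler needs no dictionary of its own; it re-parses the referenced bytes and appends the result to the output. For example, `demangle("_RNtC3std4iter")` returns `Some("std::iter")` under the assumptions above.
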
+ ### A Note On Implementing Efficient Demanglers -The mangling syntax is constructed in a way that allows for implementing efficient demanglers: +The mangling syntax is constructed in a way that allows for implementing +efficient demanglers: - - Mangled names contain information in the same order as unmangled names are expected to contain it. Therefore, a demangler can directly generate its output while parsing the mangled form. There is no need to explicitly instantiate the AST in memory. + - Mangled names contain information in the same order as unmangled + names are expected to contain it. Therefore, a demangler can directly + generate its output while parsing the mangled form. There is no need + to explicitly instantiate the AST in memory. - - The same is true for decompression. The demangler can keep a simple array that maps substitution indices to ranges in the already generated output. When it encounters a `` in need of expansion, it can just look up corresponding range and do a simple `memcpy`. + - The same is true for decompression. Decompression can be done without + allocating memory outside of the stack. Alternatively the demangler + can keep a simple array that maps back-ref indices to ranges in the + already generated output. When it encounters a `` in need + of expansion, it can just look up the corresponding range and do a + simple `memcpy`. -Parsing, decompression, and demangling can thus be done in a single pass over the mangled name without the need to do dynamic allocation except for the dictionary array. +Parsing, decompression, and demangling can thus be done in a single pass +over the mangled name without the need for complex data structures, which +is useful when having to implement `#[no_std]` or C demanglers. ## Mapping Rust Language Entities to Symbol Names This RFC suggests the following mapping of Rust entities to mangled names: -- Free standing named functions and types shall be represented by an `` production. 
+- Named functions, methods, and statics shall be represented by a + `` production. -- Absolute paths should be rooted at the inner-most entity that can act as a path root. Roots can be crate-ids, types (for entities with an inherent impl in their path), and trait impls (for entities with trait impls in their path). +- Paths should be rooted at the inner-most entity that can act + as a path root. Roots can be crate-ids, inherent impls, trait impls, and + (for items within default methods) trait definitions. -- The disambiguation index for an identifier in the type, value, and closure namespaces is determined by counting the number of occurrences of that identifier within its parent context (i.e. the fully macro-expanded AST). The disambiguation index `0` is represented by omitting the `` production (which should be the common case). +- The compiler is free to choose disambiguation indices and namespace tags from + the reserved ranges as long as it ascertains identifier unambiguity. -- The compiler is free to choose disambiguation indices for specializing trait impls. Disambiguation indices do not need to be densely packed. In particular the compiler can use arbitrary hashes to disambiguate specializing trait impls. +- Generic arguments that are equal to the default should not be encoded in + order to save space. # Drawbacks @@ -640,8 +907,11 @@ This RFC suggests the following mapping of Rust entities to mangled names: Why should we *not* do this? -- The scheme is complex, especially due to compression---albeit less complex than prior art and probably not more complex than the current scheme, if we were to describe that formally. -- The current/legacy scheme based on symbol-hashes is flexible in that hashes can be changed at will. That is, the unstable part of the current mangling scheme is nicely contained and does not keep breaking external tools. The danger of breakage is greater with the scheme proposed here because it exposes more information. 
+- The current/legacy scheme based on symbol-hashes is flexible in that + hashes can be changed at will. That is, the unstable part of the + current mangling scheme is nicely contained and does not keep breaking + external tools. The danger of breakage is greater with the scheme + proposed here because it exposes more information. # Rationale and alternatives @@ -649,29 +919,50 @@ Why should we *not* do this? The alternatives considered are: - 1. Keeping the current scheme. It does meet the minimum requirements after all. However, the general consensus seems to be that leads to situations where people are unpleasantly surprised when they come across (demangled) symbol names in backtraces or profilers. - - 2. Keeping the current scheme but cleaning it up by making the non-hash part more consistent and more expressive. Keep the hash part as a safeguard against symbol conflicts and the rest as something just for demangling. The downside of this is that the hash would still not be predictable, and symbols would get rather long if they should contain more human-readable information about generic arguments. - - 2. Define a standardized pretty-printing format for things that end up as symbols, and then encode that via Punycode in order to meet the character set restrictions. This would be rather simple. Symbol names would remain somewhat human-readable (but not very, because all separators would be stripped out). But without some kind of additional compression, symbol names would become rather long. - - 3. Use the scheme from the previous bullet point but apply the compression scheme described above. We could do this but it wouldn't really be less complex than the Itanium inspired scheme proposed above. - - 4. Define a standardized pretty-printing format for things that end up as symbols, compress with zstd (specially trained for Rust symbols) and encode the result as base63. This is rather simple but loses all human-readability. It's unclear how well this would compress. 
It would pull the zstd specification into the mangling scheme specification, as well as the pre-trained dictionary. - -The Itanium mangling (and by extension the scheme proposed here) could be considered somewhat arcane. But it is well-known from C++ and provides a good trade-off between readability, complexity, and length of generated symbols. + 1. Keeping the current scheme. It does meet the minimum requirements + after all. However, the general consensus seems to be that this + leads to situations where people are unpleasantly surprised when + they come across (demangled) symbol names in backtraces or profilers. + + 2. Keeping the current scheme but cleaning it up by making the non-hash + part more consistent and more expressive. Keep the hash part as a + safeguard against symbol conflicts and the rest as something just + for demangling. The downside of this is that the hash would still + not be predictable, and symbols would get rather long if they should + contain more human-readable information about generic arguments. + + 3. Define a standardized pretty-printing format for things that end up + as symbols, and then encode that via Punycode in order to meet the + character set restrictions. This would be rather simple. Symbol names + would remain somewhat human-readable (but not very, because all + separators would be stripped out). But without some kind of additional + compression, symbol names would become rather long. + + 4. Use the scheme from the previous bullet point but apply the compression + scheme described above. We could do this but it wouldn't really be less + complex than the scheme proposed by the RFC. + + 5. Define a standardized pretty-printing format for things that end up as + symbols, compress with `zstd` (specially trained for Rust symbols) and + encode the result as `base63`. This is rather simple but loses all + human-readability. It's unclear how well this would compress. 
It would + pull the `zstd` specification into the mangling scheme specification, + as well as the pre-trained dictionary. # Prior art [prior-art]: #prior-art -The mangling scheme described here is an adaptation of the [Itanium C++ ABI][itanium-mangling] scheme, -which is the scheme used by the GCC toolchain (and clang when it's not compiling for MSVC). - -One notable improvement the proposed scheme makes upon Itanium mangling is explicit handling of -unicode identifiers. The idea of using [Punycode][punycode] for this is taken from the -[Swift][swift-gh] programming language's [mangling scheme][swift-mangling] (which is also based on -Itanium mangling). +One of the major modern mangling schemes with a public specification is the +[Itanium C++ ABI][itanium-mangling] scheme for C++ which is used by the GCC +toolchain. An initial version of this RFC stuck closely to Itanium mangling; +however, the latest version only retains the run-length encoding for +identifiers and some literals for tagging things like basic types. The +Itanium scheme has been criticized for being overly complex, due to its +extensive grammar and two separate compression schemes. +The idea of using [Punycode][punycode] for handling of unicode identifiers +is taken from the [Swift][swift-gh] programming language's +[mangling scheme][swift-mangling]. [punycode]: https://tools.ietf.org/html/rfc3492 [itanium-mangling]: http://refspecs.linuxbase.org/cxxabi-1.86.html#mangling @@ -682,42 +973,95 @@ Itanium mangling). # Unresolved questions [unresolved-questions]: #unresolved-questions +### Complex Constant Data + +The RFC encodes constant values via the ` = {} "_"` +production, where `{}` is the numeric value of the constant, not +its representation as bytes. Using the numeric value is platform independent +but does not easily scale to non-integer data types. + +It is unclear if this is something that needs to be resolved now or can +be left for a future version of the mangling scheme.
+ + ### Punycode vs UTF-8 -During the pre-RFC phase, it has been suggested that Unicode identifiers should be encoded as UTF-8 instead of Punycode on platforms that allow it. GCC, Clang, and MSVC seem to do this. The author of the RFC has a hard time making up their mind about this issue. Here are some interesting points that might influence the final decision: +During the pre-RFC phase, it has been suggested that Unicode identifiers +should be encoded as UTF-8 instead of Punycode on platforms that allow it. +GCC, Clang, and MSVC seem to do this. The author of the RFC has a hard +time making up their mind about this issue. Here are some interesting +points that might influence the final decision: + +- Using UTF-8 instead of Punycode would make mangled strings containing + non-ASCII identifiers a bit more human-readable. For demangled strings, + there would be no difference. + +- Punycode support is non-optional since some platforms only allow a very + limited character set for symbol names. Thus, we would be using UTF-8 on + some platforms and Punycode on others, making it harder to predict what a + symbol name for a given item looks like. -- Using UTF-8 instead of Punycode would make mangled strings containing non-ASCII identifiers a bit more human-readable. For demangled strings, there would be no difference. +- Punycode encoding and decoding is more runtime effort for the mangler + and demangler. -- Punycode support is non-optional since some platforms only allow a very limited character set for symbol names. Thus, we would be using UTF-8 on some platforms and Punycode on others, making it harder to predict what a symbol name for a given item looks like. +- Once a demangler supports Punycode, it is not much effort to support + both encodings. The `u` identifier prefix tells the demangler whether + it is Punycode. Otherwise it can just assume UTF-8 which already + subsumes ASCII. 
-- Punycode encoding and decoding is more runtime effort for the mangler and demangler. -- Once a demangler supports Punycode, it is not much effort to support both encodings. The `u` identifier suffix tells the demangler whether it's Punycode. Otherwise it can just assume UTF-8 which already subsumes ASCII. +### Encoding parameter types for function symbols -### Re-use for crate disambiguator +It has been suggested that parameter types for functions and methods should +be encoded in mangled form too. This is not necessary for symbol name +uniqueness but it would provide an additional safeguard against silent +ABI-related errors where definition and callers of some function make +different assumptions about what parameters a function takes. The RFC +does not propose to do this because: + + - Rust makes sure this cannot happen via crate metadata, + - it would make symbol names longer, and + - only some but not all ABI related errors are caught by the safeguard. + +However, a final decision on the topic has not been made yet. -The RFC currently proposes to represent crate-ids as an `` of the form `_`. However, the `` production already supports disambiguation via its `` component. The crate disambiguator could be encoded into an disambiguation index. # Appendix A - Suggested Demangling -This RFC suggests that names are demangled to a form that matches Rust syntax as it is used in source code, compiler error messages and `rustdoc`: +This RFC suggests that names are demangled to a form that matches Rust +syntax as it is used in source code, compiler error messages and `rustdoc`: - Path components should be separated by `::`. - - If the path root is a `` it should be printed as the crate name. If the context requires it for correctness, the crate disambiguator should be printed too, as in, for example, `std[a0b1c2d3]::collections::HashMap`. In this case `a0b1c2d3` would be the disambiguator. Usually, the disambiguator can be omitted for better readability. 
+ - If the path root is a `` it should be printed as the crate name. + If the context requires it for correctness, the crate disambiguator can be + printed too, as in, for example, `std[a0b1c2d3]::collections::HashMap`. + In this case `a0b1c2d3` would be the disambiguator. Usually, the + disambiguator can be omitted for better readability. - - If the path root is a trait impl, it should be printed as ``, like the compiler does in error messages. + - If the path root is an impl, it should be printed as `` (for + inherent impls) or `` (for trait impls), like the + compiler does in error messages. The `` also contained in the + AST node should usually be omitted. - The list of generic arguments should be demangled as ``. - - Identifiers and trait impl path roots can have a numeric disambiguator (the `` production). The syntactic version of the numeric disambiguator maps to a numeric index. If the disambiguator is not present, this index is 0. If it is of the form `s_` then the index is 1. If it is of the form `s_` then the index is ` + 2`. The suggested demangling of a disambiguator is `[]`. However, for better readability, these disambiguators should usually be omitted in the demangling altogether. Disambiguators with index zero can always be omitted. - - The exception here are closures. Since these do not have a name, the disambiguator is the only thing identifying them. The suggested demangling for closures is thus `{closure}[]`. + - Identifiers can have a numeric disambiguator + (the `` production). The syntactic version of the numeric + disambiguator maps to a numeric index. If the disambiguator is not + present, this index is 0. If it is of the form `s_` then the index is 1. + If it is of the form `s_` then the index is + ` + 2`. The suggested demangling of a disambiguator is + `[]`. However, for better readability, these disambiguators + should usually be omitted in the demangling altogether. Disambiguators + with index zero can _always_ be omitted. 
- - In a lossless demangling, identifiers from the value namespace should be marked with a `'` suffix in order to avoid conflicts with identifiers from the type namespace. In a user-facing demangling, where such conflicts are acceptable, the suffix can be omitted. + The exception here are closures. Since these do not have a name, the + disambiguator is the only thing identifying them. The suggested + demangling for closures is thus `{closure}[]`. -# Appendix B - Interesting Examples +# Appendix B - Examples -We assume that all examples are defined in a crate named `mycrate[xxx]`. +We assume that all examples are defined in a crate named `mycrate[1234]`. ### Free-standing Item @@ -730,7 +1074,7 @@ mod foo { } ``` - unmangled: `mycrate::foo::bar::baz` -- mangled: `_RN11mycrate_xxx3foo3bar3bazVE` +- mangled: `_RNvNtNtCs1234_7mycrate3foo3bar3baz` ### Item Defined In Inherent Method @@ -745,42 +1089,29 @@ impl Foo { } } ``` -- unmangled: `mycrate::Foo::bar::QUUX` -- mangled: `_RNMN11mycrate_xxx3FooE3barV4QUUXVE` +- unmangled: `>::bar::QUUX` +- mangled: `_RNvNvMCs1234_7mycrateINtCs1234_7mycrate3FoopE3bar4QUUX` -### Item Defined In Trait Method - -```rust -struct Foo(T); - -impl Clone for Foo { - fn clone(_: U) { - static QUUX: u32 = 0; - // ... - } -} -``` -- unmangled: `::clone::QUUX` -- mangled: `_RNXN11mycrate_xxx3FooEN7std_yyy5clone5CloneE5cloneV4QUUXVE` +### Item Defined In Trait Method -### Item Defined In Specializing Trait Impl ```rust struct Foo(T); impl Clone for Foo { - default fn clone(_: U) { + fn clone(&self) -> Self { static QUUX: u32 = 0; // ... 
} } ``` -- unmangled: `[1234]::clone::QUUX` -- mangled: `_RNXN11mycrate_xxx3FooEN7std_yyy5clone5CloneEsjU_5cloneV4QUUXVE` +- unmangled: ` as std::clone::Clone>::clone::QUUX` +- mangled: `_RNvNvXCs1234_7mycrateINtCs1234_7mycrate3FoopENtNtC3std5clone5Clone5clone4QUUX` ### Item Defined In Initializer Of A Static + ```rust pub static QUUX: u32 = { static FOO: u32 = 1; @@ -788,14 +1119,28 @@ pub static QUUX: u32 = { }; ``` - unmangled: `mycrate::QUUX::FOO` -- mangled: `_RN11mycrate_xxx4QUUXV3FOOVE` +- mangled: `_RNvNvCs1234_7mycrate4QUUX3FOO` -### Compressed Prefix Constructed From Prefix That Contains A Substitution Itself -- unmangled: `std[xxx]::foo` -- mangled: `_RN7std_xxx3fooFINS_3barFENS1_3bazFEEE` +### Compressed Prefix Constructed From Prefix That Contains A Substitution Itself - TODO +- unmangled: `mycrate::foo` +- mangled: `_RINvCs1234_7mycrate3fooNvB4_3barNvBn_3bazE` ### Progressive type compression -- unmangled: `std[xxx]::foo<(std[xxx]::Bar,std[xxx]::Bar),(std[xxx]::Bar,std[xxx]::Bar)>` -- mangled: `_RN7std_xxx3fooITNS_3BarES1_ES2_EE` +- unmangled: `std::foo<(std::Bar,std::Bar),(std::Bar,std::Bar)>` +- mangled: `_RINxC3std3fooTNyB4_3BarBe_EBd_E` + + +# Appendix C - Change LOG +- Removed mention of Itanium mangling in introduction. +- Weakened "predictability" goal. +- Removed non-goal of not providing a mangling for lifetimes. +- Added non-goal for not trying to standardize the demangled form. +- Updated specification and examples to new grammar as proposed by eddyb. +- `impl` disambiguation strategy changed to using the impl path instead of param bounds. +- Updated prior art section to not say this RFC is an adaptation of Itanium mangling. +- Updated compiler's expected assignment of disambiguation indices and namespace tags. +- Removed "complexity" drawback since the scheme is not very complex anymore. +- Removed unresolved question "Re-use `` for crate disambiguator". +- Added note about default generic arguments to reference-level-explanation. 
\ No newline at end of file From 29d5c93be1c0567b6d035dc2b52a8c2964cf11ac Mon Sep 17 00:00:00 2001 From: Michael Woerister Date: Tue, 16 Apr 2019 10:03:40 +0200 Subject: [PATCH 13/18] Add note about Punycode making decoding more complicated. --- text/0000-symbol-name-mangling-v2.md | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/text/0000-symbol-name-mangling-v2.md b/text/0000-symbol-name-mangling-v2.md index faba73376a7..8273c52f9e0 100644 --- a/text/0000-symbol-name-mangling-v2.md +++ b/text/0000-symbol-name-mangling-v2.md @@ -881,8 +881,10 @@ efficient demanglers: Parsing, decompression, and demangling can thus be done in a single pass over the mangled name without the need for complex data structures, which -is useful when having to implement `#[no_std]` or C demanglers. - +is useful when having to implement `#[no_std]` or C demanglers. (Note that +Punycode can complicate decoding slightly because it needs dynamic memory +allocation in the general case but it can be implemented with an on-stack +buffer for a reasonable maximum supported length). ## Mapping Rust Language Entities to Symbol Names @@ -1143,4 +1145,5 @@ pub static QUUX: u32 = { - Updated compiler's expected assignment of disambiguation indices and namespace tags. - Removed "complexity" drawback since the scheme is not very complex anymore. - Removed unresolved question "Re-use `` for crate disambiguator". -- Added note about default generic arguments to reference-level-explanation. \ No newline at end of file +- Added note about default generic arguments to reference-level-explanation. +- Added note about Punycode making decoding more complicated. From 8d7673184be3c33efb6bb4e1c1efa10111da420a Mon Sep 17 00:00:00 2001 From: Michael Woerister Date: Tue, 16 Apr 2019 10:19:58 +0200 Subject: [PATCH 14/18] Update some wording as suggested by eddyb. 
--- text/0000-symbol-name-mangling-v2.md | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git a/text/0000-symbol-name-mangling-v2.md b/text/0000-symbol-name-mangling-v2.md index 8273c52f9e0..188a27aa405 100644 --- a/text/0000-symbol-name-mangling-v2.md +++ b/text/0000-symbol-name-mangling-v2.md @@ -447,8 +447,8 @@ the `impl` to the symbol name. ``` = C // crate-id root - | M // inherent impl root - | X // trait impl root + | M // inherent impl root + | X // trait impl root | N // nested path | I {} E // generic arguments @@ -560,7 +560,8 @@ occurrence of which could be replaced by a back reference. The things that are eligible for substitution are (1) all prefixes of paths (including the entire path itself), (2) all types except for -basic types, and (3) instances of const data. +basic types, and (3) type-level constants (array lengths and values +passed to const generic params). Here's an example in order to illustrate the concept. The name @@ -574,12 +575,12 @@ by a back reference. The number at the beginning of each span gives the 0-based byte position of where it occurred the first time. ``` -0 10 20 30 40 50 60 70 80 90 + 0 10 20 30 40 50 60 70 80 90 _RINtNtC3std4iter5ChainINtNtC3std4iter3ZipINtNtC3std3vec8IntoItermEINtNtC3std3vec8IntoItermEEE - 7---- 7---- 7---- - 5----------- 45--------- - 43-------------------- - 42----------------------- + 5---- 5---- 5---- + 3----------- 43--------- + 41-------------------- + 40----------------------- ``` The compiler is always supposed to use the longest replacement possible @@ -587,7 +588,7 @@ in order to achieve the best compression.
The compressed symbol looks as follows: ``` -_RINtNtC3std4iter5ChainINtB4_3ZipINtNtB6_3vec8IntoItermEBv_EE +_RINtNtC3std4iter5ChainINtB2_3ZipINtNtB4_3vec8IntoItermEBt_EE ^^^ ^^^ ^^^ back references ``` From ffd441fd20d711fb2697cee61451385b70a55627 Mon Sep 17 00:00:00 2001 From: Michael Woerister Date: Tue, 16 Apr 2019 10:27:38 +0200 Subject: [PATCH 15/18] Resolve question of const generics by delegating the question to a future version. --- text/0000-symbol-name-mangling-v2.md | 35 +++++++++++----------------- 1 file changed, 14 insertions(+), 21 deletions(-) diff --git a/text/0000-symbol-name-mangling-v2.md b/text/0000-symbol-name-mangling-v2.md index 188a27aa405..1407684e19d 100644 --- a/text/0000-symbol-name-mangling-v2.md +++ b/text/0000-symbol-name-mangling-v2.md @@ -727,16 +727,7 @@ Mangled names conform to the following grammar: // The encoding of a constant depends on its type, currently only // unsigned integers (mainly usize, for arrays) are supported, and they -// use their value, in base 16 (0-9a-f), not their memory representation.. -// -// Note that while exposing target-specific data layout information, such -// as pointer size, endianness, etc. should be avoided as much as possible, -// it might become necessary to include raw bytes, even whole allocation -// subgraphs (that miri created), for const generics with non-trivial types. -// -// However, demanglers could just show the raw encoding without trying to -// turn it into expressions, unless they're part of e.g. a debugger, with -// more information about the target data layout and/or from debuginfo. +// use their value, in base 16 (0-9a-f), not their memory representation. = {} "_" // uses 0-9-a-z-A-Z as digits, i.e. 'a' is decimal 10 and @@ -768,6 +759,18 @@ order not to force the compiler to waste processing time on re-constructing different disambiguation indices, the internal unspecified "namespaces" are used. This may change in the future. 
+### Type-Level Constants + +As described above, the grammar encodes constant values via the +` = {} "_"` production, where `{}` is +the numeric value of the constant, not its representation as bytes. Using +the numeric value is platform independent but does not easily scale to +non-integer data types. + +In the future it is likely that Rust will support complex type-level +constants (i.e. not just integers). This RFC suggests to develop a +proper mangling for these as part of the future const-generics work, +and, for now, only define a mangling for integer values. ### Punycode Identifiers @@ -976,17 +979,6 @@ is taken from the [Swift][swift-gh] programming language's # Unresolved questions [unresolved-questions]: #unresolved-questions -### Complex Constant Data - -The RFC encodes constant values via the ` = {} "_"` -production, where `{}` is the numeric value of the constant, not -its representation as bytes. Using the numeric value is platform independent -but does not easily scale to non-integer data types. - -It is unclear if this is something that needs to be resolved now or can -be left for a future version of the mangling scheme. - - ### Punycode vs UTF-8 During the pre-RFC phase, it has been suggested that Unicode identifiers should be encoded as UTF-8 instead of Punycode on platforms that allow it. @@ -1148,3 +1140,4 @@ pub static QUUX: u32 = { - Removed unresolved question "Re-use `` for crate disambiguator". - Added note about default generic arguments to reference-level-explanation. - Added note about Punycode making decoding more complicated. +- Resolve question of complex constant data. From f8e3ab65f34e0e7e5721e945d0d96d4c59861bb9 Mon Sep 17 00:00:00 2001 From: Michael Woerister Date: Tue, 16 Apr 2019 10:35:56 +0200 Subject: [PATCH 16/18] Add a recommended resolution for open question around Punycode identifiers. 
--- text/0000-symbol-name-mangling-v2.md | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/text/0000-symbol-name-mangling-v2.md b/text/0000-symbol-name-mangling-v2.md index 1407684e19d..e2b08353dcd 100644 --- a/text/0000-symbol-name-mangling-v2.md +++ b/text/0000-symbol-name-mangling-v2.md @@ -1003,6 +1003,10 @@ points that might influence the final decision: it is Punycode. Otherwise it can just assume UTF-8 which already subsumes ASCII. +**UPDATE**: This RFC recommends that Punycode encoded identifiers must +be supported by demanglers but that it is up to the compiler implementation +(for now) to decide whether to use it for a given platform. This question +will have to be revisited if Rust ever wants to define a stable ABI. ### Encoding parameter types for function symbols @@ -1141,3 +1145,4 @@ pub static QUUX: u32 = { - Added note about default generic arguments to reference-level-explanation. - Added note about Punycode making decoding more complicated. - Resolve question of complex constant data. +- Add a recommended resolution for open question around Punycode identifiers. From 30b477aa31c19888285001bc586c26d7d02c722e Mon Sep 17 00:00:00 2001 From: Michael Woerister Date: Tue, 16 Apr 2019 10:42:02 +0200 Subject: [PATCH 17/18] Add a recommended resolution for open question around encoding function parameter types. --- text/0000-symbol-name-mangling-v2.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/text/0000-symbol-name-mangling-v2.md b/text/0000-symbol-name-mangling-v2.md index e2b08353dcd..48cf296f57f 100644 --- a/text/0000-symbol-name-mangling-v2.md +++ b/text/0000-symbol-name-mangling-v2.md @@ -1023,6 +1023,13 @@ does not propose to do this because: However, a final decision on the topic has not been made yet. +**UPDATE**: This RFC suggests that parameter types are *not* encoded into +function and method symbols. 
Symbol names will already get significantly +longer due to encoding additional information and the additional +safeguard provided against ABI mismatches is less relevant for Rust +than it is for other languages that don't have a concept of +library/crate metadata. + # Appendix A - Suggested Demangling @@ -1146,3 +1153,4 @@ pub static QUUX: u32 = { - Added note about Punycode making decoding more complicated. - Resolve question of complex constant data. - Add a recommended resolution for open question around Punycode identifiers. +- Add a recommended resolution for open question around encoding function parameter types. From 644025d640c822e383106f04e2a751ceeb4f3047 Mon Sep 17 00:00:00 2001 From: Mazdak Farrokhzad Date: Fri, 10 May 2019 18:03:56 +0200 Subject: [PATCH 18/18] RFC 2603 --- ...-name-mangling-v2.md => 2603-symbol-name-mangling-v2.md} | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) rename text/{0000-symbol-name-mangling-v2.md => 2603-symbol-name-mangling-v2.md} (99%) diff --git a/text/0000-symbol-name-mangling-v2.md b/text/2603-symbol-name-mangling-v2.md similarity index 99% rename from text/0000-symbol-name-mangling-v2.md rename to text/2603-symbol-name-mangling-v2.md index 48cf296f57f..e082ced2adf 100644 --- a/text/0000-symbol-name-mangling-v2.md +++ b/text/2603-symbol-name-mangling-v2.md @@ -1,7 +1,7 @@ -- Feature Name: symbol_name_mangling_v2 +- Feature Name: `symbol_name_mangling_v2` - Start Date: 2018-11-27 -- RFC PR: (leave this empty) -- Rust Issue: (leave this empty) +- RFC PR: [rust-lang/rfcs#2603](https://github.com/rust-lang/rfcs/pull/2603) +- Rust Issue: [rust-lang/rust#60705](https://github.com/rust-lang/rust/issues/60705) # Summary [summary]: #summary
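Reviewer note on the Appendix A hunks earlier in this series: the rule that maps a syntactic disambiguator to a numeric index (absent means 0, `s_` means 1, and `s` followed by a base-62 value and `_` means that value plus 2) can be sketched in Rust roughly as follows. This is a minimal illustration under stated assumptions, not code from the RFC or from rustc. The function names `base62_digit` and `disambiguator_index` are hypothetical, the digit order (0-9, then a-z, then A-Z) is taken from the grammar comment quoted in the patches, and accepting more than one digit is an assumption made here for completeness.

```rust
/// Maps one base-62 digit to its value.
/// Assumed digit order: '0'-'9' = 0..=9, 'a'-'z' = 10..=35, 'A'-'Z' = 36..=61.
fn base62_digit(c: char) -> Option<u64> {
    match c {
        '0'..='9' => Some(c as u64 - '0' as u64),
        'a'..='z' => Some(c as u64 - 'a' as u64 + 10),
        'A'..='Z' => Some(c as u64 - 'A' as u64 + 36),
        _ => None,
    }
}

/// Converts an optional disambiguator (hypothetical helper) to its index:
///   - no disambiguator  -> 0
///   - "s_"              -> 1
///   - "s" digits "_"    -> base-62 value of the digits + 2
/// Returns `None` for malformed input or on arithmetic overflow.
fn disambiguator_index(disambiguator: Option<&str>) -> Option<u64> {
    let text = match disambiguator {
        None => return Some(0),
        Some(t) => t,
    };

    // A disambiguator must start with 's' and end with '_'.
    let digits = text.strip_prefix('s')?.strip_suffix('_')?;
    if digits.is_empty() {
        return Some(1);
    }

    // Accumulate the base-62 value, most significant digit first.
    let mut value: u64 = 0;
    for c in digits.chars() {
        value = value.checked_mul(62)?.checked_add(base62_digit(c)?)?;
    }
    value.checked_add(2)
}
```

A demangler following the suggestion in Appendix A would print the resulting index in square brackets only where needed (for example for closures) and omit index 0 entirely.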