Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow for non-ASCII decoding in legacy demangling #65

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

addisoncrump
Copy link

There are some rare situations (I'm currently in one) where one needs non-ASCII demangling. This change implements non-ASCII support for legacy demangling.

@addisoncrump addisoncrump marked this pull request as draft September 28, 2023 01:38
@bjorn3
Copy link
Member

bjorn3 commented Sep 28, 2023

🐇 is not a valid character inside symbols. Some linkers will reject it and some tools don't handle it, not just rustc-demangle. Rustc would always encode it into something like $81$$81$$81$ (this is not the actual encoding, but just to give an idea)

@addisoncrump
Copy link
Author

addisoncrump commented Sep 28, 2023

From what I've observed, it is inconsistently handled. c++filt and llvm-cxxfilt don't demangle it. That said, GDB and LLDB do demangle it. Since the cost of this change is effectively negligible1 and there's no technical debt associated with maintaining this change (it is very small), I don't see why it is unfavourable to support the broadest set here.

In my case, this is an intentional post-processing redefinition to flag the symbols in a way that's visually recognisable. I've tested with mold, lld, and ld, and each accept these symbols. (I would note that it would be surprising if they could not support it, since the symbols are effectively copy-pasted at link-time -- and they are required to be supported for even more strange cases, like linkage with invalid UTF-8.)

Here's a screenshot from GDB with an example from one of the binaries built using our library:
image

1Actually, it might be more performant since we don't scan the whole string ahead of time + the existing check does not account for special ASCII like CR/TAB/etc anyways.

@addisoncrump addisoncrump marked this pull request as draft September 28, 2023 21:28
@addisoncrump
Copy link
Author

Unfortunately this PR is incomplete -- and perhaps more invasive than I originally thought. The existing strategy for demangle_line aggressively filters based on some assumptions about the input shape. I am refactoring this now.

@addisoncrump addisoncrump marked this pull request as ready for review September 28, 2023 22:34
Comment on lines -105 to +91
s = &s[..i];
thinlto_stripped = &s[..i];
Copy link
Author

@addisoncrump addisoncrump Sep 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note a subtle change here: previously, this would strip the thinlto data from s before assigning original to it, which could lead to the removal of symbol data before passing it to original. This seemed to be erroneous (e.g. if style is None I would expect original to be unchanged) and would truncate the remaining line if unchanged (since we must keep remainder data). It is possible to return to original behaviour if this change is unwanted.

Comment on lines -185 to +204
output.write_all(mangled.as_bytes())?;
// there are maybe valid symbols inside this fake one
output.write_all(&line.as_bytes()[..1])?;
line = &line[1..];
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note again a subtle difference: this can now identify and write symbols inside what appear to be other symbols.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't that accidentally demangle parts of C++ symbols as Rust symbols? If so that would break C++ demangling for programs that try Rust demangling before C++ demangling to handle both. Trying C++ demangling first is guaranteed to work for legacy Rust symbols, but doesn't demangle the $LT$,$C$, ... parts.

let next_head = match (line[head..].find("_ZN"), line[head..].find("_R")) {
(Some(idx), None) | (None, Some(idx)) => head + idx,
(Some(idx1), Some(idx2)) => head + idx1.min(idx2),
let next_head = match (line.find("_ZN"), line.find("_R")) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this valid for OSX? This conflicts with the tests.

@workingjubilee
Copy link
Member

workingjubilee commented Sep 29, 2023

Does the problem as-such occur with v0 demangling?

@addisoncrump
Copy link
Author

Haven't tested with v0 as I didn't have any samples of unicode in v0 symbols. I'll try this later.

@workingjubilee
Copy link
Member

workingjubilee commented Sep 29, 2023

We would like to enable v0 mangling by default rather than continuing to focus on "improving" legacy mangling/demangling (an improvement that is somewhat questionable if other tooling that supports legacy mangling/demangling does not support or expect it), so it would be important to know how it is affected by the concern you have.

@addisoncrump
Copy link
Author

addisoncrump commented Oct 2, 2023

I'll be honest, I don't entirely follow this logic 😕 Legacy mangling will be around for some time yet (even once the RFC/stabilisation lands) and this is something that is supported by GDB/LLDB. Moreover, I'm pretty sure this patch is faster than the previous behaviour. To test this, I wrote a benchmark:

Benchmark Code
extern crate criterion;
extern crate rustc_demangle;

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use rustc_demangle::demangle_stream;
use std::io::{BufReader, Sink};

const SOURCE: &str = include_str!("sample-mangled.txt");

fn demangle(source: &str) {
    let mut reader = BufReader::new(source.as_bytes());
    let mut writer = Sink::default();

    demangle_stream(&mut reader, &mut writer, false).unwrap();
}

fn criterion_benchmark(c: &mut Criterion) {
    c.bench_function("demangle", |b| b.iter(|| demangle(black_box(SOURCE))));
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

Here is the sample I used (from an ongoing project in which we will soon want the changes in this PR): sample-mangled.txt

Then tested with this PR:

$ cargo bench --bench demangle_large --features std
   Compiling rustc-demangle v0.1.23 (/home/addisoncrump/git/rustc-demangle)
    Finished bench [optimized] target(s) in 16.58s
     Running benches/demangle_large.rs (target/release/deps/demangle_large-0592d8a86e001704)
Gnuplot not found, using plotters backend
demangle                time:   [12.683 ms 12.701 ms 12.721 ms]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

With the original:

$ cargo bench --bench demangle_large --features std
   Compiling rustc-demangle v0.1.23 (/home/addisoncrump/git/rustc-demangle)
    Finished bench [optimized] target(s) in 16.62s
     Running benches/demangle_large.rs (target/release/deps/demangle_large-0592d8a86e001704)
Gnuplot not found, using plotters backend
demangle                time:   [19.847 ms 19.879 ms 19.917 ms]
                        change: [+56.172% +56.517% +56.900%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe

There are significant performance improvements associated with being more permissive here without loss of generality or correctness.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants