-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow for non-ASCII decoding in legacy demangling #65
base: main
Are you sure you want to change the base?
Conversation
🐇 is not a valid character inside symbols. Some linkers will reject it and some tools don't handle it, not just rustc-demangle. Rustc would always encode it into something like |
Unfortunately this PR is incomplete -- and perhaps more invasive than I originally thought. The existing strategy for |
s = &s[..i]; | ||
thinlto_stripped = &s[..i]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note a subtle change here: previously, this would strip the thinlto data from s before assigning original to it, which could lead to the removal of symbol data before passing it to original. This seemed to be erroneous (e.g. if style is None I would expect original to be unchanged) and would truncate the remaining line if unchanged (since we must keep remainder data). It is possible to return to original behaviour if this change is unwanted.
output.write_all(mangled.as_bytes())?; | ||
// there are maybe valid symbols inside this fake one | ||
output.write_all(&line.as_bytes()[..1])?; | ||
line = &line[1..]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note again a subtle difference: this can now identify and write symbols inside what appear to be other symbols.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't that accidentally demangle parts of C++ symbols as Rust symbols? If so that would break C++ demangling for programs that try Rust demangling before C++ demangling to handle both. Trying C++ demangling first is guaranteed to work for legacy Rust symbols, but doesn't demangle the $LT$
,$C$
, ... parts.
let next_head = match (line[head..].find("_ZN"), line[head..].find("_R")) { | ||
(Some(idx), None) | (None, Some(idx)) => head + idx, | ||
(Some(idx1), Some(idx2)) => head + idx1.min(idx2), | ||
let next_head = match (line.find("_ZN"), line.find("_R")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this valid for OSX? This conflicts with the tests.
Does the problem as-such occur with v0 demangling? |
Haven't tested with v0 as I didn't have any samples of unicode in v0 symbols. I'll try this later. |
We would like to enable v0 mangling by default rather than continuing to focus on "improving" legacy mangling/demangling (an improvement that is somewhat questionable if other tooling that supports legacy mangling/demangling does not support or expect it), so it would be important to know how it is affected by the concern you have. |
I'll be honest, I don't entirely follow this logic 😕 Legacy mangling will be around for some time yet (even once the RFC/stabilisation lands) and this is something that is supported by GDB/LLDB. Moreover, I'm pretty sure this patch is faster than the previous behaviour. To test this, I wrote a benchmark: Benchmark Codeextern crate criterion;
extern crate rustc_demangle;
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use rustc_demangle::demangle_stream;
use std::io::{BufReader, Sink};
const SOURCE: &str = include_str!("sample-mangled.txt");
fn demangle(source: &str) {
let mut reader = BufReader::new(source.as_bytes());
let mut writer = Sink::default();
demangle_stream(&mut reader, &mut writer, false).unwrap();
}
fn criterion_benchmark(c: &mut Criterion) {
c.bench_function("demangle", |b| b.iter(|| demangle(black_box(SOURCE))));
}
criterion_group!(benches, criterion_benchmark);
criterion_main!(benches); Here is the sample I used (from an ongoing project in which we will soon want the changes in this PR): sample-mangled.txt Then tested with this PR:
With the original:
There are significant performance improvements associated with being more permissive here without loss of generality or correctness. |
There are some rare situations (I'm currently in one) where one needs non-ASCII demangling. This change implements non-ASCII support for legacy demangling.