Split debug info for emscripten #9871

dschuff · 2019-11-19T23:21:55Z

Here's an overview of debug-info splitting options on ELF platforms and how we might apply them to wasm. It doesn't yet discuss how any of this would be implemented.

Currently LLVM supports outputting debug info in wasm object files, in the traditional GNU manner where all debug info is in the object file in several sections such as .debug_info, and linked into the executable in the same sections. For deployment, binaries can be built optimized but with debug info, and then stripped; a copy of the full binary can then be archived, and used to symbolize or debug the stripped binary.
This has the advantage that it uses the minimal number of files, the simplest compile+link flow (every build system can handle it), and the debugger needs only one file to debug the binary. It has the disadvantage that the linker must merge all the debug info on every incremental build (slowing the link), and the resulting executable is very large. This latter disadvantage is especially important for wasm, because (even when just debugging) the binary must be sent over a network (even if a fast one) and loaded into the VM (which is much more expensive than a Linux loader which just needs to mmap the sections).

GCC and LLVM for ELF targets support "split-dwarf" mode, using the -gsplit-dwarf flag (a good overview is here), which splits most of the debug info out from each object into a separate dwo file. In that case the object file has a much smaller .debug_info section and its dwo file has a large .debug_info.dwo section with most of the info. The debugger must then look up any required dwo file on demand when debugging. This mitigates both of the aforementioned disadvantages of traditional debug info. One disadvantage of this approach is that all of the dwo files must be available to debug the binary, and moving them around is annoying (many files, need to preserve directory hierarchy and pathnames). A second disadvantage is that there are now 2 output files for each C file, which makes things more challenging for build systems.

The dwp tool can be used to combine all of the debug info in dwo files into a single .dwp file that goes alongside the executable, which mitigates that first disadvantage, at the cost of invoking another tool at link time. Clang has a decent solution for the second issue in the form of a second split-dwarf variant, -gsplit-dwarf=single. This splits the debug info into the same sections as the other split-dwarf variant but puts the .debug_info.dwo section in the object file instead of a separate dwo file. The linker only links the .debug_info sections into the final executable (not the .debug_info.dwo sections), keeping the advantages of splitting without the extra build system pain. dwp works the same way.
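To make the comparison concrete: the three flavors above differ only in a compiler flag and in which files each compile step leaves behind. Here is a minimal Python sketch that builds the command lines without running them. The file names are hypothetical, and the `llvm-dwp -e ... -o ...` packaging step is an assumption that should be checked against the installed tool.

```python
from pathlib import Path

# Flags as named in the discussion above. "-g" is passed explicitly because
# some compiler versions do not enable debug info from -gsplit-dwarf alone.
DEBUG_FLAGS = {
    "monolithic": ["-g"],
    "split": ["-g", "-gsplit-dwarf"],
    "single": ["-g", "-gsplit-dwarf=single"],
}


def compile_command(source, mode, compiler="clang"):
    """Build (but do not run) the compile command for one source file."""
    obj = str(Path(source).with_suffix(".o"))
    return [compiler, *DEBUG_FLAGS[mode], "-c", source, "-o", obj]


def expected_outputs(source, mode):
    """Files the build system has to track for one translation unit."""
    src = Path(source)
    outputs = [str(src.with_suffix(".o"))]
    if mode == "split":
        # Only the classic split mode produces a second file per source.
        outputs.append(str(src.with_suffix(".dwo")))
    return outputs


def package_command(executable, tool="llvm-dwp"):
    """Assumed invocation for merging all dwo data into one .dwp file."""
    return [tool, "-e", executable, "-o", executable + ".dwp"]


if __name__ == "__main__":
    for mode in DEBUG_FLAGS:
        print(mode, compile_command("foo.c", mode), expected_outputs("foo.c", mode))
    print(package_command("a.out"))
```

The point for build systems is visible in `expected_outputs`: only the `split` mode adds a second output per source file, while `single` keeps one output per compile step.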

In all these cases, there is still a small amount of debug info in the final binary, which is unfortunate but I don't know of a better way in the ELF world.

It sounds to me like using -gsplit-dwarf=single is a good goal to shoot for. It allows small wasm binaries (which is the thing we need most), without making things harder on build systems than they need to be. It would require the linker and debugger to support split dwarf and would lead to a slightly more complex optimal deployment method (extra link steps and/or storing/distributing/serving .o/.dwo/.dwp files) but it seems worth it.

Thoughts?
@sbc @yurydelendik @azakai @pfaffe @bmeurer


yurydelendik · 2019-11-20T14:15:55Z

Split-dwarf will still leave some amount of debug information in the wasm file. In WebAssembly/debugging#1 I proposed moving the entire DWARF data to an external file. The benefit is that the wasm file would contain no debug info at all, so tools would not have to deal with multiple sections spread across different files.

dschuff · 2019-11-20T19:35:54Z

I agree that that's preferable; the downside would just be that we'd be more different from other platforms. It's probably worth looking more at why the ELF solution splits the debug info the way it does, and how it works on other platforms which have done external DWARF such as Apple and HP.

dschuff · 2019-11-20T19:39:54Z

There's also still the question of whether to have entirely separate files or embed the info in the object files.
Based on a quick look at https://gcc.gnu.org/wiki/DebugFission, all the non-GNU flavors just keep it in the object file. It also looks like the other implementations use summary information in STABS format rather than dwarf.

pfaffe · 2019-11-21T07:55:29Z

Just to emphasize, being forced to serve wasm symbol files over a network is a showstopper. Meaningful applications have symbol files ranging in gigabytes of data.
So for debugging, we need to be able to load symbol files out-of-process and from disk. That means having a single binary is not going to work for us.

I don't have a preference whether there should be two binaries, a stripped and a full one, or whether to do split-dwarf. Both can be achieved in post-processing though, so I don't even think there is a true upside to either.

bmeurer · 2019-11-21T12:31:38Z

I agree with @pfaffe here. The everything-in-one-binary approach is only going to work for reasonably sized applications if we can load the .wasm file from the file system, which is possible, but then again, we could also just separate out the debug information and load only that separately.

I do however also see the benefit of having everything in one binary. That's going to make a lot of steps in the pipeline a lot easier.

dschuff · 2019-11-21T18:51:53Z

> Just to emphasize, being forced to serve wasm symbol files over a network is a showstopper. Meaningful applications have symbol files ranging in gigabytes of data.

Indeed, that's why I filed this issue in the first place.

> So for debugging, we need to be able to load symbol files out-of-process and from disk. That means having a single binary is not going to work for us.

Let's also bear in mind the eventual use case where the debugger isn't running as a native application but is somehow factored into a standardized debugger module or language component integrated into devtools. It's not exactly clear yet what the consequences of that would be (reduced memory or other resources available? debug info might actually be served on the side by the server? maybe there will be similar filesystem APIs and nothing is much different?) but worth keeping in mind for the longer term.

We can already use the stripped+full-debug-in-one-binary workflow without any extra tool support; putting the stripped binary on the server and loading the full binary in the debugger would be analogous to loading a full native binary in the debugger and then attaching to a process running the stripped binary. But the difference is that stripping would basically be mandatory even for local testing of a large app, which isn't the case for native, and adds extra friction. So it would be nice if we could make it easier.
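As a rough illustration of the "strip for the server, keep the full binary for the debugger" step: DWARF in wasm lives in custom sections (section id 0) whose names start with `.debug_`, so stripping amounts to dropping those sections. The Python sketch below does only that, on an in-memory module. It is a simplification, not a replacement for a real strip tool, and the helper names are made up for this example.

```python
WASM_HEADER = b"\0asm\x01\0\0\0"  # magic number followed by version 1


def read_uleb128(data, pos):
    """Decode an unsigned LEB128 integer; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def write_uleb128(value):
    """Encode a non-negative integer as unsigned LEB128."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def sections(module):
    """Yield (section_id, custom_name_or_None, raw_bytes) per section."""
    if module[:8] != WASM_HEADER:
        raise ValueError("not a wasm version 1 binary")
    pos = 8
    while pos < len(module):
        start = pos
        section_id = module[pos]
        size, body = read_uleb128(module, pos + 1)
        end = body + size
        name = None
        if section_id == 0:  # custom section: payload starts with its name
            name_len, name_pos = read_uleb128(module, body)
            name = module[name_pos:name_pos + name_len].decode("utf-8")
        yield section_id, name, module[start:end]
        pos = end


def strip_debug(module):
    """Return a copy of the module without custom sections named .debug_*."""
    kept = [raw for _, name, raw in sections(module)
            if not (name or "").startswith(".debug_")]
    return WASM_HEADER + b"".join(kept)


def custom_section(name, payload):
    """Demo helper: encode one custom section with the given name."""
    encoded = name.encode("utf-8")
    body = write_uleb128(len(encoded)) + encoded + payload
    return b"\0" + write_uleb128(len(body)) + body


if __name__ == "__main__":
    full = (WASM_HEADER
            + custom_section(".debug_info", b"x" * 200)
            + custom_section("name", b"abc"))
    stripped = strip_debug(full)
    print(len(full), "->", len(stripped), "bytes")
```

The full module would be archived or handed to the debugger, and the stripped one served to the VM. The friction described above is that this extra step becomes necessary even for local testing.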

I looked into this a bit more late yesterday, using a static debug build of clang. My hope was that the skeleton debug info that gets left behind in the main executable when using -gsplit-dwarf would be sufficiently small that we could just use that by default. This would mean that a debug binary would still have a small but manageable amount of debug info, and that the compiler and debugger's built-in support for split dwarf would just work the way it does for native, without any wasm-specific binary hackery.

The monolithic debug clang is a ~1.3G binary, about 1.1G of which is debug info:

$ size -A bin/clang-10 
bin/clang-10  :
section                size        addr
<snip>
.debug_info       893568013           0
.debug_abbrev       4065188           0
.debug_line       116678263           0
.debug_str        156134788           0
.debug_loc           489715           0
.debug_macinfo         1620           0
.debug_ranges      30935856           0
Total            1376360075

Debug info for split dwarf:

$ size -A bin/clang-10 
bin/clang-10  :
section                     size        addr
<snip>
.debug_info               121113           0
.debug_abbrev              56513           0
.debug_line            116678263           0
.debug_str                124186           0
.debug_macinfo              1620           0
.debug_ranges           30935856           0
.debug_addr             17478384           0

... in those sections, now only 165M, on the order of the text size. Not fantastic but probably workable for local debugging. However it now also has pubtypes/pubnames sections (accelerated access tables), which are not present by default in the monolithic build:

.debug_gnu_pubnames    478545510           0
.debug_gnu_pubtypes    240543346           0
Total                 1058971423

This is another 685M, for a total of 850M, which isn't that much less than the original, and probably back into showstopper territory. It's not obvious to me why the split version would need the tables when the monolithic version doesn't (perhaps because having to search a bunch of different files on name lookup is much worse than searching through a single one?), nor why the tables are almost as big as the debug info itself.
But either way it's a bit disappointing.
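The totals can be re-derived from the `size -A` rows. The Python sketch below uses byte counts copied from the listings above and prints each sum in both decimal megabytes and MiB, since the rounded figures depend on which unit is meant: the skeleton sections come to about 165 MB (158 MiB), the two pub tables to about 719 MB (686 MiB), and both together to about 884 MB (844 MiB).

```python
# Byte counts copied from the `size -A` listings above.
MONOLITHIC = {
    ".debug_info": 893568013,
    ".debug_abbrev": 4065188,
    ".debug_line": 116678263,
    ".debug_str": 156134788,
    ".debug_loc": 489715,
    ".debug_macinfo": 1620,
    ".debug_ranges": 30935856,
}

SPLIT_SKELETON = {
    ".debug_info": 121113,
    ".debug_abbrev": 56513,
    ".debug_line": 116678263,
    ".debug_str": 124186,
    ".debug_macinfo": 1620,
    ".debug_ranges": 30935856,
    ".debug_addr": 17478384,
}

PUB_TABLES = {
    ".debug_gnu_pubnames": 478545510,
    ".debug_gnu_pubtypes": 240543346,
}


def total(section_sizes):
    """Sum the sizes of all sections in one listing, in bytes."""
    return sum(section_sizes.values())


def describe(label, size):
    """Format a byte count in decimal MB and in MiB."""
    return f"{label}: {size} bytes = {size / 1e6:.1f} MB = {size / 2**20:.1f} MiB"


if __name__ == "__main__":
    print(describe("monolithic debug info", total(MONOLITHIC)))
    print(describe("split skeleton", total(SPLIT_SKELETON)))
    print(describe("pubnames + pubtypes", total(PUB_TABLES)))
    print(describe("split total", total(SPLIT_SKELETON) + total(PUB_TABLES)))
```

The listings also show that `.debug_line` and `.debug_ranges` are byte-for-byte the same size in both builds, so they account for most of what remains in the skeleton.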

bmeurer · 2019-11-28T06:48:48Z

Any progress on this? The way we are currently prototyping this for DevTools is:

Given that we already know that we will need to deal with applications that have gigabytes of debug data, I think it makes sense to focus on the separate .dwo file approach, and get that working and stabilized as quickly as possible. And then once we have this working, we can always look into exploring alternatives.

cc @hashseed

sbc100 · 2019-12-01T21:15:17Z

@bmeurer, in your diagram I assume you meant to write dwp rather than dwo? (dwo is for a single object file; dwp is for a whole program.)

I started looking into implementing -gsplit-dwarf for wasm in LLVM. The first question that needs answering is: should our dwo format use ELF or wasm as its container? My feeling is that having both the wasm and dwo be similar file formats is consistent, and our tooling will need to support DWARF in wasm anyway for the non-split case to continue to work. It would be strange to have to implement both wasm and ELF container format parsing in the tools, right?

dschuff · 2019-12-02T18:16:00Z

I think dwp and dwo work pretty much the same way, the only difference being whether the debugger has to look for the debug info in a bunch of different object files or in one big package. We'll definitely want to support dwp (i.e. making llvm-dwp support wasm) for the use case of archiving debug info for large release binaries, but I don't think it's P0.
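To illustrate "one big package versus a bunch of files" from the debugger's side, here is a small Python sketch of the lookup for a single skeleton compile unit. The search order (a `.dwp` next to the executable first, then the per-object dwo file named by the skeleton unit's dwo-name and compilation-directory attributes) is an assumption modeled on how native debuggers commonly behave, not a description of any particular implementation. The function names are hypothetical.

```python
import posixpath


def candidate_debug_files(executable, comp_dir, dwo_name):
    """Paths to try, in order, for one skeleton compile unit."""
    return [
        executable + ".dwp",                # whole-program package, if built
        posixpath.join(comp_dir, dwo_name),  # per-object file from the build tree
    ]


def locate_debug_file(executable, comp_dir, dwo_name, exists=posixpath.exists):
    """Return the first candidate that exists, or None if none is found.

    `exists` is injectable so the lookup can be exercised without touching
    the real file system.
    """
    for path in candidate_debug_files(executable, comp_dir, dwo_name):
        if exists(path):
            return path
    return None


if __name__ == "__main__":
    # Pretend only the build tree is available (no .dwp has been packaged).
    present = {"/build/obj/foo.dwo"}
    print(locate_debug_file("/build/app.wasm", "/build", "obj/foo.dwo",
                            exists=present.__contains__))
```

With a dwp the debugger needs one extra file per program. Without it, every dwo path recorded at compile time has to remain valid, which is the "moving them around is annoying" problem from the original post.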

And yes, I think we probably want to use wasm containers, unless there's some compelling reason not to.

dschuff · 2020-01-09T21:18:11Z

+cc @paolosevMSFT

kripken · 2020-03-18T19:14:20Z

#10568 implemented basic splitting.

dschuff · 2020-08-15T00:04:34Z

I've started working on support for -gsplit-dwarf for wasm: https://reviews.llvm.org/D85685

paolosevMSFT · 2020-08-15T05:19:17Z

Hi Derek, Thanks for the update! Paolo


dschuff · 2021-01-13T23:14:58Z

Since split-dwarf is now implemented, I'm going to close this issue. We can open new ones for bugs or future features (for example, since we support split-dwarf and will probably use it in some form, we should also support DWP files).

dschuff · 2021-01-13T23:26:21Z

Filed #13251

bmeurer mentioned this issue Dec 23, 2020

Emscripten post processing doesn't correctly update .debug_addr section #13099

Closed

dschuff closed this as completed Jan 13, 2021
