
Split debug info for emscripten #9871

Closed
dschuff opened this issue Nov 19, 2019 · 15 comments

Comments

@dschuff
Member

dschuff commented Nov 19, 2019

Here's an overview of debug-info splitting options on ELF platforms and how we might apply them to wasm. It doesn't yet discuss how it would be implemented.

Currently LLVM supports outputting debug info in wasm object files, in the traditional GNU manner where all debug info is in the object file in several sections such as .debug_info, and linked into the executable in the same sections. For deployment, binaries can be built optimized but with debug info, and then stripped; a copy of the full binary can then be archived, and used to symbolize or debug the stripped binary.
This has the advantage that it uses the minimal number of files, the simplest compile+link flow (every build system can handle it), and the debugger needs only one file to debug the binary. It has the disadvantage that the linker must merge all the debug info on every incremental build (slowing the link), and the resulting executable is very large. This latter disadvantage is especially important for wasm, because (even when just debugging) the binary must be sent over a network (even if a fast one) and loaded into the VM (which is much more expensive than a Linux loader which just needs to mmap the sections).

GCC and LLVM for ELF targets support "split-dwarf" mode, using the -gsplit-dwarf flag (a good overview is here), which splits most of the debug info out from each object into a separate dwo file. In that case the object file has a much smaller .debug_info section and its dwo file has a large .debug_info.dwo section with most of the info. The debugger must then look up any required dwo file on demand when debugging. This mitigates both of the aforementioned disadvantages of traditional debug info. One disadvantage of this approach is that all of the dwo files must be available to debug the binary, and moving them around is annoying (many files, need to preserve directory hierarchy and pathnames). A second disadvantage is that there are now 2 output files for each C file, which makes things more challenging for build systems.

The dwp tool can be used to combine all of the debug info in dwo files into a single .dwp file that goes alongside the executable, which mitigates that first disadvantage, at the cost of invoking another tool at link time. Clang has a decent solution for the second issue in the form of a second split-dwarf variant, -gsplit-dwarf=single. This splits the debug info into the same sections as the other split-dwarf variant but puts the .debug_info.dwo section in the object file instead of a separate dwo file. The linker only links the .debug_info sections into the final executable (not the .debug_info.dwo sections), keeping the advantages of splitting without the extra build system pain. dwp works the same way.
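As a rough illustration of how the three flows differ from a build system's point of view, here is a minimal Python sketch that only constructs the command lines and lists the files each compile step would produce. The helper names `compile_command` and `dwp_command` are hypothetical; the flags (`-g`, `-gsplit-dwarf`, `-gsplit-dwarf=single`) and the `llvm-dwp` tool are the ones discussed in this issue. It assumes the dwo file is written next to the object file with the same stem.

```python
from pathlib import PurePosixPath


def compile_command(source, mode="monolithic", compiler="clang"):
    """Return (argv, outputs) for compiling one source file with debug info.

    mode is one of:
      "monolithic": all debug info stays in the object file
      "split":      -gsplit-dwarf, most debug info goes to a separate .dwo
      "single":     -gsplit-dwarf=single, .dwo sections stay inside the .o
    """
    src = PurePosixPath(source)
    obj = src.with_suffix(".o")
    argv = [compiler, "-g"]
    outputs = [str(obj)]
    if mode == "split":
        argv.append("-gsplit-dwarf")
        # Assumption: the .dwo is emitted alongside the object file.
        outputs.append(str(src.with_suffix(".dwo")))
    elif mode == "single":
        argv.append("-gsplit-dwarf=single")
    elif mode != "monolithic":
        raise ValueError(f"unknown mode: {mode}")
    argv += ["-c", str(src), "-o", str(obj)]
    return argv, outputs


def dwp_command(debug_inputs, output):
    """Return the argv that packages split debug info into one .dwp file."""
    return ["llvm-dwp", *debug_inputs, "-o", output]


if __name__ == "__main__":
    for mode in ("monolithic", "split", "single"):
        argv, outputs = compile_command("src/a.c", mode)
        print(mode, " ".join(argv), "->", outputs)
    print(" ".join(dwp_command(["src/a.dwo", "src/b.dwo"], "app.dwp")))
```

The point of the sketch is the `outputs` list: only the `"split"` mode makes the build system track a second file per source file, which is the friction that `-gsplit-dwarf=single` avoids.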

In all these cases, there is still a small amount of debug info in the final binary, which is unfortunate but I don't know of a better way in the ELF world.

It sounds to me like using -gsplit-dwarf=single is a good goal to shoot for. It allows small wasm binaries (which is the thing we need most), without making things harder on build systems than they need to be. It would require the linker and debugger to support split dwarf and would lead to a slightly more complex optimal deployment method (extra link steps and/or storing/distributing/serving .o/.dwo/.dwp files) but it seems worth it.

Thoughts?
@sbc @yurydelendik @azakai @pfaffe @bmeurer

@yurydelendik
Collaborator

Split-dwarf will still add some amount of information to the wasm file. In WebAssembly/debugging#1 I proposed moving the entire DWARF data to an external file. The benefit would be that the wasm file contains no debug info at all, so tools don't have to deal with multiple sections spread across different files.

@dschuff
Member Author

dschuff commented Nov 20, 2019

I agree that that's preferable; the downside would just be that we'd be more different from other platforms. It's probably worth looking more at why the ELF solution splits the debug info the way it does, and how it works on other platforms which have done external DWARF such as Apple and HP.

@dschuff
Member Author

dschuff commented Nov 20, 2019

There's also still the question of whether to have entirely separate files or embed the info in the object files.
Based on a quick look at https://gcc.gnu.org/wiki/DebugFission, all the non-GNU flavors just keep it in the object file. It also looks like the other implementations use summary information in STABS format rather than dwarf.

@pfaffe
Collaborator

pfaffe commented Nov 21, 2019

Just to emphasize, being forced to serve wasm symbol files over a network is a showstopper. Meaningful applications have symbol files ranging in gigabytes of data.
So for debugging, we need to be able to load symbol files out-of-process and from disk. That means having a single binary is not going to work for us.

I don't have a preference whether there should be two binaries, a stripped and a full one, or whether to do split-dwarf. Both can be achieved in post-processing though, so I don't even think there is a true upside to either.

@bmeurer
Contributor

bmeurer commented Nov 21, 2019

I agree with @pfaffe here. The everything-in-one-binary approach is only going to work for reasonable applications if we can load the .wasm file from the file system, which is possible, but then again, we could also just separate the debug information and load only that separately.

I do however also see the benefit of having everything in one binary. That's going to make a lot of steps in the pipeline a lot easier.

@dschuff
Member Author

dschuff commented Nov 21, 2019

> Just to emphasize, being forced to serve wasm symbol files over a network is a showstopper. Meaningful applications have symbol files ranging in gigabytes of data.

Indeed, that's why I filed this issue in the first place.

> So for debugging, we need to be able to load symbol files out-of-process and from disk. That means having a single binary is not going to work for us.

Let's also bear in mind the eventual use case where the debugger isn't running as a native application but is somehow factored into a standardized debugger module or language component integrated into devtools. It's not exactly clear yet what the consequences of that would be (reduced memory or other resources available? debug info might actually be served on the side by the server? maybe there will be similar filesystem APIs and nothing is much different?) but worth keeping in mind for the longer term.

We can already use the stripped+full-debug-in-one-binary workflow without any extra tool support; putting the stripped binary on the server and loading the full binary in the debugger would be analogous to loading a full native binary in the debugger and then attaching to a process running the stripped binary. But the difference is that stripping would basically be mandatory even for local testing of a large app, which isn't the case for native, and adds extra friction. So it would be nice if we could make it easier.

I looked into this a bit more late yesterday, using a static debug build of clang. My hope was that the skeleton debug info that gets left behind in the main executable when using -gsplit-dwarf would be sufficiently small that we could just use that by default. This would mean that a debug binary would still have a small but manageable amount of debug info, and that the compiler and debugger's built-in support for split dwarf would just work the way it does for native, without any wasm-specific binary hackery.

The monolithic debug clang is a ~1.3G binary, about 1.1G of which is debug info:

$ size -A bin/clang-10 
bin/clang-10  :
section                size        addr
<snip>
.debug_info       893568013           0
.debug_abbrev       4065188           0
.debug_line       116678263           0
.debug_str        156134788           0
.debug_loc           489715           0
.debug_macinfo         1620           0
.debug_ranges      30935856           0
Total            1376360075

Debug info for split dwarf:

$ size -A bin/clang-10 
bin/clang-10  :
section                     size        addr
<snip>
.debug_info               121113           0
.debug_abbrev              56513           0
.debug_line            116678263           0
.debug_str                124186           0
.debug_macinfo              1620           0
.debug_ranges           30935856           0
.debug_addr             17478384           0

... in those sections, now only 165M, on the order of the text size. Not fantastic but probably workable for local debugging. However it now also has pubtypes/pubnames sections (accelerated access tables), which are not present by default in the monolithic build:

.debug_gnu_pubnames    478545510           0
.debug_gnu_pubtypes    240543346           0
Total                 1058971423

This is another 685M, for a total of 850M, which isn't that much less than the original, and probably back into showstopper territory. It's not obvious to me why the split version would need the tables when the monolithic version doesn't (perhaps because having to search a bunch of different files on name lookup is much worse than searching through a single one?), nor why the tables are almost as big as the debug info itself.
But either way it's a bit disappointing.
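The comparison above can be reproduced from the quoted `size -A` listings. The following Python sketch simply sums the section sizes copied from those listings; the dictionary and function names are illustrative only. Summed exactly, the skeleton sections come to about 165 million bytes and the two pubnames/pubtypes tables to about 719 million bytes (roughly 686 MiB), so the rounded figures in the text mix decimal and binary units slightly.

```python
# Section sizes in bytes, copied from the `size -A` listings quoted above.
MONOLITHIC = {
    ".debug_info": 893568013,
    ".debug_abbrev": 4065188,
    ".debug_line": 116678263,
    ".debug_str": 156134788,
    ".debug_loc": 489715,
    ".debug_macinfo": 1620,
    ".debug_ranges": 30935856,
}

SPLIT_SKELETON = {
    ".debug_info": 121113,
    ".debug_abbrev": 56513,
    ".debug_line": 116678263,
    ".debug_str": 124186,
    ".debug_macinfo": 1620,
    ".debug_ranges": 30935856,
    ".debug_addr": 17478384,
}

SPLIT_TABLES = {
    ".debug_gnu_pubnames": 478545510,
    ".debug_gnu_pubtypes": 240543346,
}


def total(sections):
    """Sum the sizes (in bytes) of a mapping of section name to size."""
    return sum(sections.values())


def mib(num_bytes):
    """Convert a byte count to mebibytes."""
    return num_bytes / 2**20


if __name__ == "__main__":
    rows = [
        ("monolithic debug info", total(MONOLITHIC)),
        ("split: skeleton sections", total(SPLIT_SKELETON)),
        ("split: pubnames/pubtypes", total(SPLIT_TABLES)),
        ("split: all debug sections", total(SPLIT_SKELETON) + total(SPLIT_TABLES)),
    ]
    for label, size in rows:
        print(f"{label:28s} {size:>13,d} bytes  {mib(size):8.1f} MiB")
```

Running it makes the conclusion easy to check: the accelerator tables alone are more than four times the size of everything else left in the split binary.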

@bmeurer
Contributor

bmeurer commented Nov 28, 2019

Any progress on this? The way we are currently prototyping this for DevTools is:

[image: diagram of the DevTools prototype]

Given that we already know that we will need to deal with applications that have gigabytes of debug data, I think it makes sense to focus on the separate .dwo file approach, and get that working and stabilized as quickly as possible. And then once we have this working, we can always look into exploring alternatives.

cc @hashseed

@sbc100
Collaborator

sbc100 commented Dec 1, 2019

@bmeurer, in your diagram I assume you meant to write dwp rather than dwo? (dwo is for a single object file; dwp is for a whole program.)

I started looking into implementing -gsplit-dwarf for wasm in llvm. The first question that needs answering is: should our dwo format use ELF or wasm as its container? My feeling is that having both the wasm and dwo files use similar formats is consistent, and our tooling will need to support dwarf in wasm anyway for the non-split case to continue to work. It would be strange to have to implement both wasm and ELF container format parsing in the tools, right?

@dschuff
Member Author

dschuff commented Dec 2, 2019

I think dwp and dwo work pretty much the same way, the only difference being whether the debugger has to look for the debug info in a bunch of different object files or in one big package. We'll definitely want to support dwp (i.e. making llvm-dwp support wasm) for the use case of archiving debug info for large release binaries, but I don't think it's P0.

And yes, I think we probably want to use wasm containers, unless there's some compelling reason not to.
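To make the container question concrete: a tool that had to accept both formats would first need to tell them apart, which in practice means checking the file's leading magic bytes. The sketch below is only an illustration; the function name `container_format` is hypothetical, while the magic numbers are the standard ones (`\0asm` for WebAssembly binaries, `\x7fELF` for ELF).

```python
WASM_MAGIC = b"\0asm"    # first four bytes of every WebAssembly binary
ELF_MAGIC = b"\x7fELF"   # first four bytes of every ELF file


def container_format(data: bytes) -> str:
    """Classify a file's container format from its leading bytes."""
    if data.startswith(WASM_MAGIC):
        return "wasm"
    if data.startswith(ELF_MAGIC):
        return "elf"
    return "unknown"


if __name__ == "__main__":
    # A wasm header is the magic followed by a little-endian version (1).
    print(container_format(b"\0asm\x01\x00\x00\x00"))
    print(container_format(b"\x7fELF\x02\x01\x01\x00"))
```

With wasm used as the container for dwo files too, a debugger or packaging tool would only ever need the first branch.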

@dschuff
Member Author

dschuff commented Jan 9, 2020

+cc @paolosevMSFT

@kripken
Member

kripken commented Mar 18, 2020

#10568 implemented basic splitting.

@dschuff
Member Author

dschuff commented Aug 15, 2020

I've started working on support for -gsplit-dwarf for wasm: https://reviews.llvm.org/D85685

@paolosevMSFT

paolosevMSFT commented Aug 15, 2020 via email

@dschuff
Member Author

dschuff commented Jan 13, 2021

Since split-dwarf is now implemented, I'm going to close this issue. We can open new ones for bugs or future features (for example, since we support split-dwarf and will probably use it in some form, we should also support DWP files).

@dschuff dschuff closed this as completed Jan 13, 2021
@dschuff
Member Author

dschuff commented Jan 13, 2021

Filed #13251
