stage2-generated binaries are slower and more bloated than stage1-generated binaries #11498
I have two hunches for this problem:
Other than that, probably the next step here will be reducing the problem to a smaller compilation unit that still exhibits the problem.
I played around with bloaty and made some progress on this issue. The comparison feature of bloaty was not useful, so instead I looked at the bloaty output for stage2 and stage3 side by side. These are compiled without LLVM, in release mode, with a hack to enable build IDs, required by bloaty (see #3047):

```diff
--- a/src/link/Elf.zig
+++ b/src/link/Elf.zig
@@ -1450,6 +1450,8 @@ fn linkWithLLD(self: *Elf, comp: *Compilation, prog_node: *std.Progress.Node) !v
     if (self.base.options.output_mode == .Exe) {
         try argv.append("-z");
         try argv.append(try std.fmt.allocPrint(arena, "stack-size={d}", .{stack_size}));
+
+        try argv.append("--build-id");
     }
     if (self.base.options.image_base_override) |image_base| {
```
Observations
Final observation (and we may want to look into this as a prerequisite): lines 2216 to 2217 in 97816e3.
Remember, these are release builds; if we are not setting enough LLVM attributes, the optimizer might not be able to inline and delete code as much.
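As a hedged illustration of what is at stake (this is just an example program, not taken from the issue): with accurate function attributes, a release build should be able to inline the helper below and fold the whole call away.

```zig
const std = @import("std");

fn lengthOf(s: []const u8) usize {
    return s.len;
}

pub fn main() void {
    // With accurate attributes on lengthOf, the optimizer can inline this call
    // and reduce it to the constant 5; without them, both the call and the
    // helper's body may survive into the binary.
    const n = lengthOf("hello");
    std.debug.assert(n == 5);
}
```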
OK, I think I narrowed this down a little bit further. I had a look at the generated LLVM IR. In particular, the stage1-generated LLVM IR has 22 alloca instructions with an ArrayListAligned type, whereas the stage2-generated LLVM IR has 1753 alloca instructions with an ArrayListAligned type. That definitely seems... suboptimal. I suspect that if we figure out why the LLVM backend is creating so many alloca instructions, we will have solved at least a big chunk of this issue.

Another random observation: currently we lower slice parameters as struct values in LLVM IR. However, if we lowered them as separate ptr/len parameters, we could put the "nonnull" attribute on the pointer (and "readonly" if it is a const slice). This would be an improvement over stage1. Aha, this is #561.
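To make the slice-parameter idea concrete, here is a hedged sketch; the split signature below only illustrates the proposed lowering, it is not code that exists in the compiler, and the function names are invented.

```zig
const std = @import("std");

// Today a slice parameter is lowered as one { ptr, len } struct value:
fn countSpaces(text: []const u8) usize {
    var n: usize = 0;
    for (text) |byte| {
        if (byte == ' ') n += 1;
    }
    return n;
}

// The idea is to lower it as if it had been written like this, so the backend
// can attach "nonnull" to the pointer (and "readonly", since it is const):
fn countSpacesSplit(ptr: [*]const u8, len: usize) usize {
    return countSpaces(ptr[0..len]);
}

test "both forms agree" {
    const msg: []const u8 = "a b c";
    try std.testing.expectEqual(countSpaces(msg), countSpacesSplit(msg.ptr, msg.len));
}
```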
Here is a small snippet of code we can look at:

```zig
fn doTheTest() !void {
    var list = std.ArrayList(?[]const u8).init(std.heap.page_allocator);
    defer list.deinit();
}
```

stage1 LLVM IR:

```llvm
; Function Attrs: nobuiltin nounwind
define internal fastcc i16 @doTheTest() unnamed_addr #1 {
Entry:
  %result = alloca i16, align 2
  %list = alloca %"std.array_list.ArrayListAligned(?[]const u8,null)", align 8
  call fastcc void @"std.array_list.ArrayListAligned(?[]const u8,null).init"(%"std.array_list.ArrayListAligned(?[]const u8,null)"* sret(%"std.array_list.ArrayListAligned(?[]const u8,null)") %list, %std.mem.Allocator* @1)
  store i16 0, i16* %result, align 2
  call fastcc void @"std.array_list.ArrayListAligned(?[]const u8,null).deinit"(%"std.array_list.ArrayListAligned(?[]const u8,null)"* %list)
  ret i16 0
}
```

stage2 LLVM IR:

```llvm
; Function Attrs: nounwind
define internal fastcc i16 @test2.doTheTest() unnamed_addr #0 {
Entry:
  %0 = alloca %"array_list.ArrayListAligned(?[]const u8,null)", align 8
  %1 = alloca %"array_list.ArrayListAligned(?[]const u8,null)", align 8
  %2 = alloca %"array_list.ArrayListAligned(?[]const u8,null)", align 8
  call fastcc void @"array_list.ArrayListAligned(?[]const u8,null).init"(%"array_list.ArrayListAligned(?[]const u8,null)"* sret(%"array_list.ArrayListAligned(?[]const u8,null)") %1, %mem.Allocator.mem.Allocator { i8* undef, %mem.Allocator.VTable* @heap.PageAllocator.vtable })
  %3 = bitcast %"array_list.ArrayListAligned(?[]const u8,null)"* %2 to i8*
  %4 = bitcast %"array_list.ArrayListAligned(?[]const u8,null)"* %1 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 8 %3, i8* align 8 %4, i64 40, i1 false)
  %5 = bitcast %"array_list.ArrayListAligned(?[]const u8,null)"* %0 to i8*
  %6 = bitcast %"array_list.ArrayListAligned(?[]const u8,null)"* %2 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 8 %5, i8* align 8 %6, i64 40, i1 false)
  call fastcc void @"array_list.ArrayListAligned(?[]const u8,null).deinit"(%"array_list.ArrayListAligned(?[]const u8,null)"* %0)
  ret i16 0
}
```

AIR:
Observations
We should be able to improve the LLVM backend by lowering these instructions differently, so that in init() the function call directly initializes the canonical local variable that represents the struct, and the deinit() function directly takes the pointer to the struct instead of a copy.

I also noticed that the deinit() function in stage1 has
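A hedged, source-level way to picture that improved lowering (initInPlace below is a made-up helper, not a real std API; it just stands in for what the backend should do with the call's result location):

```zig
const std = @import("std");
const List = std.ArrayList(?[]const u8);

// Hypothetical helper: writes the freshly initialized list straight into the
// caller-provided slot instead of returning it through a temporary.
fn initInPlace(result: *List, allocator: std.mem.Allocator) void {
    result.* = List.init(allocator);
}

fn doTheTestLowered() void {
    // One canonical stack slot for the list...
    var list: List = undefined;
    // ...which init fills in directly, with no intermediate copies...
    initInPlace(&list, std.heap.page_allocator);
    // ...and deinit receives a pointer to that same slot, not to a copy.
    defer list.deinit();
}
```

In other words, the single alloca for `list` should serve both as the result location of init() and as the operand of deinit().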
```zig
const S = struct {
    a: i32,
    b: i32,
    c: i32,
};

test {
    var s: S = undefined;
    _ = s;
}
```

I also noticed that this example emits a memset even when compiled with -OReleaseFast. This might be the actual main problem.
Generally, the load instruction may need to make a copy of an isByRef=true value, such as in the case of the following code:

```zig
pub fn swap(comptime T: type, a: *T, b: *T) void {
    const tmp = a.*;
    a.* = b.*;
    b.* = tmp;
}
```

However, it only needs to do so if there are any instructions which can possibly write to memory. When calling functions with isByRef=true parameters, the AIR code that is generated looks like loads followed directly by call. This allows for a peephole optimization when lowering loads: if the load instruction operates on an isByRef=true type and dies before any side effects occur, then we can safely lower the load as a no-op that returns its operand.

This is one out of three changes I intend to make to address #11498. However, I will put these changes in separate branches and merge them separately so that we can have three independent points on the perf charts.
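A hedged sketch of the load-then-call shape this peephole targets (Big and firstElement are invented for the example; the assumption is that a 256-byte struct is lowered as isByRef=true):

```zig
const std = @import("std");

const Big = struct {
    data: [32]u64,
};

// A by-value parameter of a large (isByRef=true) type is passed as a pointer
// under the hood.
fn firstElement(b: Big) u64 {
    return b.data[0];
}

pub fn main() void {
    var b: Big = undefined;
    b.data[0] = 42;
    // The AIR here is a load of `b` feeding directly into the call, with no
    // side effects in between, so the load can be lowered as a no-op that
    // forwards the address of `b` rather than materializing a 256-byte copy.
    const x = firstElement(b);
    std.debug.assert(x == 42);
}
```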
Some binary size numbers after those two commits:
So, these two things helped but did not bring us all the way home yet.
Worth noting: the above changes seem to have had little to no effect on the relative size of stage2 vs stage3.
This moves some logic from resolveLlvmFunction to updateFunc and takes advantage of the iteration we already do that accounts for C ABI lowering, making LLVM parameter attributes accurate for C ABI functions as well as for our own unspecified calling convention. Related to #11498.
With the above commit, we have closed the perf gap:
However, the bloat gap is still there, so I think this is still worth investigating.
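To illustrate what the parameter-attribute change above is about, here is a hedged sketch (the function names are invented): both the exported C ABI function and the internal function using the unspecified calling convention should get accurate attributes such as nonnull/readonly on their pointer parameters.

```zig
// Exported with the C ABI: after the change, its pointer parameter can carry
// accurate LLVM attributes instead of going through a generic path.
export fn sum_u32(ptr: [*]const u32, len: usize) u32 {
    return sum(ptr[0..len]);
}

// Internal function using Zig's unspecified calling convention; its slice
// parameter benefits from the same attribute information.
fn sum(values: []const u32) u32 {
    var total: u32 = 0;
    for (values) |v| {
        total +%= v;
    }
    return total;
}
```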
Not yet implemented are enhancements to coerceInMemory to account for noalias parameters. Related to #11498.
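And a hedged sketch of the noalias side of it (names invented): the fact that the two pointers never alias is exactly the information a coerceInMemory enhancement would need to preserve, so LLVM can reorder or vectorize the loop freely.

```zig
const std = @import("std");

// With noalias on both pointer parameters, LLVM may assume dst and src never
// overlap when optimizing this loop.
fn addInPlace(noalias dst: [*]f32, noalias src: [*]const f32, len: usize) void {
    var i: usize = 0;
    while (i < len) : (i += 1) {
        dst[i] += src[i];
    }
}

test "noalias sketch" {
    var a = [_]f32{ 1, 2, 3 };
    const b = [_]f32{ 4, 5, 6 };
    addInPlace(&a, &b, a.len);
    try std.testing.expectEqual(@as(f32, 5), a[0]);
}
```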
Well, maybe the latest three master branch commits helped perf (we will see on https://ziglang.org/perf/ shortly), but they did not help with the binary bloat:
I spent some time examining stage1-generated LLVM IR vs stage2-generated LLVM IR after optimizations. The stage1-generated LLVM IR had this cleanup pattern:

```
if (!init_a()) goto cleanup_a;
if (!init_b()) goto cleanup_b;
if (!init_c()) goto cleanup_c;
cleanup_all:
    deinit_c();
cleanup_c:
    deinit_b();
cleanup_b:
    deinit_a();
cleanup_a:
    return;
```

whereas the stage2-generated LLVM IR had this pattern:

```
if (!init_a()) {
    return;
}
if (!init_b()) {
    deinit_a();
    return;
}
if (!init_c()) {
    deinit_b();
    deinit_a();
    return;
}
deinit_c();
deinit_b();
deinit_a();
return;
```

This would explain why the stage2 binaries end up larger: every error path carries its own copy of the cleanup calls instead of sharing a tail. Related: #283. (A Zig-level sketch of the kind of source code behind these chains follows below.)

So, my list of untested ideas is now:
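As mentioned above, here is a hedged Zig-level sketch of the kind of source code that produces these cleanup chains; all of the init/deinit names are invented for illustration.

```zig
const Err = error{InitFailed};

fn initA() Err!u32 { return 1; }
fn initB() Err!u32 { return 2; }
fn initC() Err!u32 { return 3; }
fn deinitA(a: u32) void { _ = a; }
fn deinitB(b: u32) void { _ = b; }
fn deinitC(c: u32) void { _ = c; }

// If initB fails, only deinitA runs; if initC fails, deinitB then deinitA run;
// on success, all three are torn down. Sharing the tail of these cleanup
// sequences (as stage1's goto chain does) avoids duplicating every deinit call
// on every error path.
fn setupAll() Err!void {
    const a = try initA();
    errdefer deinitA(a);
    const b = try initB();
    errdefer deinitB(b);
    const c = try initC();
    deinitC(c);
    deinitB(b);
    deinitA(a);
}

test "cleanup chain" {
    try setupAll();
}
```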
As of latest master (0e26c61) the bloat issue is still here and still pretty severe:
Edit: I forgot to strip. It's mostly extra debug info, which is actually nice. Stripped:
After the merge of #12164, landed in 6fab6c3:
With a 0.2% difference in stripped binary size, I think we can call this closed. Speed:
Explanation:
Conclusions:
This branch originally started out as a potential workaround to address #11450. It did not solve that problem; however, it did end up fixing #11498!
Here I build the self-hosted compiler with -OReleaseFast with stage1 and with stage2, then compare their sizes. Without stripping, we have:
However that may be due to better debug info in stage2-generated binaries, so let's look at stripped versions:
So there is an extra 2 MiB here unaccounted for. Also, I would have expected smaller binary size because in many ways we generate more efficient LLVM IR in stage2 than in stage1.
It is indeed the .text segment:
I also noticed that stage2-generated binaries are slightly worse in terms of runtime performance, possibly for the same reason: