stage2-generated binaries are slower and more bloated than stage1-generated binaries #11498
I have two hunches for this problem:
Other than that, probably the next step here will be reducing the problem to a smaller compilation unit that still exhibits the problem.
I played around with bloaty and made some progress on this issue. The comparison feature of bloaty was not useful, so instead I looked at the bloaty output for stage2 and stage3 side by side. These are compiled without LLVM, in release mode, with a hack to enable build IDs, required by bloaty (see #3047):

```diff
--- a/src/link/Elf.zig
+++ b/src/link/Elf.zig
@@ -1450,6 +1450,8 @@ fn linkWithLLD(self: *Elf, comp: *Compilation, prog_node: *std.Progress.Node) !v
     if (self.base.options.output_mode == .Exe) {
         try argv.append("-z");
         try argv.append(try std.fmt.allocPrint(arena, "stack-size={d}", .{stack_size}));
+
+        try argv.append("--build-id");
     }
     if (self.base.options.image_base_override) |image_base| {
```
Observations
Final observation (and we may want to look into this as a prerequisite): lines 2216 to 2217 in 97816e3.
Remember, these are release builds; if we are not setting enough LLVM attributes, the optimizer might not be able to inline and delete code as much.
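As a hedged illustration of what is at stake (this is just an example program, not taken from the issue): with accurate function attributes, a release build should be able to inline the helper below and fold the whole call away.

```zig
const std = @import("std");

fn lengthOf(s: []const u8) usize {
    return s.len;
}

pub fn main() void {
    // With accurate attributes on lengthOf, the optimizer can inline this call
    // and reduce it to the constant 5; without them, both the call and the
    // helper's body may survive into the binary.
    const n = lengthOf("hello");
    std.debug.assert(n == 5);
}
```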
OK, I think I narrowed this down a little bit further. I had a look at the generated LLVM IR. In particular, the stage1-generated LLVM IR has 22 alloca instructions with an ArrayListAligned type, whereas the stage2-generated LLVM IR has 1753 alloca instructions with an ArrayListAligned type. That definitely seems... suboptimal. I suspect that if we figure out why the LLVM backend is creating so many alloca instructions, we will have solved at least a big chunk of this issue.

Another random observation: currently we lower slice parameters as struct values in LLVM IR. However, if we lowered them as separate ptr/len parameters, we could put the "nonnull" attribute on the pointer (and "readonly" if it is a const slice). This would be an improvement over stage1. Aha, this is #561.
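To make the slice-parameter idea concrete, here is a hedged sketch; the split signature below only illustrates the proposed lowering, it is not code that exists in the compiler, and the function names are invented.

```zig
const std = @import("std");

// Today a slice parameter is lowered as one { ptr, len } struct value:
fn countSpaces(text: []const u8) usize {
    var n: usize = 0;
    for (text) |byte| {
        if (byte == ' ') n += 1;
    }
    return n;
}

// The idea is to lower it as if it had been written like this, so the backend
// can attach "nonnull" to the pointer (and "readonly", since it is const):
fn countSpacesSplit(ptr: [*]const u8, len: usize) usize {
    return countSpaces(ptr[0..len]);
}

test "both forms agree" {
    const msg: []const u8 = "a b c";
    try std.testing.expectEqual(countSpaces(msg), countSpacesSplit(msg.ptr, msg.len));
}
```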
Here is a small snippet of code we can look at:

```zig
fn doTheTest() !void {
    var list = std.ArrayList(?[]const u8).init(std.heap.page_allocator);
    defer list.deinit();
}
```

stage1 LLVM IR:

```llvm
; Function Attrs: nobuiltin nounwind
define internal fastcc i16 @doTheTest() unnamed_addr #1 {
Entry:
  %result = alloca i16, align 2
  %list = alloca %"std.array_list.ArrayListAligned(?[]const u8,null)", align 8
  call fastcc void @"std.array_list.ArrayListAligned(?[]const u8,null).init"(%"std.array_list.ArrayListAligned(?[]const u8,null)"* sret(%"std.array_list.ArrayListAligned(?[]const u8,null)") %list, %std.mem.Allocator* @1)
  store i16 0, i16* %result, align 2
  call fastcc void @"std.array_list.ArrayListAligned(?[]const u8,null).deinit"(%"std.array_list.ArrayListAligned(?[]const u8,null)"* %list)
  ret i16 0
}
```

stage2 LLVM IR:

```llvm
; Function Attrs: nounwind
define internal fastcc i16 @test2.doTheTest() unnamed_addr #0 {
Entry:
  %0 = alloca %"array_list.ArrayListAligned(?[]const u8,null)", align 8
  %1 = alloca %"array_list.ArrayListAligned(?[]const u8,null)", align 8
  %2 = alloca %"array_list.ArrayListAligned(?[]const u8,null)", align 8
  call fastcc void @"array_list.ArrayListAligned(?[]const u8,null).init"(%"array_list.ArrayListAligned(?[]const u8,null)"* sret(%"array_list.ArrayListAligned(?[]const u8,null)") %1, %mem.Allocator.mem.Allocator { i8* undef, %mem.Allocator.VTable* @heap.PageAllocator.vtable })
  %3 = bitcast %"array_list.ArrayListAligned(?[]const u8,null)"* %2 to i8*
  %4 = bitcast %"array_list.ArrayListAligned(?[]const u8,null)"* %1 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 8 %3, i8* align 8 %4, i64 40, i1 false)
  %5 = bitcast %"array_list.ArrayListAligned(?[]const u8,null)"* %0 to i8*
  %6 = bitcast %"array_list.ArrayListAligned(?[]const u8,null)"* %2 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 8 %5, i8* align 8 %6, i64 40, i1 false)
  call fastcc void @"array_list.ArrayListAligned(?[]const u8,null).deinit"(%"array_list.ArrayListAligned(?[]const u8,null)"* %0)
  ret i16 0
}
```

AIR:
Observations
We should be able to improve the LLVM backend by lowering these instructions differently, so that in init() the function call directly initializes the canonical local variable that represents the struct, and the deinit() function directly takes the pointer to the struct instead of a copy.

I also noticed that the deinit() function in stage1 has
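A hedged, source-level way to picture that improved lowering (initInPlace below is a made-up helper, not a real std API; it just stands in for what the backend should do with the call's result location):

```zig
const std = @import("std");
const List = std.ArrayList(?[]const u8);

// Hypothetical helper: writes the freshly initialized list straight into the
// caller-provided slot instead of returning it through a temporary.
fn initInPlace(result: *List, allocator: std.mem.Allocator) void {
    result.* = List.init(allocator);
}

fn doTheTestLowered() void {
    // One canonical stack slot for the list...
    var list: List = undefined;
    // ...which init fills in directly, with no intermediate copies...
    initInPlace(&list, std.heap.page_allocator);
    // ...and deinit receives a pointer to that same slot, not to a copy.
    defer list.deinit();
}
```

In other words, the single alloca for `list` should serve both as the result location of init() and as the operand of deinit().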
```zig
const S = struct {
    a: i32,
    b: i32,
    c: i32,
};

test {
    var s: S = undefined;
    _ = s;
}
```

I also noticed that this example emits a memset even when compiled with -OReleaseFast. This might be the actual main problem.
Generally, the load instruction may need to make a copy of an isByRef=true value, such as in the case of the following code:

```zig
pub fn swap(comptime T: type, a: *T, b: *T) void {
    const tmp = a.*;
    a.* = b.*;
    b.* = tmp;
}
```

However, it only needs to do so if there are any instructions which can possibly write to memory. When calling functions with isByRef=true parameters, the AIR code that is generated looks like loads followed directly by call. This allows for a peephole optimization when lowering loads: if the load instruction operates on an isByRef=true type and dies before any side effects occur, then we can safely lower the load as a no-op that returns its operand.

This is one out of three changes I intend to make to address #11498. However, I will put these changes in separate branches and merge them separately so that we can have three independent points on the perf charts.
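A hedged sketch of the load-then-call shape this peephole targets (Big and firstElement are invented for the example; the assumption is that a 256-byte struct is lowered as isByRef=true):

```zig
const std = @import("std");

const Big = struct {
    data: [32]u64,
};

// A by-value parameter of a large (isByRef=true) type is passed as a pointer
// under the hood.
fn firstElement(b: Big) u64 {
    return b.data[0];
}

pub fn main() void {
    var b: Big = undefined;
    b.data[0] = 42;
    // The AIR here is a load of `b` feeding directly into the call, with no
    // side effects in between, so the load can be lowered as a no-op that
    // forwards the address of `b` rather than materializing a 256-byte copy.
    const x = firstElement(b);
    std.debug.assert(x == 42);
}
```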
Some binary size numbers after those two commits:
So, these two things helped but did not bring us all the way home yet.
Worth noting: the above changes seem to have had little to no effect on the relative size of stage2 vs stage3.
This moves some logic from resolveLlvmFunction to updateFunc and takes advantage of the iteration we already do that accounts for C ABI lowering, making LLVM parameter attributes accurate for C ABI functions as well as for our own unspecified calling convention. Related to #11498.
With the above commit, we have closed the perf gap:
However, the bloat gap is still there, so I think this is still worth investigating.
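To illustrate what the parameter-attribute change above is about, here is a hedged sketch (the function names are invented): both the exported C ABI function and the internal function using the unspecified calling convention should get accurate attributes such as nonnull/readonly on their pointer parameters.

```zig
// Exported with the C ABI: after the change, its pointer parameter can carry
// accurate LLVM attributes instead of going through a generic path.
export fn sum_u32(ptr: [*]const u32, len: usize) u32 {
    return sum(ptr[0..len]);
}

// Internal function using Zig's unspecified calling convention; its slice
// parameter benefits from the same attribute information.
fn sum(values: []const u32) u32 {
    var total: u32 = 0;
    for (values) |v| {
        total +%= v;
    }
    return total;
}
```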
Not yet implemented are enhancements to coerceInMemory to account for noalias parameters. Related to #11498.
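And a hedged sketch of the noalias side of it (names invented): the fact that the two pointers never alias is exactly the information a coerceInMemory enhancement would need to preserve, so LLVM can reorder or vectorize the loop freely.

```zig
const std = @import("std");

// With noalias on both pointer parameters, LLVM may assume dst and src never
// overlap when optimizing this loop.
fn addInPlace(noalias dst: [*]f32, noalias src: [*]const f32, len: usize) void {
    var i: usize = 0;
    while (i < len) : (i += 1) {
        dst[i] += src[i];
    }
}

test "noalias sketch" {
    var a = [_]f32{ 1, 2, 3 };
    const b = [_]f32{ 4, 5, 6 };
    addInPlace(&a, &b, a.len);
    try std.testing.expectEqual(@as(f32, 5), a[0]);
}
```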
Well, maybe the latest three master branch commits helped perf (we will see on https://ziglang.org/perf/ shortly), but they did not help with the binary bloat:
I spent some time examining stage1-generated LLVM IR vs stage2-generated LLVM IR after optimizations. The stage1-generated LLVM IR had this cleanup pattern:

```
if (!init_a()) goto cleanup_a;
if (!init_b()) goto cleanup_b;
if (!init_c()) goto cleanup_c;
cleanup_all:
    deinit_c();
cleanup_c:
    deinit_b();
cleanup_b:
    deinit_a();
cleanup_a:
    return;
```

whereas the stage2-generated LLVM IR had this pattern:

```
if (!init_a()) {
    return;
}
if (!init_b()) {
    deinit_a();
    return;
}
if (!init_c()) {
    deinit_b();
    deinit_a();
    return;
}
deinit_c();
deinit_b();
deinit_a();
return;
```

This would explain why the stage2 binaries end up larger: every error path carries its own copy of the cleanup calls instead of sharing a tail. Related: #283. (A Zig-level sketch of the kind of source code behind these chains follows below.)

So, my list of untested ideas is now:
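As mentioned above, here is a hedged Zig-level sketch of the kind of source code that produces these cleanup chains; all of the init/deinit names are invented for illustration.

```zig
const Err = error{InitFailed};

fn initA() Err!u32 { return 1; }
fn initB() Err!u32 { return 2; }
fn initC() Err!u32 { return 3; }
fn deinitA(a: u32) void { _ = a; }
fn deinitB(b: u32) void { _ = b; }
fn deinitC(c: u32) void { _ = c; }

// If initB fails, only deinitA runs; if initC fails, deinitB then deinitA run;
// on success, all three are torn down. Sharing the tail of these cleanup
// sequences (as stage1's goto chain does) avoids duplicating every deinit call
// on every error path.
fn setupAll() Err!void {
    const a = try initA();
    errdefer deinitA(a);
    const b = try initB();
    errdefer deinitB(b);
    const c = try initC();
    deinitC(c);
    deinitB(b);
    deinitA(a);
}

test "cleanup chain" {
    try setupAll();
}
```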
As of latest master (0e26c61) the bloat issue is still here and still pretty severe:
Edit: I forgot to strip. It's mostly extra debug info, which is actually nice. Stripped:
After the merge of #12164, landed in 6fab6c3:
With a 0.2% difference in stripped binary size, I think we can call this closed. Speed:
Explanation:
Conclusions:
This branch originally started out as a potential workaround to address #11450. It did not solve that problem; however, it did end up fixing #11498!
Here I build the self-hosted compiler with -OReleaseFast with stage1 and with stage2, then compare their sizes. Without stripping, we have:
However that may be due to better debug info in stage2-generated binaries, so let's look at stripped versions:
So there is an extra 2 MiB here unaccounted for. Also, I would have expected smaller binary size because in many ways we generate more efficient LLVM IR in stage2 than in stage1.
It is indeed the .text segment:
I also noticed that stage2-generated binaries are slightly worse in terms of runtime performance, possibly for the same reason: