
stage2-generated binaries are slower and more bloated than stage1-generated binaries #11498

Closed
Tracked by #89
andrewrk opened this issue Apr 22, 2022 · 12 comments
Labels
backend-llvm The LLVM backend outputs an LLVM IR Module. enhancement Solving this issue will likely involve adding new logic or components to the codebase. frontend Tokenization, parsing, AstGen, Sema, and Liveness. optimization

andrewrk commented Apr 22, 2022

Here I build the self-hosted compiler with -OReleaseFast with stage1 and with stage2, then compare their sizes:

./zig build -p stage2-release -Denable-llvm -Dskip-install-lib-files -Drelease
./stage3/bin/zig build -p stage3-release -Denable-llvm -Dskip-install-lib-files -Drelease
cp stage2-release/bin/zig ~/tmp/zig2
cp stage3-release/bin/zig ~/tmp/zig3

Without stripping, we have:

$ ls -hl ~/tmp/zig{2,3}
-rwxr-xr-x 1 andy users 190M Apr 22 09:27 /home/andy/tmp/zig2
-rwxr-xr-x 1 andy users 205M Apr 22 09:27 /home/andy/tmp/zig3

However, that may be due to better debug info in stage2-generated binaries, so let's look at stripped versions:

$ strip ~/tmp/zig2
$ strip ~/tmp/zig3
$ ls -hl ~/tmp/zig{2,3}
-rwxr-xr-x 1 andy users 138M Apr 22 09:30 /home/andy/tmp/zig2
-rwxr-xr-x 1 andy users 140M Apr 22 09:30 /home/andy/tmp/zig3

So there is an extra 2 MiB here unaccounted for. Also, I would have expected smaller binary size because in many ways we generate more efficient LLVM IR in stage2 than in stage1.

It is indeed the .text segment:

$ bloaty zig3 -- zig2
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  +3.0% +2.63Mi  +3.0% +2.63Mi    .text
  +0.5%  +143Ki  +0.5%  +143Ki    .rodata
  +0.5% +39.0Ki  +0.5% +39.0Ki    .eh_frame
  +0.1%    +672  +0.1%    +672    .eh_frame_hdr
  +0.5%    +192  +0.5%    +192    .data
  [NEW]     +80  [NEW]     +16    .tdata
  [ = ]       0   +50%      +8    [LOAD #5 [RW]]
  +2.2%      +7  [ = ]       0    .shstrtab
  +4.2%      +1  [ = ]       0    [Unmapped]
  -0.3%      -2  -0.3%      -2    .gnu.version
  -0.2%      -6  -0.2%      -6    .dynstr
  -0.3%      -8  -0.3%      -8    .got.plt
  -0.3%      -8  -0.3%      -8    .hash
  -0.3%     -16  -0.3%     -16    .plt
  -0.3%     -24  -0.3%     -24    .dynsym
  -0.3%     -24  -0.3%     -24    .rela.plt
  [ = ]       0 -38.5%     -40    .tbss
  [ = ]       0  -0.0%    -160    .bss
  -0.8% -71.2Ki  -0.8% -71.3Ki    .data.rel.ro
  +2.0% +2.74Mi  +2.0% +2.74Mi    TOTAL

I also noticed that stage2-generated binaries are slightly worse in terms of runtime performance, possibly for the same reason:

$ time ./stage2-release/bin/zig build -p stage3 -Dskip-install-lib-files -Denable-llvm

real	0m44.388s
user	0m43.291s
sys	0m1.522s

$ time ./stage3-release/bin/zig build -p stage3 -Dskip-install-lib-files -Denable-llvm

real	0m46.431s
user	0m45.316s
sys	0m1.748s
@andrewrk andrewrk added enhancement Solving this issue will likely involve adding new logic or components to the codebase. optimization frontend Tokenization, parsing, AstGen, Sema, and Liveness. backend-llvm The LLVM backend outputs an LLVM IR Module. labels Apr 22, 2022
@andrewrk andrewrk added this to the 0.10.0 milestone Apr 22, 2022
@andrewrk

I have two hunches for this problem:

  • Redundant panic handling code (for example, the panic message is duplicated for each instance)
  • Lack of caching of @typeInfo results

Other than that, the next step here will probably be reducing this to a smaller compilation unit that still exhibits the problem.
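
To make the second hunch concrete, here is a minimal sketch (the struct `S` and `fieldCount` are invented for illustration; field names match the 2022-era std): every `@typeInfo` call materializes a full `std.builtin.Type` value, and if Sema does not cache that result per type, the constant data behind it can be duplicated at each use site.

```zig
const std = @import("std");

const S = struct { a: i32, b: i32, c: i32 };

// Hypothetical illustration: both uses below ask for the same
// @typeInfo(S). Without caching, the TypeInfo aggregate (field names,
// types, defaults) could be re-materialized into .rodata for each use
// instead of being emitted once and shared.
fn fieldCount(comptime T: type) usize {
    return @typeInfo(T).Struct.fields.len;
}

test "repeated @typeInfo queries" {
    try std.testing.expectEqual(@as(usize, 3), fieldCount(S));
    try std.testing.expectEqual(fieldCount(S), @typeInfo(S).Struct.fields.len);
}
```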

@andrewrk

andrewrk commented May 26, 2022

I played around with bloaty and made some progress on this issue. The comparison feature of bloaty is not useful, so instead I have put the output of stage2 and stage3 side by side. These are compiled without LLVM, in release mode, with a hack to enable build IDs, which bloaty requires (see #3047):

--- a/src/link/Elf.zig
+++ b/src/link/Elf.zig
@@ -1450,6 +1450,8 @@ fn linkWithLLD(self: *Elf, comp: *Compilation, prog_node: *std.Progress.Node) !v
         if (self.base.options.output_mode == .Exe) {
             try argv.append("-z");
             try argv.append(try std.fmt.allocPrint(arena, "stack-size={d}", .{stack_size}));
+
+            try argv.append("--build-id");
         }
 
         if (self.base.options.image_base_override) |image_base| {

stage2

[nix-shell:~/tmp]$ bloaty zig2-no-llvm --debug-file=$HOME/Downloads/zig/build-release/stage2-release-no-llvm/bin/zig  -d symbols
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  74.2%  4.95Mi  74.2%  4.97Mi    [16217 Others]
   6.7%   459Ki   6.7%   459Ki    [section .rodata]
   4.0%   274Ki   4.0%   274Ki    buildOutputType
   2.5%   170Ki   2.5%   170Ki    Sema.analyzeBodyInner
   1.7%   116Ki   1.7%   116Ki    std.fmt.Formatter(type.Type.dump).format.8702
   1.4%  95.7Ki   1.4%  95.7Ki    std.fmt.Formatter(value.Value.dump).format.19917
   1.0%  66.4Ki   1.0%  66.4Ki    arch.wasm.CodeGen.genBody
   1.0%  66.3Ki   1.0%  66.3Ki    __unnamed_262
   0.8%  56.6Ki   0.8%  56.6Ki    __unnamed_7156
   0.8%  52.6Ki   0.8%  52.6Ki    link.Wasm.flush
   0.7%  50.0Ki   0.7%  50.0Ki    arch.x86_64.CodeGen.genBody
   0.6%  43.2Ki   0.6%  43.2Ki    std.sort.sort
   0.6%  40.6Ki   0.6%  40.6Ki    type.Type.dump.13097
   0.5%  37.6Ki   0.5%  37.6Ki    std.sort.sort.2026
   0.5%  37.1Ki   0.5%  37.1Ki    std.sort.sort.9839
   0.5%  37.0Ki   0.5%  37.0Ki    std.sort.sort.2429
   0.5%  35.1Ki   0.5%  35.1Ki    [section .text]
   0.5%  34.1Ki   0.5%  34.1Ki    main
   0.5%  34.0Ki   0.5%  34.0Ki    arch.arm.CodeGen.genBody
   0.5%  31.9Ki   0.5%  31.9Ki    arch.aarch64.CodeGen.genBody
   0.4%  30.1Ki   0.4%  30.1Ki    Compilation.workerAstGenFile
 100.0%  6.68Mi 100.0%  6.70Mi    TOTAL

stage3

[nix-shell:~/tmp]$ bloaty zig3-no-llvm --debug-file=$HOME/Downloads/zig/build-release/stage3-release-no-llvm/bin/zig  -d symbols 
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  75.1%  6.14Mi  75.1%  6.16Mi    [14778 Others]
   8.0%   672Ki   8.0%   672Ki    [section .rodata]
   5.1%   424Ki   5.1%   424Ki    main.buildOutputType
   2.3%   190Ki   2.3%   190Ki    Sema.analyzeBodyInner
   1.2%   104Ki   1.2%   104Ki    arch.wasm.CodeGen.genBody
   0.8%  66.3Ki   0.8%  66.3Ki    clang_options_data.data__anon_12018
   0.7%  58.7Ki   0.7%  58.7Ki    Compilation.generateBuiltinZigSource
   0.7%  56.6Ki   0.7%  56.6Ki    link.C.zig_h__anon_38434
   0.7%  56.5Ki   0.7%  56.5Ki    arch.x86_64.CodeGen.genBody
   0.6%  54.3Ki   0.6%  54.3Ki    codegen.c.genBody
   0.6%  47.5Ki   0.6%  47.5Ki    main.main
   0.5%  43.1Ki   0.5%  43.1Ki    sort.sort__anon_111851
   0.5%  40.4Ki   0.5%  40.4Ki    sort.sort__anon_35809
   0.5%  37.9Ki   0.5%  37.9Ki    codegen.generateFunction
   0.4%  36.9Ki   0.4%  36.9Ki    arch.arm.CodeGen.genBody
   0.4%  36.6Ki   0.4%  36.6Ki    Sema.zirSwitchBlock
   0.4%  36.3Ki   0.4%  36.3Ki    arch.aarch64.CodeGen.genBody
   0.4%  33.1Ki   0.4%  33.1Ki    [section .text]
   0.4%  32.0Ki   0.4%  32.0Ki    Compilation.update
   0.4%  31.6Ki   0.4%  31.6Ki    Compilation.processOneJob
   0.4%  31.2Ki   0.4%  31.2Ki    Compilation.workerAstGenFile
 100.0%  8.19Mi 100.0%  8.20Mi    TOTAL

Observations

  • Runtime safety is not relevant here since these are release builds.
  • stage3 actually generates less garbage (fewer symbols) which is nice.
  • stage3 generates 46% more bytes of global constants (section .rodata).
  • stage2 is still winning the bloat-off, despite the Type/Value dump functions being disabled in stage3 (due to a compiler bug).
  • Congrats to stage3 for naming @embedFile("zig.h") "link.C.zig_h__anon_38434" instead of "__unnamed_7156" (see also #11643, "move zig.h to become an installation file that can be changed at runtime")
    • Likewise for the symbol name "clang_options_data.data__anon_12018" instead of "__unnamed_262"
  • It looks like nearly every function is more bloated in stage3 than in stage2. We should be able to take any single function as an example and compare the LLVM IR.

Final observation, we may want to look into this as a prerequisite:

zig/src/codegen/llvm.zig

Lines 2216 to 2217 in 97816e3

// Set parameter attributes.
// TODO: more attributes. see codegen.cpp `make_fn_llvm_value`.

Remember these are release builds - if we are not setting enough LLVM attributes then the optimizer might not be able to inline & delete code as much.

@andrewrk

andrewrk commented May 30, 2022

OK I think I narrowed this down a little bit further. I had a look at buildOutputType which is almost 2x bigger in a stage2-generated binary. One noticeable difference is that the stage1-generated LLVM IR has a total of 1224 alloca instructions (stack variables) whereas the stage2-generated LLVM IR has 2345.

In particular, the stage1-generated LLVM IR has 22 alloca instructions with an ArrayListAligned type. However the stage2-generated LLVM IR has 1753 alloca instructions with an ArrayListAligned type. That definitely seems... suboptimal. I suspect if we figure out why the LLVM backend is creating so many alloca instructions, we may have solved at least a big chunk of this issue.

Another random observation: currently we lower slice parameters as struct values in LLVM IR. However, if we lowered them as separate ptr/len parameters, we could put the "nonnull" attribute on the pointer (and "readonly" if it is a const slice). This would be an improvement over stage1. Aha, this is #561.
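
As a sketch of that idea (the split lowering is hypothetical, not what the compiler does today): a slice argument currently travels as one aggregate value, but the same information could travel as two scalar parameters, giving the optimizer a bare pointer to attach `nonnull`/`readonly` to.

```zig
// Current source-level shape: the slice is a single (ptr, len) aggregate.
fn sumSlice(xs: []const u8) usize {
    var total: usize = 0;
    for (xs) |x| total += x;
    return total;
}

// Hypothetical split lowering, written out by hand: ptr and len as
// separate parameters. At the LLVM level the pointer argument could
// then carry `nonnull` (and `readonly`, since the data is const).
fn sumSplit(ptr: [*]const u8, len: usize) usize {
    var total: usize = 0;
    for (ptr[0..len]) |x| total += x;
    return total;
}
```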

@andrewrk

Here is a small snippet of code we can look at:

fn doTheTest() !void {
    var list = std.ArrayList(?[]const u8).init(std.heap.page_allocator);
    defer list.deinit();
}

stage1 LLVM IR:

; Function Attrs: nobuiltin nounwind
define internal fastcc i16 @doTheTest() unnamed_addr #1 {
Entry:
  %result = alloca i16, align 2
  %list = alloca %"std.array_list.ArrayListAligned(?[]const u8,null)", align 8
  call fastcc void @"std.array_list.ArrayListAligned(?[]const u8,null).init"(%"std.array_list.ArrayListAligned(?[]const u8,null)"* sret(%"std.array_list.ArrayListAligned(?[]const u8,null)") %list, %std.mem.Allocator* @1)
  store i16 0, i16* %result, align 2
  call fastcc void @"std.array_list.ArrayListAligned(?[]const u8,null).deinit"(%"std.array_list.ArrayListAligned(?[]const u8,null)"* %list)
  ret i16 0
}

stage2 LLVM IR:

; Function Attrs: nounwind
define internal fastcc i16 @test2.doTheTest() unnamed_addr #0 {
Entry:
  %0 = alloca %"array_list.ArrayListAligned(?[]const u8,null)", align 8
  %1 = alloca %"array_list.ArrayListAligned(?[]const u8,null)", align 8
  %2 = alloca %"array_list.ArrayListAligned(?[]const u8,null)", align 8
  call fastcc void @"array_list.ArrayListAligned(?[]const u8,null).init"(%"array_list.ArrayListAligned(?[]const u8,null)"* sret(%"array_list.ArrayListAligned(?[]const u8,null)") %1, %mem.Allocator.mem.Allocator { i8* undef, %mem.Allocator.VTable* @heap.PageAllocator.vtable })
  %3 = bitcast %"array_list.ArrayListAligned(?[]const u8,null)"* %2 to i8*
  %4 = bitcast %"array_list.ArrayListAligned(?[]const u8,null)"* %1 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 8 %3, i8* align 8 %4, i64 40, i1 false)
  %5 = bitcast %"array_list.ArrayListAligned(?[]const u8,null)"* %0 to i8*
  %6 = bitcast %"array_list.ArrayListAligned(?[]const u8,null)"* %2 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 8 %5, i8* align 8 %6, i64 40, i1 false)
  call fastcc void @"array_list.ArrayListAligned(?[]const u8,null).deinit"(%"array_list.ArrayListAligned(?[]const u8,null)"* %0)
  ret i16 0
}

AIR:

# Begin Function AIR: doTheTest:
# Total AIR+Liveness bytes: 980B
# AIR Instructions:         64 (576B)
# AIR Extra Data:           13 (52B)
# AIR Values Bytes:         26 (208B)
# Liveness tomb_bits:       32B
# Liveness Extra Data:      0 (0B)
# Liveness special table:   0 (0B)
  %0 = const_ty((inferred_alloc_mut))
  %2 = const_ty(*const type)
  %3 = constant(*const type, (decl_ref Module.Decl.Index(136)))
  %4 = constant(type, (struct decl=Module.Decl.Index(0)))
  %5 = const_ty(*const fn(type) type)
  %6 = constant(*const fn(type) type, (decl_ref Module.Decl.Index(3)))
  %7 = const_ty(fn(type) type)
  %8 = constant(fn(type) type, (function decl=Module.Decl.Index(228)))
  %9 = const_ty(?[]const u8)
  %11 = const_ty(*type)
  %12 = constant(*type, (decl_ref_mut Module.Decl.Index(253)))
  %13 = const_ty(*const fn(type, ?u29) type)
  %14 = constant(*const fn(type, ?u29) type, (decl_ref Module.Decl.Index(229)))
  %15 = const_ty(fn(type, ?u29) type)
  %16 = constant(fn(type, ?u29) type, (function decl=Module.Decl.Index(229)))
  %17 = const_ty(?u29)
  %19 = const_ty(u29)
  %20 = const_ty(?u29)
  %21 = const_ty(?u29)
  %22 = constant(?u29, null)
  %23 = const_ty(*const type)
  %24 = constant(*const type, (decl_ref Module.Decl.Index(254)))
  %25 = constant(type, (struct decl=Module.Decl.Index(254)))
  %26 = constant(type, (struct decl=Module.Decl.Index(254)))
  %27 = const_ty(*const type)
  %28 = constant(*const type, (decl_ref Module.Decl.Index(298)))
  %29 = constant(type, (struct decl=Module.Decl.Index(254)))
  %30 = const_ty(*const fn((struct decl=Module.Decl.Index(511))) (struct decl=Module.Decl.Index(254)))
  %31 = constant(*const fn((struct decl=Module.Decl.Index(511))) (struct decl=Module.Decl.Index(254)), (decl_ref Module.Decl.Index(257)))
  %32 = const_ty(fn((struct decl=Module.Decl.Index(511))) (struct decl=Module.Decl.Index(254)))
  %33 = constant(fn((struct decl=Module.Decl.Index(511))) (struct decl=Module.Decl.Index(254)), (function decl=Module.Decl.Index(257)))
  %34 = const_ty((struct decl=Module.Decl.Index(511)))
  %35 = const_ty(*const type)
  %36 = constant(*const type, (decl_ref Module.Decl.Index(136)))
  %37 = constant(type, (struct decl=Module.Decl.Index(0)))
  %38 = const_ty(*const type)
  %39 = constant(*const type, (decl_ref Module.Decl.Index(66)))
  %40 = constant(type, (struct decl=Module.Decl.Index(560)))
  %41 = const_ty(*const (struct decl=Module.Decl.Index(511)))
  %42 = constant(*const (struct decl=Module.Decl.Index(511)), (decl_ref Module.Decl.Index(587)))
  %43 = const_ty((struct decl=Module.Decl.Index(511)))
  %44 = constant((struct decl=Module.Decl.Index(511)), (aggregate))
  %45 = const_ty((struct decl=Module.Decl.Index(254)))
  %47 = const_ty(*(struct decl=Module.Decl.Index(254)))
  %49 = const_ty((struct decl=Module.Decl.Index(254)))
  %51 = const_ty((struct decl=Module.Decl.Index(254)))
  %52 = const_ty(*const fn((struct decl=Module.Decl.Index(254))) void)
  %53 = constant(*const fn((struct decl=Module.Decl.Index(254))) void, (decl_ref Module.Decl.Index(259)))
  %54 = const_ty(fn((struct decl=Module.Decl.Index(254))) void)
  %55 = constant(fn((struct decl=Module.Decl.Index(254))) void, (function decl=Module.Decl.Index(259)))
  %56 = const_ty((struct decl=Module.Decl.Index(254)))
  %58 = const_ty(bound_fn)
  %59 = constant(bound_fn, (bound_fn %Zir.Inst.Ref(130)(%Zir.Inst.Ref(132)))
  %61 = const_ty((error_set_inferred func=Module.Decl.Index(151))!void)
  %62 = constant((error_set_inferred func=Module.Decl.Index(151))!void, (eu_payload) {})

  %1 = alloc(*(struct decl=Module.Decl.Index(254)))
  %46 = call(%33!, [%44!])
  %48 = bitcast(*(struct decl=Module.Decl.Index(254)), %1)
  %50!= store(%48!, %46!)
  %57 = load((struct decl=Module.Decl.Index(254)), %1!)
  %60!= call(%55!, [%57!])
  %63!= ret(%62!)
# End Function AIR: doTheTest

Observations

  • In the stage2 LLVM IR there are two extra alloca instructions for the array list
    • One is used for the return value of the function call to init(), which is then copied to the real array list local variable.
    • The other is used for the parameter to deinit(). A copy of the real array list local variable is made solely for the deinit() call.
  • The way that defer deinit() is currently lowered inserts a call at every exit point in the scope, including every try and return. This explains why buildOutputType - a pretty long function with many exit points - creates so many ArrayList objects: a copy is being made for every deinit() call.

We should be able to improve the LLVM backend by lowering these instructions differently so that in init() the function call directly initializes the canonical local variable that represents the struct, and the deinit() function directly takes the pointer to the struct instead of a copy.
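
Written as ordinary Zig with an explicit out-pointer (a hedged sketch; `initInto` is an invented name standing in for "the sret call writes straight into the caller's canonical slot", and the real fix lives in the LLVM backend, not user code):

```zig
const std = @import("std");

// Invented helper: init() result lands directly in the caller's
// canonical variable, rather than in a temporary that gets memcpy'd.
fn initInto(out: *std.ArrayList(u8), allocator: std.mem.Allocator) void {
    out.* = std.ArrayList(u8).init(allocator);
}

test "no intermediate copies" {
    var list: std.ArrayList(u8) = undefined;
    initInto(&list, std.testing.allocator); // init lands in `list` directly
    defer list.deinit(); // should operate on &list, not a 40-byte copy
    try list.append('x');
    try std.testing.expectEqual(@as(usize, 1), list.items.len);
}
```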

I also noticed that the deinit() function in stage1 has nonnull readonly align 8 on the parameter, however the stage2 LLVM IR only has nonnull.

@andrewrk

const S = struct {
    a: i32,
    b: i32,
    c: i32,
};

test {
    var s: S = undefined;
    _ = s;
}

I also noticed that this example emits a memset even when compiled with -OReleaseFast. This might be the actual main problem.
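
For context (a sketch; 0xaa is the fill pattern Zig's safe modes use for undefined memory): `= undefined` is allowed to lower to a byte-pattern memset in Debug/ReleaseSafe to help catch reads of uninitialized data, but in -OReleaseFast no store should be emitted at all.

```zig
const S = struct { a: i32, b: i32, c: i32 };

test "undefined local" {
    // Debug/ReleaseSafe: this may become memset(&s, 0xaa, @sizeOf(S))
    // so accidental reads of undefined memory are easier to spot.
    // ReleaseFast/ReleaseSmall: the memset should be elided entirely,
    // which is what this issue observed going wrong.
    var s: S = undefined;
    s.a = 1; // later writes make any such memset dead anyway
    _ = s;
}
```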

andrewrk added a commit that referenced this issue May 31, 2022
Generally, the load instruction may need to make a copy of an
isByRef=true value, such as in the case of the following code:

```zig
pub fn swap(comptime T: type, a: *T, b: *T) void {
    const tmp = a.*;
    a.* = b.*;
    b.* = tmp;
}
```

However, it only needs to do so if there are any instructions which can
possibly write to memory. When calling functions with isByRef=true
parameters, the AIR code that is generated looks like loads followed
directly by call.

This allows for a peephole optimization when lowering loads: if the load
instruction operates on an isByRef=true type and dies before any side effects
occur, then we can safely lower the load as a no-op that returns its
operand.

This is one out of three changes I intend to make to address #11498.
However I will put these changes in separate branches and merge them
separately so that we can have three independent points on the perf
charts.
andrewrk added a commit that referenced this issue May 31, 2022
This is one out of three changes I intend to make to address #11498.
However I will put these changes in separate branches and merge them
separately so that we can have three independent points on the perf
charts.
@andrewrk

Some binary size numbers after those two commits:

           stage1: 6925560 bytes
    stage2 master: 8500392 bytes
llvm-elide-memset: 8479208 bytes
  llvm-elide-load: 8329608 bytes

So, these two things helped but did not bring us all the way home yet.

@andrewrk

andrewrk commented May 31, 2022

Worth noting: the above changes seem to have had little to no effect on the relative size of buildOutputType, which was the function whose LLVM IR I was inspecting to learn what kinds of changes to make:

stage2

[nix-shell:~/tmp]$ bloaty zig2 --debug-file=$HOME/dev/zig/build-release/stage2-release/bin/zig  -d symbols
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  74.5%  4.94Mi  74.5%  4.96Mi    [16239 Others]
   6.8%   459Ki   6.7%   459Ki    [section .rodata]
   4.1%   277Ki   4.1%   277Ki    buildOutputType
   2.0%   132Ki   1.9%   132Ki    Sema.analyzeBodyInner
   1.7%   116Ki   1.7%   116Ki    std.fmt.Formatter(type.Type.dump).format.8715
   1.4%  95.7Ki   1.4%  95.7Ki    std.fmt.Formatter(value.Value.dump).format.19926
   1.0%  66.5Ki   1.0%  66.5Ki    arch.wasm.CodeGen.genBody
   1.0%  66.3Ki   1.0%  66.3Ki    __unnamed_263
   0.8%  56.6Ki   0.8%  56.6Ki    __unnamed_7179
   0.8%  52.6Ki   0.8%  52.6Ki    link.Wasm.flush
   0.7%  50.0Ki   0.7%  50.0Ki    arch.x86_64.CodeGen.genBody
   0.6%  43.2Ki   0.6%  43.2Ki    std.sort.sort
   0.6%  40.6Ki   0.6%  40.6Ki    type.Type.dump.13124
   0.6%  37.6Ki   0.6%  37.6Ki    std.sort.sort.2032
   0.5%  37.1Ki   0.5%  37.1Ki    std.sort.sort.9851
   0.5%  37.0Ki   0.5%  37.0Ki    std.sort.sort.2435
   0.5%  35.2Ki   0.5%  35.2Ki    [section .text]
   0.5%  34.1Ki   0.5%  34.1Ki    main
   0.5%  34.0Ki   0.5%  34.0Ki    arch.arm.CodeGen.genBody
   0.5%  32.0Ki   0.5%  32.0Ki    arch.aarch64.CodeGen.genBody
   0.4%  30.1Ki   0.4%  30.1Ki    Compilation.workerAstGenFile
 100.0%  6.64Mi 100.0%  6.65Mi    TOTAL

stage3

[nix-shell:~/tmp]$ bloaty zig3 --debug-file=$HOME/dev/zig/build-release/stage3-release/bin/zig  -d symbols
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  75.2%  6.00Mi  75.2%  6.01Mi    [14783 Others]
   8.4%   683Ki   8.3%   683Ki    [section .rodata]
   5.4%   438Ki   5.4%   438Ki    main.buildOutputType
   1.7%   134Ki   1.6%   134Ki    Sema.analyzeBodyInner
   1.2%  95.3Ki   1.2%  95.3Ki    arch.wasm.CodeGen.genBody
   0.8%  66.3Ki   0.8%  66.3Ki    clang_options_data.data__anon_12016
   0.7%  58.4Ki   0.7%  58.4Ki    Compilation.generateBuiltinZigSource
   0.7%  56.6Ki   0.7%  56.6Ki    link.C.zig_h__anon_37409
   0.7%  53.4Ki   0.7%  53.4Ki    codegen.c.genBody
   0.6%  51.4Ki   0.6%  51.4Ki    arch.x86_64.CodeGen.genBody
   0.6%  45.9Ki   0.6%  45.9Ki    main.main
   0.5%  43.1Ki   0.5%  43.1Ki    RangeSet.spans
   0.5%  38.4Ki   0.5%  38.4Ki    sort.sort__anon_33752
   0.4%  36.2Ki   0.4%  36.2Ki    codegen.generateFunction
   0.4%  35.0Ki   0.4%  35.0Ki    Sema.zirSwitchBlock
   0.4%  34.3Ki   0.4%  34.3Ki    arch.arm.CodeGen.genBody
   0.4%  31.8Ki   0.4%  31.8Ki    [section .text]
   0.4%  31.8Ki   0.4%  31.8Ki    arch.aarch64.CodeGen.genBody
   0.4%  31.0Ki   0.4%  31.0Ki    Compilation.processOneJob
   0.4%  30.4Ki   0.4%  30.4Ki    Compilation.update
   0.4%  30.4Ki   0.4%  30.4Ki    Compilation.workerAstGenFile
 100.0%  7.98Mi 100.0%  7.99Mi    TOTAL

andrewrk added a commit that referenced this issue May 31, 2022
This moves some logic from resolveLlvmFunction to updateFunc and takes
advantage of the iteration we already do that takes into account C ABI
lowering, making LLVM parameter attributes accurate for C ABI functions
as well as our own unspecified calling convention.

Related to #11498.
@andrewrk

andrewrk commented May 31, 2022

With the above commit, we have closed the perf gap:

[nix-shell:~/dev/zig/build-release]$ time ./stage1-release/bin/zig build -p stage2 -Denable-llvm -Dskip-install-lib-files

real	0m51.159s
user	0m48.557s
sys	0m5.381s

[nix-shell:~/dev/zig/build-release]$ time ./stage2-release/bin/zig build -p stage3 -Dskip-install-lib-files -Denable-llvm

real	0m42.788s
user	0m42.482s
sys	0m1.756s

[nix-shell:~/dev/zig/build-release]$ time ./stage3-release/bin/zig build -p stage3 -Dskip-install-lib-files -Denable-llvm

real	0m42.640s
user	0m41.547s
sys	0m1.752s

However the bloat gap is still there, so I think this is still worth investigating.

andrewrk added a commit that referenced this issue Jun 1, 2022
Not implemented yet is enhancements to coerceInMemory to account for
noalias parameters.

Related to #11498.
@andrewrk

andrewrk commented Jun 1, 2022

Well, maybe the latest three master branch commits helped perf (we will see on https://ziglang.org/perf/ shortly) but they did not help with the binary bloat:

   stage2 master: 8365000 bytes
llvm-param-attrs: 8366968 bytes
         noalias: 8332328 bytes
    slice params: 8375736 bytes
          stage1: 6956744 bytes

@andrewrk

andrewrk commented Jun 1, 2022

I spent some time examining stage1-generated LLVM IR vs stage2-generated LLVM IR after optimizations with -OReleaseFast. Specifically, I examined the function buildOutputType, which has 2x as many bytes in stage2-generated binaries as in stage1-generated binaries. The LLVM IR is generally remarkably similar. However, I did notice one distinct difference: the stage1-generated LLVM IR was able to construct this pattern for cleanup code:

    if (!init_a()) goto cleanup_a;
    if (!init_b()) goto cleanup_b;
    if (!init_c()) goto cleanup_c;
cleanup_all:
    deinit_c();
cleanup_c:
    deinit_b();
cleanup_b:
    deinit_a();
cleanup_a:
    return;

whereas the stage2-generated LLVM IR had this pattern:

    if (!init_a()) {
        return;
    }
    if (!init_b()) {
        deinit_a();
        return;
    }
    if (!init_c()) {
        deinit_b();
        deinit_a();
        return;
    }
    deinit_c();
    deinit_b();
    deinit_a();
    return;

This would explain why buildOutputType is one of the most bloated functions, given that it has many defer expressions at the beginning of a long function, and many instances of try. It would also potentially explain the bloat in general, given that defer, errdefer, and related functionality is pervasive.

Related: #283


So, my list of untested ideas is now:

  • peephole optimization for byref struct returned from function call to avoid superfluous alloca instructions
  • lower try more efficiently by not having a superfluous "then" block
  • llvm optimizations in stage1 llvm IR are able to construct cleanup labels that goto the next one; stage2-generated LLVM IR ends up duplicating the cleanup logic N times.
  • audit all callsites of store - airStore is not the only place that needs logic to elide stores of undef.
  • anywhere we have a byval value and need a byref value, llvm backend makes a stack-local copy. but in the case that the byval value is a global, we should be able to just take the pointer to it instead of copying it into a local.
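
To illustrate the try/defer interaction behind the cleanup-duplication item (helper names invented): each `try` is an exit point, and every exit point must run the pending `defer`s, so without merged cleanup blocks the deinit code can be duplicated once per exit.

```zig
const std = @import("std");

fn step(fail: bool) !u32 {
    if (fail) return error.Failed;
    return 1;
}

// Four exit points (three `try`s plus the final return), and each one
// must run `list.deinit()`. If the backend cannot share cleanup blocks
// the way stage1's goto chain does, the deinit code is emitted once
// per exit point.
fn manyExitPoints(allocator: std.mem.Allocator) !u32 {
    var list = std.ArrayList(u32).init(allocator);
    defer list.deinit();
    var total: u32 = 0;
    total += try step(false);
    total += try step(false);
    total += try step(false);
    return total;
}

test "exit-point multiplication" {
    try std.testing.expectEqual(@as(u32, 3), try manyExitPoints(std.testing.allocator));
}
```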

@andrewrk

andrewrk commented Jul 19, 2022

As of latest master (0e26c61) the bloat issue is still here and still pretty severe:

  • stage2-release + LLVM: 223M
  • stage3-release + LLVM: 236M

Edit: I forgot to strip. It's mostly extra debug info, which is actually nice. Stripped:

-rwxr-xr-x 1 andy users 149215232 Jul 18 22:01 stage2
-rwxr-xr-x 1 andy users 151276648 Jul 18 22:02 stage3
-rwxr-xr-x 1 andy users 149394920 Jul 18 22:00 stage3-byref

The stage3-byref build, which has the bloat issue solved, is #12164 + #12145.

andrewrk added a commit that referenced this issue Jul 19, 2022
This branch originally started out as a potential workaround to
address #11450. It did not solve that problem, however, it did end
up fixing #11498!
@andrewrk

After the merge of #12164, landed in 6fab6c3:

-rwxr-xr-x 1 andy users 149215424 Jul 19 12:00 stage2
-rwxr-xr-x 1 andy users 149443272 Jul 19 12:00 stage3

With a 0.2% difference in stripped binary size, I think we can call this closed.

Speed:

[nix-shell:~/dev/zig/build-release]$ hyperfine 'stage1/bin/zig build -Dskip-install-lib-files' 'stage2-release/bin/zig build -Dskip-install-lib-files' 'stage3-release/bin/zig build -Dskip-install-lib-files'  'stage3-release-byref/bin/zig build -Dskip-install-lib-files' 'stage3-release-only-byref/bin/zig build -Dskip-install-lib-files' --warmup 2 -p 'rm -rf ../zig-cache ../zig-out'
Benchmark 1: stage1/bin/zig build -Dskip-install-lib-files
  Time (mean ± σ):     53.946 s ±  0.476 s    [User: 53.181 s, System: 2.874 s]
  Range (min … max):   53.274 s … 54.611 s    10 runs
 
Benchmark 2: stage2-release/bin/zig build -Dskip-install-lib-files
  Time (mean ± σ):     42.926 s ±  0.436 s    [User: 42.235 s, System: 1.101 s]
  Range (min … max):   42.134 s … 43.703 s    10 runs
 
Benchmark 3: stage3-release/bin/zig build -Dskip-install-lib-files
  Time (mean ± σ):     43.293 s ±  0.476 s    [User: 42.608 s, System: 1.117 s]
  Range (min … max):   42.514 s … 43.928 s    10 runs
 
Benchmark 4: stage3-release-byref/bin/zig build -Dskip-install-lib-files
  Time (mean ± σ):     35.379 s ±  0.569 s    [User: 34.639 s, System: 1.168 s]
  Range (min … max):   34.818 s … 36.647 s    10 runs
 
Benchmark 5: stage3-release-only-byref/bin/zig build -Dskip-install-lib-files
  Time (mean ± σ):     35.538 s ±  0.345 s    [User: 34.843 s, System: 1.127 s]
  Range (min … max):   35.040 s … 36.058 s    10 runs
 
Summary
  'stage3-release-byref/bin/zig build -Dskip-install-lib-files' ran
    1.00 ± 0.02 times faster than 'stage3-release-only-byref/bin/zig build -Dskip-install-lib-files'
    1.21 ± 0.02 times faster than 'stage2-release/bin/zig build -Dskip-install-lib-files'
    1.22 ± 0.02 times faster than 'stage3-release/bin/zig build -Dskip-install-lib-files'
    1.52 ± 0.03 times faster than 'stage1/bin/zig build -Dskip-install-lib-files'

