Llama-7b Crashes during IREE Compilation #16317

Closed
vivekkhandelwal1 opened this issue Feb 5, 2024 · 10 comments · Fixed by llvm/llvm-project#80848
Labels: bug 🐞 Something isn't working

@vivekkhandelwal1
Member

What happened?

During compilation, I'm getting the following crash:

iree-compile: /home/azureuser/work/iree-vivek/third_party/llvm-project/mlir/lib/Dialect/Tensor/IR/TensorOps.cpp:843: static void mlir::tensor::EmptyOp::build(mlir::OpBuilder &, mlir::OperationState &, ArrayRef<int64_t>, mlir::Type, mlir::Attribute): Assertion `all_of(staticShape, [](int64_t sz) { return !ShapedType::isDynamic(sz); }) && "expected only static sizes"' failed.
Please report issues to https://github.com/openxla/iree/issues and include the crash backtrace.
 #0 0x00007fd3a8bc5add llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /home/azureuser/work/iree-vivek/third_party/llvm-project/llvm/lib/Support/Unix/Signals.inc:723:11
 #1 0x00007fd3a8bc5fcb PrintStackTraceSignalHandler(void*) /home/azureuser/work/iree-vivek/third_party/llvm-project/llvm/lib/Support/Unix/Signals.inc:798:1
 #2 0x00007fd3a8bc3ff6 llvm::sys::RunSignalHandlers() /home/azureuser/work/iree-vivek/third_party/llvm-project/llvm/lib/Support/Signals.cpp:105:5
 #3 0x00007fd3a8bc67e5 SignalHandler(int) /home/azureuser/work/iree-vivek/third_party/llvm-project/llvm/lib/Support/Unix/Signals.inc:413:1
 #4 0x00007fd39ca42520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
 #5 0x00007fd39ca969fc pthread_kill (/lib/x86_64-linux-gnu/libc.so.6+0x969fc)
 #6 0x00007fd39ca42476 gsignal (/lib/x86_64-linux-gnu/libc.so.6+0x42476)
 #7 0x00007fd39ca287f3 abort (/lib/x86_64-linux-gnu/libc.so.6+0x287f3)
 #8 0x00007fd39ca2871b (/lib/x86_64-linux-gnu/libc.so.6+0x2871b)
 #9 0x00007fd39ca39e96 (/lib/x86_64-linux-gnu/libc.so.6+0x39e96)
#10 0x00007fd3b079dcf8 mlir::tensor::EmptyOp::build(mlir::OpBuilder&, mlir::OperationState&, llvm::ArrayRef<long>, mlir::Type, mlir::Attribute) /home/azureuser/work/iree-vivek/third_party/llvm-project/mlir/lib/Dialect/Tensor/IR/TensorOps.cpp:844:9
#11 0x00007fd3a98d1731 mlir::tensor::EmptyOp mlir::OpBuilder::create<mlir::tensor::EmptyOp, llvm::SmallVector<long, 6u>&, mlir::Type&>(mlir::Location, llvm::SmallVector<long, 6u>&, mlir::Type&) /home/azureuser/work/iree-vivek/third_party/llvm-project/mlir/include/mlir/IR/Builders.h:500:5
#12 0x00007fd3ae30d36e mlir::linalg::GeneralizeOuterUnitDimsPackOpPattern::matchAndRewrite(mlir::tensor::PackOp, mlir::PatternRewriter&) const /home/azureuser/work/iree-vivek/third_party/llvm-project/mlir/lib/Dialect/Linalg/Transforms/Transforms.cpp:1214:26
#13 0x00007fd3ac06246b mlir::detail::OpOrInterfaceRewritePatternBase<mlir::tensor::PackOp>::matchAndRewrite(mlir::Operation*, mlir::PatternRewriter&) const /home/azureuser/work/iree-vivek/third_party/llvm-project/mlir/include/mlir/IR/PatternMatch.h:330:12
#14 0x00007fd3afec5886 mlir::PatternApplicator::matchAndRewrite(mlir::Operation*, mlir::PatternRewriter&, llvm::function_ref<bool (mlir::Pattern const&)>, llvm::function_ref<void (mlir::Pattern const&)>, llvm::function_ref<mlir::LogicalResult (mlir::Pattern const&)>)::$_1::operator()() const /home/azureuser/work/iree-vivek/third_party/llvm-project/mlir/lib/Rewrite/PatternApplicator.cpp:208:31
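For context on what the assertion checks: the static-shape builder of tensor.empty requires every extent to be a concrete size, with dynamic extents encoded as a sentinel value. A minimal Python sketch of that invariant (illustrative only — the sentinel mirrors MLIR's ShapedType::kDynamic, and the function names here are made up, not the MLIR API):

```python
# Sketch of the invariant that fires in TensorOps.cpp:843.
# MLIR encodes a dynamic extent ('?') as a sentinel; the static-shape
# builder of tensor.empty requires every size to be static.
K_DYNAMIC = -9223372036854775808  # int64 min; assumed to mirror ShapedType::kDynamic

def is_dynamic(sz: int) -> bool:
    return sz == K_DYNAMIC

def build_empty(static_shape):
    # Mirrors: assert all_of(staticShape, [](int64_t sz){ return !isDynamic(sz); })
    assert all(not is_dynamic(sz) for sz in static_shape), "expected only static sizes"
    return tuple(static_shape)

build_empty([1, 1, 1, 8, 1])        # fine: all extents static
try:
    build_empty([K_DYNAMIC, 5, 1])  # a dynamic extent reaches the static-only builder
except AssertionError as e:
    print(e)                        # prints "expected only static sizes"
```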

Steps to reproduce your issue

IR is available at: https://gist.github.com/vivekkhandelwal1/185ed5eb3f76c7d88fd37a489185d0fb

Command to reproduce the crash:

./build/tools/iree-compile --iree-hal-target-backends=llvm-cpu --iree-input-type=tm_tensor --iree-util-zero-fill-elided-attrs llama_7b_linalg_elided.mlir -o llama_7b_fp32.vmfb

What component(s) does this issue relate to?

MLIR, Compiler

Version information

No response

Additional context

No response

@vivekkhandelwal1 vivekkhandelwal1 added the bug 🐞 Something isn't working label Feb 5, 2024
@benvanik
Collaborator

benvanik commented Feb 5, 2024

guessing this is an upstream bug in linalg::GeneralizeOuterUnitDimsPackOpPattern

@hanhanW hanhanW self-assigned this Feb 5, 2024
@hanhanW
Contributor

hanhanW commented Feb 5, 2024

I cannot reproduce the issue when I specify the CPU as cascadelake. Here is the command I used:

iree-compile --output-format=vm-bytecode \
  --iree-hal-target-backends=llvm-cpu \
  --iree-input-type=tm_tensor \
  --iree-util-zero-fill-elided-attrs \
  --iree-llvmcpu-target-cpu=cascadelake \
  --iree-llvmcpu-target-triple=x86_64-unknown-linux-gnu \
  ~/llama_7b_linalg_elided.mlir -o /tmp/z.vmfb

What is the target CPU?

@vivekkhandelwal1
Member Author

> I cannot reproduce the issue when I specify the CPU as cascadelake. Here is the command I used:
>
> iree-compile --output-format=vm-bytecode \
>   --iree-hal-target-backends=llvm-cpu \
>   --iree-input-type=tm_tensor \
>   --iree-util-zero-fill-elided-attrs \
>   --iree-llvmcpu-target-cpu=cascadelake \
>   --iree-llvmcpu-target-triple=x86_64-unknown-linux-gnu \
>   ~/llama_7b_linalg_elided.mlir -o /tmp/z.vmfb
>
> What is the target CPU?

This is the CPU that I'm running this model on: Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz (Vendor ID: GenuineIntel).

@hanhanW
Contributor

hanhanW commented Feb 5, 2024

Can you share the output of lscpu with me?

@vivekkhandelwal1
Member Author

> Can you share the output of lscpu with me?

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  64
  On-line CPU(s) list:   0-63
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
    CPU family:          6
    Model:               79
    Thread(s) per core:  2
    Core(s) per socket:  16
    Socket(s):           2
    Stepping:            1
    BogoMIPS:            4589.37
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arch_capabilities
Virtualization features: 
  Hypervisor vendor:     Microsoft
  Virtualization type:   full
Caches (sum of all):     
  L1d:                   1 MiB (32 instances)
  L1i:                   1 MiB (32 instances)
  L2:                    8 MiB (32 instances)
  L3:                    100 MiB (2 instances)
NUMA:                    
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-31
  NUMA node1 CPU(s):     32-63
Vulnerabilities:         
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Mitigation; PTE Inversion
  Mds:                   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
  Meltdown:              Mitigation; PTI
  Mmio stale data:       Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
  Retbleed:              Not affected
  Spec rstack overflow:  Not affected
  Spec store bypass:     Vulnerable
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown

@vivekkhandelwal1
Member Author

@hanhanW, using this flag --iree-llvmcpu-target-cpu=cascadelake worked.

@hanhanW
Contributor

hanhanW commented Feb 5, 2024

Okay, I can reproduce the issue if I don't specify a target CPU. There is definitely a bug in the upstream pattern.

Your CPU is Broadwell, and I can also reproduce it with --iree-llvmcpu-target-cpu=broadwell. I will take a look at this. Thanks for all the info!

@vivekkhandelwal1
Member Author

> broadwell

Thank you @hanhanW!

@hanhanW
Contributor

hanhanW commented Feb 5, 2024

The quick workaround (for functionality) is to delete these lines: https://github.com/openxla/iree/blob/af387d39d2dd553d03943c6a698cc15b6a8fc483/compiler/src/iree/compiler/Codegen/Common/DecomposePackUnPackOps.cpp#L145-L155 (I already shared this with Kumar).

I have a smaller repro now: iree-opt --pass-pipeline="builtin.module(func.func(iree-codegen-decompose-pack-unpack-ops))" ~/repro.mlir. I will take a deeper look tomorrow.

func.func @main_graph_dispatch_17_pack_f32() {
  %c0 = arith.constant 0 : index
  %c5 = arith.constant 5 : index
  %c1 = arith.constant 1 : index
  %cst = arith.constant 0.000000e+00 : f32
  %c3200 = arith.constant 3200 : index
  %c88320 = arith.constant 88320 : index
  %c32 = arith.constant 32 : index
  %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c3200) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<?x5x5xf32>>{%c32}
  %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c88320) : !flow.dispatch.tensor<writeonly:tensor<?x1x5x8x1xf32>>{%c32}
  %workgroup_id_x = hal.interface.workgroup.id[0] : index
  %workgroup_count_x = hal.interface.workgroup.count[0] : index
  %workgroup_id_y = hal.interface.workgroup.id[1] : index
  %workgroup_count_y = hal.interface.workgroup.count[1] : index
  %workgroup_id_z = hal.interface.workgroup.id[2] : index
  %workgroup_count_z = hal.interface.workgroup.count[2] : index
  %2 = affine.apply affine_map<()[s0] -> (s0 * 64)>()[%workgroup_id_z]
  %3 = affine.apply affine_map<()[s0] -> (s0 * 64)>()[%workgroup_count_z]
  scf.for %arg0 = %2 to %c32 step %3 {
    scf.for %arg1 = %workgroup_id_y to %c1 step %workgroup_count_y {
      %4 = affine.apply affine_map<()[s0] -> (s0 * 2)>()[%workgroup_id_x]
      %5 = affine.apply affine_map<()[s0] -> (s0 * 2)>()[%workgroup_count_x]
      scf.for %arg2 = %4 to %c5 step %5 {
        %6 = affine.min affine_map<(d0) -> (-d0 + 5, 2)>(%arg2)
        %7 = flow.dispatch.tensor.load %1, offsets = [%arg0, %arg1, %arg2, 0, 0], sizes = [32, 1, %6, 8, 1], strides = [1, 1, 1, 1, 1] : !flow.dispatch.tensor<writeonly:tensor<?x1x5x8x1xf32>>{%c32} -> tensor<32x1x?x8x1xf32>
        %8 = affine.min affine_map<(d0) -> (-d0 + 32, 64)>(%arg0)
        %9 = affine.apply affine_map<(d0) -> (d0 * 8)>(%arg1)
        %10 = flow.dispatch.tensor.load %0, offsets = [%arg0, %9, %arg2], sizes = [%8, 5, %6], strides = [1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<?x5x5xf32>>{%c32} -> tensor<?x5x?xf32>
        %11 = scf.for %arg3 = %c0 to %c32 step %c1 iter_args(%arg4 = %7) -> (tensor<32x1x?x8x1xf32>) {
          %12 = scf.for %arg5 = %c0 to %6 step %c1 iter_args(%arg6 = %arg4) -> (tensor<32x1x?x8x1xf32>) {
            %13 = affine.min affine_map<(d0, d1) -> (1, d0 - d1)>(%8, %arg3)
            %extracted_slice = tensor.extract_slice %10[%arg3, 0, %arg5] [%13, 5, 1] [1, 1, 1] : tensor<?x5x?xf32> to tensor<?x5x1xf32>
            %extracted_slice_0 = tensor.extract_slice %arg6[%arg3, 0, %arg5, 0, 0] [1, 1, 1, 8, 1] [1, 1, 1, 1, 1] : tensor<32x1x?x8x1xf32> to tensor<1x1x1x8x1xf32>
            %pack = tensor.pack %extracted_slice padding_value(%cst : f32) outer_dims_perm = [0, 1, 2] inner_dims_pos = [1, 2] inner_tiles = [8, 1] into %extracted_slice_0 {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 1, 2], [1, 1, 1]]>} : tensor<?x5x1xf32> -> tensor<1x1x1x8x1xf32>
            %inserted_slice = tensor.insert_slice %pack into %arg6[%arg3, 0, %arg5, 0, 0] [1, 1, 1, 8, 1] [1, 1, 1, 1, 1] : tensor<1x1x1x8x1xf32> into tensor<32x1x?x8x1xf32>
            scf.yield %inserted_slice : tensor<32x1x?x8x1xf32>
          }
          scf.yield %12 : tensor<32x1x?x8x1xf32>
        }
        flow.dispatch.tensor.store %11, %1, offsets = [%arg0, %arg1, %arg2, 0, 0], sizes = [32, 1, %6, 8, 1], strides = [1, 1, 1, 1, 1] : tensor<32x1x?x8x1xf32> -> !flow.dispatch.tensor<writeonly:tensor<?x1x5x8x1xf32>>{%c32}
      }
    }
  }
  return
}
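The shape detail that connects this repro to the crash (a hedged reading; the actual pattern logic lives in the upstream Transforms.cpp): the tensor.pack source is tensor<?x5x1xf32>, whose leading extent comes from an affine.min and is therefore dynamic, so any rewrite that forwards the source dims into tensor.empty's static-shape builder would trip the "expected only static sizes" assertion. Sketched in Python, with DYNAMIC standing in for '?':

```python
# Shapes from the repro above, reduced to plain lists.
DYNAMIC = -9223372036854775808  # assumption: mirrors MLIR's ShapedType::kDynamic sentinel

src_shape = [DYNAMIC, 5, 1]     # tensor<?x5x1xf32>; dim 0 is the affine.min result %13
dest_shape = [1, 1, 1, 8, 1]    # tensor<1x1x1x8x1xf32>, fully static

# Forwarding the source dims into a static-only builder hands it the sentinel,
# which is exactly the condition the tensor.empty builder rejects:
has_dynamic = any(d == DYNAMIC for d in src_shape)
assert has_dynamic
```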

@hanhanW
Contributor

hanhanW commented Feb 6, 2024

To summarize the discussion from Discord:

We hit a runtime assertion because the frontend generates bad code; an example is below. All the inputs are constants (generated by the frontend), and then the assertion is hit. I will fix the compilation issue in the upstream repo.

    %cst_6 = arith.constant dense<-1> : tensor<i64>
    %cst_7 = arith.constant dense<1> : tensor<4xi64>
    %cst_8 = arith.constant dense<[1, 1, 5, 5]> : tensor<4xi64>
    %6 = linalg.generic {indexing_maps = [#map2, #map3, #map2], iterator_types = ["parallel"]} ins(%cst_7, %cst_6 : tensor<4xi64>, tensor<i64>) outs(%5 : tensor<4xi64>) {
    ^bb0(%in: i64, %in_413: i64, %out: i64):
      %1654 = arith.muli %in, %in_413 : i64
      linalg.yield %1654 : i64
    } -> tensor<4xi64>
    %7 = tensor.empty() : tensor<4xi1>
    %8 = linalg.generic {indexing_maps = [#map2, #map2, #map2], iterator_types = ["parallel"]} ins(%cst_8, %6 : tensor<4xi64>, tensor<4xi64>) outs(%7 : tensor<4xi1>) {
    ^bb0(%in: i64, %in_413: i64, %out: i1):
      %1654 = arith.cmpi eq, %in, %in_413 : i64
      linalg.yield %1654 : i1
    } -> tensor<4xi1>
    %9 = linalg.generic {indexing_maps = [#map2, #map2, #map2, #map2], iterator_types = ["parallel"]} ins(%8, %cst_7, %cst_8 : tensor<4xi1>, tensor<4xi64>, tensor<4xi64>)
    ^bb0(%in: i1, %in_413: i64, %in_414: i64, %out: i64):
      %1654 = arith.select %in, %in_413, %in_414 : i64
      linalg.yield %1654 : i64
    } -> tensor<4xi64>
    %extracted_slice = tensor.extract_slice %9[0] [1] [1] : tensor<4xi64> to tensor<1xi64>
    %extracted = tensor.extract %extracted_slice[%c0] : tensor<1xi64>
    %extracted_slice_23 = tensor.extract_slice %9[1] [1] [1] : tensor<4xi64> to tensor<1xi64>
    %extracted_24 = tensor.extract %extracted_slice_23[%c0] : tensor<1xi64>
    %13 = arith.cmpi slt, %extracted_24, %c0_i64 : i64
    %14 = arith.index_cast %extracted_24 : i64 to index
    %15 = arith.select %13, %c1, %14 : index
    %71 = arith.cmpi eq, %15, %c32 : index
    cf.assert %71, "mismatched size for broadcast"
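Constant-folding the IR above by hand shows why the runtime assert necessarily fires; a quick sketch in plain Python, mirroring each SSA value in the snippet:

```python
# Constant-fold the snippet above, one SSA value at a time.
cst_6 = -1                    # arith.constant dense<-1> : tensor<i64>
cst_7 = [1, 1, 1, 1]          # arith.constant dense<1> : tensor<4xi64>
cst_8 = [1, 1, 5, 5]          # arith.constant dense<[1, 1, 5, 5]> : tensor<4xi64>

v6 = [x * cst_6 for x in cst_7]                            # %6:  [-1, -1, -1, -1]
v8 = [a == b for a, b in zip(cst_8, v6)]                   # %8:  all False
v9 = [a if c else b for c, a, b in zip(v8, cst_7, cst_8)]  # %9:  [1, 1, 5, 5]

extracted_24 = v9[1]                                   # %extracted_24: 1
size = 1 if extracted_24 < 0 else extracted_24         # %15: 1
# %71 = arith.cmpi eq, %15, %c32  ->  (1 == 32) is false, so
# `cf.assert %71, "mismatched size for broadcast"` must fail.
assert size != 32
```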
