Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cce runtime error:errno=145 segmentation fault #9

Open
MingliSun opened this issue May 5, 2021 · 8 comments
Open

cce runtime error:errno=145 segmentation fault #9

MingliSun opened this issue May 5, 2021 · 8 comments

Comments

@MingliSun
Copy link

MingliSun commented May 5, 2021

Hi,everyone.Now I run my demo code in cce target,But an error occurred like this:

[ERROR] RUNTIME(4472)kernal task happen error, error code=0x26, [aicore exception].
[ERROR] RUNTIME(4472)aicore kernel execute failed, device_id=0, stream_id=1, task_id=0, fault kernel_name=myfunc_kernel0, func_name=myfunc_kernel0
[ERROR] AKG:2021-05-05-12:21:33.569.843 [cce_module.cc:232] [cce] Check failed: e == RT_ERROR_NONE: Cce runtime error: errno=145, info=Unknow cce error code
Stack trace:
  [bt] (0) /home/HwHiAiUser/akg/build/libakg.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x58) [0xfffef59e1cb4]
  [bt] (1) /home/HwHiAiUser/akg/build/libakg.so(air::runtime::CceWrappedFunc::operator()(air::runtime::TVMArgs, air::runtime::TVMRetValue*, void**, long*, unsigned long) const+0x610) [0xfffef6f061f4]
  [bt] (2) /home/HwHiAiUser/akg/build/libakg.so(air::runtime::detail::PackFuncVoidAddr_<4, air::runtime::CceWrappedFunc>(air::runtime::CceWrappedFunc, std::vector<air::runtime::detail::ArgConvertCode, std::allocator<air::runtime::detail::ArgConvertCode> > const&, int)::{lambda(air::runtime::TVMArgs, air::runtime::TVMRetValue*)#1}::operator()(air::runtime::TVMArgs, air::runtime::TVMRetValue*) const+0x2d0) [0xfffef6f0885c]
  [bt] (3) /home/HwHiAiUser/akg/build/libakg.so(std::_Function_handler<void (air::runtime::TVMArgs, air::runtime::TVMRetValue*), air::runtime::detail::PackFuncVoidAddr_<4, air::runtime::CceWrappedFunc>(air::runtime::CceWrappedFunc, std::vector<air::runtime::detail::ArgConvertCode, std::allocator<air::runtime::detail::ArgConvertCode> > const&, int)::{lambda(air::runtime::TVMArgs, air::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, air::runtime::TVMArgs&&, air::runtime::TVMRetValue*&&)+0x7c) [0xfffef6f0d8bc]
  [bt] (4) /home/HwHiAiUser/akg/build/libakg.so(std::function<void (air::runtime::TVMArgs, air::runtime::TVMRetValue*)>::operator()(air::runtime::TVMArgs, air::runtime::TVMRetValue*) const+0x78) [0xfffef5a2e6b8]
  [bt] (5) /home/HwHiAiUser/akg/build/libakg.so(air::runtime::PackedFunc::CallPacked(air::runtime::TVMArgs, air::runtime::TVMRetValue*) const+0x5c) [0xfffef5af9200]
  [bt] (6) /home/HwHiAiUser/akg/build/libakg.so(air::runtime::StackVM::Run(air::runtime::StackVM::State*) const+0x14d0) [0xfffef7727b30]
  [bt] (7) /home/HwHiAiUser/akg/build/libakg.so(air::runtime::StackVM::Run(air::runtime::TVMArgs const&, air::runtime::ModuleNode*) const+0x108) [0xfffef7726168]
  [bt] (8) /home/HwHiAiUser/akg/build/libakg.so(air::runtime::StackVMModuleNode::GetFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, air::runtime::ObjectPtr<air::runtime::Object> const&)::{lambda(air::runtime::TVMArgs, air::runtime::TVMRetValue*)#1}::operator()(air::runtime::TVMArgs, air::runtime::TVMRetValue*) const+0x48) [0xfffef772c1b4]

Traceback (most recent call last):

  File "vector_add.py", line 30, in <module>
    mod(a, b, c)

  File "/home/HwHiAiUser/akg/third_party/incubator-tvm/python/tvm/_ffi/function.py", line 144, in __call__
    return f(*args)

  File "/home/HwHiAiUser/akg/third_party/incubator-tvm/python/tvm/_ffi/_ctypes/function.py", line 207, in __call__
    raise get_last_ffi_error()

tvm._ffi.base.TVMError: Traceback (most recent call last):
  [bt] (8) /home/HwHiAiUser/akg/build/libakg.so(air::runtime::StackVMModuleNode::GetFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, air::runtime::ObjectPtr<air::runtime::Object> const&)::{lambda(air::runtime::TVMArgs, air::runtime::TVMRetValue*)#1}::operator()(air::runtime::TVMArgs, air::runtime::TVMRetValue*) const+0x48) [0xfffef772c1b4]
  [bt] (7) /home/HwHiAiUser/akg/build/libakg.so(air::runtime::StackVM::Run(air::runtime::TVMArgs const&, air::runtime::ModuleNode*) const+0x108) [0xfffef7726168]
  [bt] (6) /home/HwHiAiUser/akg/build/libakg.so(air::runtime::StackVM::Run(air::runtime::StackVM::State*) const+0x14d0) [0xfffef7727b30]
  [bt] (5) /home/HwHiAiUser/akg/build/libakg.so(air::runtime::PackedFunc::CallPacked(air::runtime::TVMArgs, air::runtime::TVMRetValue*) const+0x5c) [0xfffef5af9200]
  [bt] (4) /home/HwHiAiUser/akg/build/libakg.so(std::function<void (air::runtime::TVMArgs, air::runtime::TVMRetValue*)>::operator()(air::runtime::TVMArgs, air::runtime::TVMRetValue*) const+0x78) [0xfffef5a2e6b8]
  [bt] (3) /home/HwHiAiUser/akg/build/libakg.so(std::_Function_handler<void (air::runtime::TVMArgs, air::runtime::TVMRetValue*), air::runtime::detail::PackFuncVoidAddr_<4, air::runtime::CceWrappedFunc>(air::runtime::CceWrappedFunc, std::vector<air::runtime::detail::ArgConvertCode, std::allocator<air::runtime::detail::ArgConvertCode> > const&, int)::{lambda(air::runtime::TVMArgs, air::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, air::runtime::TVMArgs&&, air::runtime::TVMRetValue*&&)+0x7c) [0xfffef6f0d8bc]
  [bt] (2) /home/HwHiAiUser/akg/build/libakg.so(air::runtime::detail::PackFuncVoidAddr_<4, air::runtime::CceWrappedFunc>(air::runtime::CceWrappedFunc, std::vector<air::runtime::detail::ArgConvertCode, std::allocator<air::runtime::detail::ArgConvertCode> > const&, int)::{lambda(air::runtime::TVMArgs, air::runtime::TVMRetValue*)#1}::operator()(air::runtime::TVMArgs, air::runtime::TVMRetValue*) const+0x2d0) [0xfffef6f0885c]
  [bt] (1) /home/HwHiAiUser/akg/build/libakg.so(air::runtime::CceWrappedFunc::operator()(air::runtime::TVMArgs, air::runtime::TVMRetValue*, void**, long*, unsigned long) const+0x610) [0xfffef6f061f4]
  [bt] (0) /home/HwHiAiUser/akg/build/libakg.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x58) [0xfffef59e1cb4]
  File "/home/HwHiAiUser/akg/third_party/incubator-tvm/src/runtime/cce/cce_module.cc", line 232
TVMError: Check failed: e == RT_ERROR_NONE: Cce runtime error: errno=145, info=Unknow cce error code

Segmentation fault

Here is demo snippet:

import akg
from akg import tvm
import numpy as np

n = 5

a = tvm.placeholder([n], name='a')
b = tvm.placeholder([n], name='b')
c = tvm.compute([n], lambda i: a + b[n - i - 1])

s = tvm.create_schedule(c.op)

mod = akg.build(s, (a, b, c), 'cce', [], name='myfunc', attrs={}, polyhedral=True, binds=None)

print(mod.imported_modules[0].get_source())

a_np = np.random.random([n]).astype(a.dtype)
b_np = np.random.random([n]).astype(b.dtype)

print(a_np, b_np)
ctx = tvm.context('cce')
a = tvm.nd.array(a_np, ctx)
b = tvm.nd.array(b_np, ctx)
c = tvm.nd.array(np.zeros([n], dtype=a_np.dtype), ctx)
mod(a, b, c)
ctx.sync()  
print(c)

I'm confusing about that, it's a simple operator,it makes no sense akg cannot run it.
Thanks!

@SijiaYang
Copy link
Contributor

Hi @MingliSun , thanks for your issue. First of all, I cannot reproduce the error in your description: TVMError: Check failed: e == RT_ERROR_NONE: Cce runtime error: errno=145, info=Unknow cce error code.

Can you please check your compile option? I am using cmake .. and when I ran your script, I got this error

TVMError: Check failed: ObjectTypeChecker<TObjectRef>: :Check(ptr): Expected type List[Expr] but get Array. So I am trying to resolve this error.

The problem in your code is that you pass a into tvm.compute in this line

-> c = tvm.compute([n], lambda i: a + b[n - i - 1])

Instead, you can try to modify it to

c = tvm.compute([n], lambda i: a[i] + b[n - i - 1])

and the code runs normally. I use index i to get element in array a and I guess this is what you want to compute. If not, you can change it according to your compute logic and remember to retrieve elements in tvm.placeholder using index. And the output is as follows:

#ifdef __CCE_KT_TEST__
#define __aicore__ 
#else
#define __aicore__ [aicore]
#endif

extern "C"  __global__ __aicore__ void myfunc_kernel0(__gm__ float* __restrict__ a, __gm__ float* __restrict__ b, __gm__ float* __restrict__ compute) {
  __ubuf__  float* b_local_UB = (__ubuf__  float *)(0);
  __ubuf__  float* compute_0_local_UB = (__ubuf__  float *)(32);
  __ubuf__  float* compute_local_UB = (__ubuf__  float *)(32);
  __ubuf__  float* a_local_UB = (__ubuf__  float *)(64);
  copy_gm_to_ubuf(((__ubuf__ float *)b_local_UB + 0), ((__gm__ float *)b + 0), 0, 1, 1, 0, 0);
  set_flag(PIPE_MTE2, PIPE_S, EVENT_ID0);
  wait_flag(PIPE_MTE2, PIPE_S, EVENT_ID0);
  for (int32_t cc1 = 0; cc1 < 5; ++cc1) {
    compute_0_local_UB[cc1] = b_local_UB[(4 - cc1)];
  }
  set_flag(PIPE_S, PIPE_V, EVENT_ID0);
  copy_gm_to_ubuf(((__ubuf__ float *)a_local_UB + 0), ((__gm__ float *)a + 0), 0, 1, 1, 0, 0);
  set_flag(PIPE_MTE2, PIPE_V, EVENT_ID0);
  set_vector_mask((uint64_t)0llu, (uint64_t)255llu);
  wait_flag(PIPE_S, PIPE_V, EVENT_ID0);
  wait_flag(PIPE_MTE2, PIPE_V, EVENT_ID0);
  vadd(((__ubuf__ float *)compute_local_UB + 0), ((__ubuf__ float *)a_local_UB + 0), ((__ubuf__ float *)compute_0_local_UB + 0), 1, 1, 1, 1, 0, 0, 0);
  set_flag(PIPE_V, PIPE_MTE3, EVENT_ID0);
  wait_flag(PIPE_V, PIPE_MTE3, EVENT_ID0);
  copy_ubuf_to_gm(((__gm__ float *)compute + 0), ((__ubuf__ float *)compute_local_UB + 0), 0, 1, 1, 0, 0);
  pipe_barrier(PIPE_ALL);
  
  uint64_t *statusInScalarBuffer = (uint64_t *) 0x40000;
  uint64_t status0 = get_status();
  uint64_t overflowStatus = status0 & 0x00000000000005B8;
  if (overflowStatus > 0) {
    *statusInScalarBuffer = 1;
  }
}


[0.2562516  0.64928097 0.10287885 0.85447043 0.14970173] [0.6659899  0.09130476 0.17711475 0.17782658 0.22820798]
[0. 0. 0. 0. 0.]

Hope this is helpful.

@MingliSun
Copy link
Author

Thanks for your reply,I write c = tvm.compute([n], lambda i: a[i] + b[n - i - 1])in my computer,maybe it's a copy mistake or something.But I still got the same error code=145 after modift it as told both in my development board and huaweicloud 弹性云服务器. And I noticed your output c is all zero vector,but it should be the result of a[i]+b[n-i-1].
Here is my error message in huaweicloud:

[ERROR] RUNTIME(5611)kernal task happen error, error code=0x26, [aicore exception].
[ERROR] RUNTIME(5611)aicore kernel execute failed, device_id=0, stream_id=512, task_id=0, fault kernel_name=myfunc_kernel0, func_name=myfunc_kernel0
[ERROR] AKG:2021-05-08-11:30:38.538.890 [cce_device_api.cc:126] [cce] Check failed: e == RT_ERROR_NONE: Cce runtime error: errno=145, info=Unknow cce error code
Stack trace:
  [bt] (0) /root/sunmingli/akg/build/libakg.so(air::runtime::CceDeviceAPI::StreamSync(DLContext, void*)+0x108c) [0x7f26ddf3ccfc]
  [bt] (1) /root/sunmingli/akg/build/libakg.so(TVMSynchronize+0x5c) [0x7f26ddf3103c]
  [bt] (2) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f26e7893dae]
  [bt] (3) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x22f) [0x7f26e789371f]
  [bt] (4) /usr/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x2b4) [0x7f26e7aa75c4]
  [bt] (5) /usr/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(+0x11c33) [0x7f26e7aa7c33]
  [bt] (6) python3(_PyObject_FastCallKeywords+0x19c) [0x5a9dac]
  [bt] (7) python3() [0x50a433]
  [bt] (8) python3(_PyEval_EvalFrameDefault+0x444) [0x50beb4]

Traceback (most recent call last):

  File "demo/vector_add.py", line 34, in <module>
    ctx.sync()

  File "/root/sunmingli/akg/third_party/incubator-tvm/python/tvm/_ffi/runtime_ctypes.py", line 248, in sync
    check_call(_LIB.TVMSynchronize(self.device_type, self.device_id, None))

  File "/root/sunmingli/akg/third_party/incubator-tvm/python/tvm/_ffi/base.py", line 335, in check_call
    raise get_last_ffi_error()

tvm._ffi.base.TVMError: Traceback (most recent call last):
  [bt] (1) /root/sunmingli/akg/build/libakg.so(TVMSynchronize+0x5c) [0x7f26ddf3103c]
  [bt] (0) /root/sunmingli/akg/build/libakg.so(air::runtime::CceDeviceAPI::StreamSync(DLContext, void*)+0x108c) [0x7f26ddf3ccfc]
  File "/root/sunmingli/akg/third_party/incubator-tvm/src/runtime/cce/cce_device_api.cc", line 126
TVMError: Check failed: e == RT_ERROR_NONE: Cce runtime error: errno=145, info=Unknow cce error code

the difference is that development board aborted at mod(a,b,c) and huweicloud aborted at ctx.sysc()
I don't know what's wrong with it.

@anyrenwei
Copy link
Contributor

Ascend910 or?

@MingliSun
Copy link
Author

MingliSun commented May 9, 2021

@anyrenwei both my development board and huaweicloud are Ascend310

@ckeyever
Copy link

@MingliSun There is some difference of the instruction set between Ascend910 and Ascend310, and we now only support Ascend910.

@MingliSun
Copy link
Author

@ckeyever do all versions of akg only support Ascend910 not Ascend310 even the initial version? if so , I will change my target into Ascend910.Thanks for your reply.

@ckeyever
Copy link

ckeyever commented May 13, 2021

@ckeyever do all versions of akg only support Ascend910 not Ascend310 even the initial version? if so , I will change my target into Ascend910.Thanks for your reply.

Yes, Ascend310 is not support at the begining.

@ckeyever
Copy link

Issue will be closed if there is no other question.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants