Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Native API failed. Native API returns: -999 (Unknown PI error) -999 (Unknown PI error) #15203

Closed
MajidAbdelilah opened this issue Aug 27, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@MajidAbdelilah
Copy link

MajidAbdelilah commented Aug 27, 2024

Describe the bug

the program failed with this output
terminate called after throwing an instance of 'sycl::_V1::runtime_error'
what(): Native API failed. Native API returns: -999 (Unknown PI error) -999 (Unknown PI error)
i have an intel core I7 1255U and no dedicated graphics card the bug happen when running on the intel integrated gpu

To reproduce

  1. #include <sycl/detail/helpers.hpp>
    #include <sycl/sycl.hpp>
    #include <math.h>
    #include

int main()
{

float m1[4][4] = {{2, 2, 9, 1}, {8,3,2,4}, {2,5,6,7}, {3,5,6,4}};
float m2[4][4] = {{2, 2, 91, 1}, {8,3,21,4}, {2,3,6,0}, {31,5,6,4}};
sycl::queue q{sycl::gpu_selector_v};

float **device_arr = (float **)malloc(sizeof(float *) * 4);
float **device_m1 = (float **)malloc(sizeof(float *) * 4);
float **device_m2 = (float **)malloc(sizeof(float *) * 4);

for(int i = 0; i < 4; i++)
{
    device_arr[i] = sycl::malloc_device<float>(4, q);
    device_m1[i] = sycl::malloc_device<float>(4, q);
    q.memcpy(device_m1[i], m1[i], sizeof(float) * 4);
    device_m2[i] = sycl::malloc_device<float>(4, q);
    q.memcpy(device_m2[i], m2[i], sizeof(float) * 4);
}
q.wait();

const long long work_size_x = 4;
const long long work_size_y = 4;
const unsigned long work_size = std::pow(2, 32);
const unsigned long workers = work_size/4096;

    for(unsigned long a = 0; a < work_size; a+=workers)
    {
        if((a+workers) > work_size)
        {
            q.parallel_for(sycl::range{work_size - a}, [=](sycl::item<1> it){
                for (int i = 0; i < 4; i++) {
                    for (int j = 0; j < 4; j++) {
                        device_arr[i][j] = 0;
                        for (int k = 0; k < 4; k++) {
                            device_arr[i][j] += device_m1[i][k] * device_m2[k][j];
                        }
                    }
                }
            });
        }else
        {
            q.parallel_for(sycl::range{workers}, [=](sycl::item<1> it){
                for (int i = 0; i < 4; i++) {
                    for (int j = 0; j < 4; j++) {
                        device_arr[i][j] = 0;
                        for (int k = 0; k < 4; k++) {
                            device_arr[i][j] += device_m1[i][k] * device_m2[k][j];
                        }
                    }
                }
            });
        }
    }
clock_t begin_sycl = clock();
q.wait();
clock_t end_sycl = clock();
double time_spent_sycl = (double)(end_sycl - begin_sycl) / CLOCKS_PER_SEC / 12;
std::cout << "Time spent on SYCL: " << time_spent_sycl << std::endl;
for(int i = 0; i < 4; i++)
{
    sycl::free(device_arr[i], q);
    sycl::free(device_m1[i], q);
    sycl::free(device_m2[i], q);
}

}

  1. dpcpp -xhost c++\ vs\ sycl/sycl.cpp -O3
  2. ./a.out
  3. expected to compute the matrix multiply but it doesnt it gives an error

Environment

  • OS: arch linux
  • target: intel iris xe 96EU gpu
  • DPC++ version: Intel(R) oneAPI DPC++/C++ Compiler 2024.1.0 (2024.1.0.20240308)
    Target: x86_64-unknown-linux-gnu
    Thread model: posix
    InstalledDir: /opt/intel/oneapi/compiler/2024.1/bin/compiler
    Configuration file: /opt/intel/oneapi/compiler/2024.1/bin/compiler/../icpx.cfg
  • Dependencies version: ```[abdelilah@archlinux learn SYCL]$ sycl-ls --verbose
    [opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2024.17.3.0.08_160000]
    [opencl:cpu:1] Intel(R) OpenCL, 12th Gen Intel(R) Core(TM) i7-1255U OpenCL 3.0 (Build 0) [2024.17.3.0.08_160000]
    [opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Iris(R) Xe Graphics OpenCL 3.0 NEO [24.26.30049]
    [ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Iris(R) Xe Graphics 1.3 [1.3.30049]

Platforms: 4
Platform [#1]:
Version : OpenCL 1.2 Intel(R) FPGA SDK for OpenCL(TM), Version 20.3
Name : Intel(R) FPGA Emulation Platform for OpenCL(TM)
Vendor : Intel(R) Corporation
Devices : 1
Device [#0]:
Type : acc
Version : OpenCL 1.2
Name : Intel(R) FPGA Emulation Device
Vendor : Intel(R) Corporation
Driver : 2024.17.3.0.08_160000
Aspects : accelerator fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations usm_atomic_host_allocations usm_atomic_shared_allocations ext_oneapi_srgb ext_oneapi_non_uniform_groups
info::device::sub_group_sizes: 4 8 16 32 64
Platform [#2]:
Version : OpenCL 3.0 LINUX
Name : Intel(R) OpenCL
Vendor : Intel(R) Corporation
Devices : 1
Device [#1]:
Type : cpu
Version : OpenCL 3.0 (Build 0)
Name : 12th Gen Intel(R) Core(TM) i7-1255U
Vendor : Intel(R) Corporation
Driver : 2024.17.3.0.08_160000
Aspects : cpu fp16 fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations usm_system_allocations usm_atomic_host_allocations usm_atomic_shared_allocations atomic64 ext_oneapi_srgb ext_oneapi_native_assert ext_intel_legacy_image ext_oneapi_non_uniform_groups
info::device::sub_group_sizes: 4 8 16 32 64
Platform [#3]:
Version : OpenCL 3.0
Name : Intel(R) OpenCL Graphics
Vendor : Intel(R) Corporation
Devices : 1
Device [#2]:
Type : gpu
Version : OpenCL 3.0 NEO
Name : Intel(R) Iris(R) Xe Graphics
Vendor : Intel(R) Corporation
Driver : 24.26.30049
Aspects : gpu fp16 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations atomic64 ext_oneapi_srgb ext_intel_device_id ext_intel_legacy_image ext_intel_esimd ext_oneapi_non_uniform_groups
info::device::sub_group_sizes: 8 16 32
Platform [#4]:
Version : 1.3
Name : Intel(R) Level-Zero
Vendor : Intel(R) Corporation
Devices : 1
Device [#0]:
Type : gpu
Version : 1.3
Name : Intel(R) Iris(R) Xe Graphics
Vendor : Intel(R) Corporation
Driver : 1.3.30049
Aspects : gpu fp16 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations ext_intel_pci_address ext_intel_gpu_eu_count ext_intel_gpu_eu_simd_width ext_intel_gpu_slices ext_intel_gpu_subslices_per_slice ext_intel_gpu_eu_count_per_subslice atomic64 ext_intel_device_info_uuid ext_intel_gpu_hw_threads_per_eu ext_intel_device_id ext_intel_memory_clock_rate ext_intel_memory_bus_width ext_intel_legacy_image ext_intel_esimd ext_oneapi_non_uniform_groups
info::device::sub_group_sizes: 8 16 32
default_selector() : gpu, Intel(R) Level-Zero, Intel(R) Iris(R) Xe Graphics 1.3 [1.3.30049]
accelerator_selector() : acc, Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2024.17.3.0.08_160000]
cpu_selector() : cpu, Intel(R) OpenCL, 12th Gen Intel(R) Core(TM) i7-1255U OpenCL 3.0 (Build 0) [2024.17.3.0.08_160000]
gpu_selector() : gpu, Intel(R) Level-Zero, Intel(R) Iris(R) Xe Graphics 1.3 [1.3.30049]
custom_selector(gpu) : gpu, Intel(R) Level-Zero, Intel(R) Iris(R) Xe Graphics 1.3 [1.3.30049]
custom_selector(cpu) : cpu, Intel(R) OpenCL, 12th Gen Intel(R) Core(TM) i7-1255U OpenCL 3.0 (Build 0) [2024.17.3.0.08_160000]
custom_selector(acc) : acc, Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2024.17.3.0.08_160000]
[abdelilah@archlinux learn SYCL]$```

Additional context

No response

@MajidAbdelilah MajidAbdelilah added the bug Something isn't working label Aug 27, 2024
@KornevNikita
Copy link
Contributor

Hi @MajidAbdelilah! What about other devices? Like cpu or opencl:gpu (I guess you're running it on level_zero:gpu, right?). Could you also attach the output of SYCL_PI_TRACE=-1 ./a.out.

@bader
Copy link
Contributor

bader commented Sep 24, 2024

SYCL_PI_TRACE -> SYCL_UR_TRACE.
@KornevNikita, please, update your DPC++ compiler runtime. :)

@MajidAbdelilah
Copy link
Author

MajidAbdelilah commented Sep 24, 2024

when running on the cpu using cpu_selector_v it is working without runtime errors
for the SYCL_UR_TRACE=-1

[abdelilah@archlinux aur]$ SYCL_UR_TRACE=-1 ./a.out 
ZE_LOADER_DEBUG_TRACE:Using Loader Library Path: 
ZE_LOADER_DEBUG_TRACE:Tracing Layer Library Path: libze_tracing_layer.so.1
terminate called after throwing an instance of 'sycl::_V1::runtime_error'
  what():  Native API failed. Native API returns: -999 (Unknown PI error) -999 (Unknown PI error)
Aborted (core dumped)
[abdelilah@archlinux aur]$

for SYCL_PI_TRACE=-1 ./a.out i have attached the stdout, stderr, and allstd.
allout.txt
stderr.txt
stdout.txt

@KornevNikita
Copy link
Contributor

KornevNikita commented Sep 25, 2024

SYCL_PI_TRACE -> SYCL_UR_TRACE. @KornevNikita, please, update your DPC++ compiler runtime. :)

Yes, but it looks like @MajidAbdelilah is using 2024.1.0, so I guess there should still be the PI.

I tried this with latest intel/llvm. It passes on opencl:cpu, but fails on both level-zero and opencl gpu with:

terminate called after throwing an instance of 'sycl::_V1::exception'
  what():  UR backend failed. UR backend returns:40 (UR_RESULT_ERROR_OUT_OF_RESOURCES)

Adding "confirmed" label for runtime team to take a look.

FYI - this code looks very complicated, although it can be made much simpler like:

#include <sycl/sycl.hpp>

#include <iostream>
#include <cmath>

int main() {
    float m1[16] = {2, 2, 9, 1, 8, 3, 2, 4, 2, 5, 6, 7, 3, 5, 6, 4};
    float m2[16] = {2, 2, 91, 1, 8, 3, 21, 4, 2, 5, 6, 0, 31, 5, 6, 4};

    sycl::queue q;

    float *device_arr = sycl::malloc_device<float>(4 * 4, q);
    float *device_m1 = sycl::malloc_device<float>(4 * 4, q);
    float *device_m2 = sycl::malloc_device<float>(4 * 4, q);

    q.memcpy(device_m1, m1, sizeof(float) * 4 * 4);
    q.memcpy(device_m2, m2, sizeof(float) * 4 * 4);
    q.wait();

    const unsigned long work_size = 4 * 4; 

    q.parallel_for(sycl::range<1>(work_size), [=](sycl::id<1> idx) {
        device_arr[idx] = 0;
        int i = idx / 4;
        int j = idx % 4;
        for (int k = 0; k < 4; k++) {
            device_arr[idx] += device_m1[i * 4 + k] * device_m2[k * 4 + j];
        }
    }).wait();

    clock_t begin_sycl = clock();
    clock_t end_sycl = clock();
    double time_spent_sycl = (double)(end_sycl - begin_sycl) / CLOCKS_PER_SEC / 12;
    std::cout << "Time spent on SYCL: " << time_spent_sycl << std::endl;

    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++)
          std::cout << device_arr[i * 4 + j] << " ";
        std::cout << std::endl;
    }

    sycl::free(device_arr, q);
    sycl::free(device_m1, q);
    sycl::free(device_m2, q);
}

@aelovikov-intel
Copy link
Contributor

aelovikov-intel commented Sep 26, 2024

@KornevNikita , your reproducer has a bug when printing the result on the host side - device_arr is a Device USM allocation and can't be accessed from the host.

@MajidAbdelilah your code has undefined behavior (UB) in it. float **device_m1 = (float **)malloc(sizeof(float *) * 4); allocated memory that isn't accessible from the GPU. Each memory access in device_arr[i][j] += device_m1[i][k] * device_m2[k][j]; in the device code is two dereferences, not one. First dereference is UB because that's not the USM memory.

@aelovikov-intel aelovikov-intel closed this as not planned Won't fix, can't repro, duplicate, stale Sep 26, 2024
@KornevNikita
Copy link
Contributor

@KornevNikita , your reproducer has a bug when printing the result on the host side - device_arr is a Device USM allocation and can't be accessed from the host.

Oops, forgot to delete this block. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants