Tool hangs with Fortran offloaded to the GPU with XL #76

Open
ilagunap opened this issue Jan 18, 2022 · 3 comments
@ilagunap

One of the provided tools (record_reg_vals) hangs with Fortran code offloaded to the GPU via OpenMP when compiled with the XL compiler. Interestingly, the behavior is correct when the code is compiled with nvfortran. This is an IBM PPC platform with V100 GPUs.

I suspect this is a problem with XL generating code that is incompatible with NVBit, and I'd like to report it to IBM, but I would appreciate your help in digging into the issue.

Below is a simple Fortran program with an OpenMP parallel loop that is offloaded to the GPU. When I compile it with nvfortran and run the record_reg_vals tool, it works correctly. When I use the XL compiler, the program hangs.

The program is:

program main
  implicit none

  integer, parameter :: sz = 32768
  integer, dimension(sz) :: arr
  integer :: i

  do i = 1, sz
    arr(i) = 42
  end do

  !$omp target teams distribute parallel do map(tofrom: arr(1:sz))
  do i = 1, sz
    arr(i) = arr(i) + 1
  end do

  print *, "After the target region is executed, arr(1) = ", arr(1)

end program main

This is how it is compiled:

$ xlf90-gpu -qoffload -qsmp -g -o main main.f90

I profiled it with nvprof to make sure the kernel is executed:

$ ./main 
 After the target region is executed, arr(1) =  43
$ nvprof ./main 
==56003== NVPROF is profiling process 56003, command: ./main
 After the target region is executed, arr(1) =  43
==56003== Profiling application: ./main
==56003== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   38.18%  7.1360us         2  3.5680us  1.8560us  5.2800us  [CUDA memcpy DtoH]
                   36.13%  6.7520us         1  6.7520us  6.7520us  6.7520us  [CUDA memcpy HtoD]
                   25.68%  4.8000us         1  4.8000us  4.8000us  4.8000us  __xl_main_l12_OL_1
      API calls:   49.15%  222.83ms         1  222.83ms  222.83ms  222.83ms  cuModuleLoadDataEx
                   41.59%  188.53ms         1  188.53ms  188.53ms  188.53ms  cuCtxCreate

This is the output of the tool:

$ LD_PRELOAD=/usr/workspace/nvbit_release/tools/record_reg_vals/record_reg_vals.so ./main
------------- NVBit (NVidia Binary Instrumentation Tool v1.5.4) Loaded --------------
NVBit core environment variables (mostly for nvbit-devs):
            NVDISASM = nvdisasm - override default nvdisasm found in PATH
            NOBANNER = 0 - if set, does not print this banner
---------------------------------------------------------------------------------
         INSTR_BEGIN = 0 - Beginning of the instruction interval where to apply instrumentation
           INSTR_END = 4294967295 - End of the instruction interval where to apply instrumentation
        TOOL_VERBOSE = 0 - Enable verbosity inside the tool
----------------------------------------------------------------------------------------------------
...
...
CTA 162,0,0 - warp 11 - IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;:
* Reg0_T0: 0x00000000 Reg0_T1: 0x00000000 Reg0_T2: 0x00000000 Reg0_T3: 0x00000000 Reg0_T4: 0x00000000 Reg0_T5: 0x00000000 Reg0_T6: 0x00000000 Reg0_T7: 0x00000000 Reg0_T8: 0x00000000 Reg0_T9: 0x00000000 Reg0_T10: 0x00000000 Reg0_T11: 0x00000000 Reg0_T12: 0x00000000 Reg0_T13: 0x00000000 Reg0_T14: 0x00000000 Reg0_T15: 0x00000000 Reg0_T16: 0x00000000 Reg0_T17: 0x00000000 Reg0_T18: 0x00000000 Reg0_T19: 0x00000000 Reg0_T20: 0x00000000 Reg0_T21: 0x00000000 Reg0_T22: 0x00000000 Reg0_T23: 0x00000000 Reg0_T24: 0x00000000 Reg0_T25: 0x00000000 Reg0_T26: 0x00000000 Reg0_T27: 0x00000000 Reg0_T28: 0x00000000 Reg0_T29: 0x00000000 Reg0_T30: 0x00000000 Reg0_T31: 0x00000000 
* Reg1_T0: 0x00000000 Reg1_T1: 0x00000000 Reg1_T2: 0x00000000 Reg1_T3: 0x00000000 Reg1_T4: 0x00000000 Reg1_T5: 0x00000000 Reg1_T6: 0x00000000 Reg1_T7: 0x00000000 Reg1_T8: 0x00000000 Reg1_T9: 0x00000000 Reg1_T10: 0x00000000 Reg1_T11: 0x00000000 Reg1_T12: 0x00000000 Reg1_T13: 0x00000000 Reg1_T14: 0x00000000 Reg1_T15: 0x00000000 Reg1_T16: 0x00000000 Reg1_T17: 0x00000000 Reg1_T18: 0x00000000 Reg1_T19: 0x00000000 Reg1_T20: 0x00000000 Reg1_T21: 0x00000000 Reg1_T22: 0x00000000 Reg1_T23: 0x00000000 Reg1_T24: 0x00000000 Reg1_T25: 0x00000000 Reg1_T26: 0x00000000 Reg1_T27: 0x00000000 Reg1_T28: 0x00000000 Reg1_T29: 0x00000000 Reg1_T30: 0x00000000 Reg1_T31: 0x00000000 
* Reg2_T0: 0x00000000 Reg2_T1: 0x00000000 Reg2_T2: 0x00000000 Reg2_T3: 0x00000000 Reg2_T4: 0x00000000 Reg2_T5: 0x00000000 Reg2_T6: 0x00000000 Reg2_T7: 0x00000000 Reg2_T8: 0x00000000 Reg2_T9: 0x00000000 Reg2_T10: 0x00000000 Reg2_T11: 0x00000000 Reg2_T12: 0x00000000 Reg2_T13: 0x00000000 Reg2_T14: 0x00000000 Reg2_T15: 0x00000000 Reg2_T16: 0x00000000 Reg2_T17: 0x00000000 Reg2_T18: 0x00000000 Reg2_T19: 0x00000000 Reg2_T20: 0x00000000 Reg2_T21: 0x00000000 Reg2_T22: 0x00000000 Reg2_T23: 0x00000000 Reg2_T24: 0x00000000 Reg2_T25: 0x00000000 Reg2_T26: 0x00000000 Reg2_T27: 0x00000000 Reg2_T28: 0x00000000 Reg2_T29: 0x00000000 Reg2_T30: 0x00000000 Reg2_T31: 0x00000000

[hangs here...]

If it helps, I noticed that the void nvbit_at_ctx_init(CUcontext ctx) function is called at least twice with XL (I added a printf statement to confirm), but it is called only once with nvfortran.

As I said, when I compile with nvfortran the program terminates correctly and I can use the tool:

nvfortran -O0 -o main main.f90 -mp=gpu -g

Here are my system specs:

$ gcc --version
gcc (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3)

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Fri_Dec_17_18:16:21_PST_2021
Cuda compilation tools, release 11.6, V11.6.55
Build cuda_11.6.r11.6/compiler.30794723_0

$ xlf90-gpu -qversion
IBM XL Fortran for Linux, V16.1.1 (5725-C75, 5765-J15)
Version: 16.01.0001.0011

System: ppc64le GNU/Linux

Thank you!!

@ovilla
Collaborator

ovilla commented Jan 18, 2022

After thinking about this a bit more I have decided to delete my previous comment, as it could be totally wrong and misleading.
At this point I am not sure what is going on and more analysis is needed.
We will keep thinking about it.
Thanks for reporting.

@ilagunap
Author

I believe you were right. It seems the issue is related to the channel functionality, and there is a deadlock. The reason I say this is that I removed the two channels (host and device) and the related code from the tool, and the hang disappears. Also, this is the only tool that hangs.

Question: how can I get line information (like the one provided by the nvbit_get_line_info function) from an injected function? If there is a way to get debug info from an injected function, I don’t need the channel.

I see the injected function for this tool is the following – is there a way to pass debug info (perhaps a string) to it, or to get it from this device function (perhaps using the opcode)?

extern "C" __device__ __noinline__ void record_reg_val(int pred, int opcode_id,
                                                       uint64_t pchannel_dev,
                                                       int32_t num_regs...) {

@ovilla
Collaborator

ovilla commented Jan 19, 2022

What is a bit surprising is that the issue seems related to which compiler is used for the target application, not to how the NVBit tool itself is compiled.

Initially I thought you were using xlf90 to compile the NVBit tool, and I assumed that caused a problem in the channel; but if xlf90 is only used for the application, that should not create any problem.

Maybe the way xlf90 initializes the GPU when using OpenMP interferes with the channel code, but this is all guesswork at this point, since I have never seen this problem before. I will try to reproduce it on my side, but the chances are small.

Regarding passing something to the injection function: you can pass immediates or pointers (cast to uint64_t, like pchannel_dev above). So if you want to pass a string, I would suggest allocating it in GPU device memory during the instrumentation phase and passing a pointer to it as an argument of the instrumentation function.
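A minimal sketch of that idea, assuming the tool's usual instrumentation loop: look up the line info on the host with nvbit_get_line_info(), copy a "file:line" string into device memory once per instruction, and pass the device pointer as a 64-bit immediate so the injected function can print it directly, with no channel involved. The function name record_reg_val_with_info and the overall structure are illustrative, not the actual tool code; the NVBit calls are from the public nvbit.h API, but treat exact signatures as something to double-check against your NVBit version.

```cuda
// Sketch only (untested): host-side instrumentation plus a device-side
// injected function that prints the source location it was given.
#include <cstdint>
#include <cstdio>
#include <string>
#include "nvbit.h"  // NVBit tool API (nvbit_get_instrs, nvbit_insert_call, ...)

void instrument_function(CUcontext ctx, CUfunction func) {
    for (auto instr : nvbit_get_instrs(ctx, func)) {
        char *file_name = nullptr, *dir_name = nullptr;
        uint32_t line = 0;
        // Requires the app to be compiled with line info (-g / -lineinfo).
        nvbit_get_line_info(ctx, func, instr->getOffset(),
                            &file_name, &dir_name, &line);

        // Build "file:line" on the host and copy it to device memory once.
        std::string info = std::string(file_name ? file_name : "?") + ":" +
                           std::to_string(line);
        char *d_info = nullptr;
        cudaMalloc(&d_info, info.size() + 1);
        cudaMemcpy(d_info, info.c_str(), info.size() + 1,
                   cudaMemcpyHostToDevice);

        nvbit_insert_call(instr, "record_reg_val_with_info", IPOINT_BEFORE);
        nvbit_add_call_arg_guard_pred_val(instr);
        // The device string pointer travels as a 64-bit immediate.
        nvbit_add_call_arg_const_val64(instr, (uint64_t)d_info);
    }
}

// Injected function: no channel needed, the location string is already
// resident in device memory.
extern "C" __device__ __noinline__ void record_reg_val_with_info(
        int pred, uint64_t pinfo) {
    if (!pred) return;
    printf("at %s\n", (const char *)pinfo);
}
```

One caveat with this approach: the per-instruction cudaMalloc allocations are never freed here, so a real tool would want to cache strings per (function, offset) and release them at context teardown.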
