
[BugReport] python_native fails to compile #1192

Closed
society-research opened this issue Mar 29, 2024 · 9 comments
Dear FLAMEGPU2 devs, I ran into a pretty straightforward issue, so it's likely I did some setup wrong. If another way to get in touch is preferred over a bug report, please let me know!

How to reproduce:

  1. Build FLAMEGPU2, with FLAMEGPU_BUILD_PYTHON=ON, FLAMEGPU_VISUALISATION=ON, CMAKE_BUILD_TYPE=Release.

  2. Run

(venv) ➜  build git:(master) ✗ python ../examples/python_native/boids_spatial3D_wrapped/boids_spatial3D.py
Traceback (most recent call last):
  File "/home/ubuntu/model-socix-py/third_party/FLAMEGPU2/build/../examples/python_native/boids_spatial3D_wrapped/boids_spatial3D.py", line 389, in <module>
    cudaSimulation.initialise(sys.argv)
  File "/home/ubuntu/model-socix-py/third_party/FLAMEGPU2/build/venv/lib/python3.10/site-packages/pyflamegpu/pyflamegpu.py", line 9089, in initialise
    return _pyflamegpu.Simulation_initialise(self, argc)
pyflamegpu.pyflamegpu.FLAMEGPURuntimeException: (InvalidAgentFunc) /home/ubuntu/FLAMEGPU2-model-template-python/third_party/FLAMEGPU2/src/flamegpu/detail/JitifyCache.cu(422): Error compiling runtime agent function (or function condition) (
'outputdata'): function had compilation errors (see std::cout), in JitifyCache::buildProgram().

System Information: Ubuntu 22.04.4 LTS (GNU/Linux 6.5.0-26-generic x86_64), NVIDIA RTX A4000.

(venv) ➜  build git:(master) ✗ nvidia-smi
Fri Mar 29 13:36:47 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A4000               On  | 00000000:00:05.0 Off |                  Off |
| 41%   33C    P8              15W / 140W |      1MiB / 16376MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
(venv) ➜  build git:(master) ✗ dpkg -l|grep cuda
ii  cuda-cccl-12-2                       12.2.140-1                              amd64        CUDA CCCL
ii  cuda-command-line-tools-12-2         12.2.2-1                                amd64        CUDA command-line tools
ii  cuda-compiler-12-2                   12.2.2-1                                amd64        CUDA compiler
ii  cuda-crt-12-2                        12.2.140-1                              amd64        CUDA crt
ii  cuda-cudart-12-2                     12.2.140-1                              amd64        CUDA Runtime native Libraries
ii  cuda-cudart-dev-12-2                 12.2.140-1                              amd64        CUDA Runtime native dev links, headers
ii  cuda-cuobjdump-12-2                  12.2.140-1                              amd64        CUDA cuobjdump
ii  cuda-cupti-12-2                      12.2.142-1                              amd64        CUDA profiling tools runtime libs.
ii  cuda-cupti-dev-12-2                  12.2.142-1                              amd64        CUDA profiling tools interface.
ii  cuda-cuxxfilt-12-2                   12.2.140-1                              amd64        CUDA cuxxfilt
ii  cuda-documentation-12-2              12.2.140-1                              amd64        CUDA documentation
ii  cuda-driver-dev-12-2                 12.2.140-1                              amd64        CUDA Driver native dev stub library
ii  cuda-gdb-12-2                        12.2.140-1                              amd64        CUDA-GDB
ii  cuda-keyring                         1.1-1                                   all          GPG keyring for the CUDA repository
ii  cuda-libraries-12-2                  12.2.2-1                                amd64        CUDA Libraries 12.2 meta-package
ii  cuda-libraries-dev-12-2              12.2.2-1                                amd64        CUDA Libraries 12.2 development meta-package
ii  cuda-nsight-12-2                     12.2.144-1                              amd64        CUDA nsight
ii  cuda-nsight-compute-12-2             12.2.2-1                                amd64        NVIDIA Nsight Compute
ii  cuda-nsight-systems-12-2             12.2.2-1                                amd64        NVIDIA Nsight Systems
ii  cuda-nvcc-12-2                       12.2.140-1                              amd64        CUDA nvcc
ii  cuda-nvdisasm-12-2                   12.2.140-1                              amd64        CUDA disassembler
ii  cuda-nvml-dev-12-2                   12.2.140-1                              amd64        NVML native dev links, headers
ii  cuda-nvprof-12-2                     12.2.142-1                              amd64        CUDA Profiler tools
ii  cuda-nvprune-12-2                    12.2.140-1                              amd64        CUDA nvprune
ii  cuda-nvrtc-12-2                      12.2.140-1                              amd64        NVRTC native runtime libraries
ii  cuda-nvrtc-dev-12-2                  12.2.140-1                              amd64        NVRTC native dev links, headers
ii  cuda-nvtx-12-2                       12.2.140-1                              amd64        NVIDIA Tools Extension
ii  cuda-nvvm-12-2                       12.2.140-1                              amd64        CUDA nvvm
ii  cuda-nvvp-12-2                       12.2.142-1                              amd64        CUDA Profiler tools
ii  cuda-opencl-12-2                     12.2.140-1                              amd64        CUDA OpenCL native Libraries
ii  cuda-opencl-dev-12-2                 12.2.140-1                              amd64        CUDA OpenCL native dev links, headers
ii  cuda-profiler-api-12-2               12.2.140-1                              amd64        CUDA Profiler API
ii  cuda-sanitizer-12-2                  12.2.140-1                              amd64        CUDA Sanitizer
ii  cuda-toolkit-12-2                    12.2.2-1                                amd64        CUDA Toolkit 12.2 meta-package
ii  cuda-toolkit-12-2-config-common      12.2.140-1                              all          Common config package for CUDA Toolkit 12.2.
ii  cuda-toolkit-12-config-common        12.4.99-1                               all          Common config package for CUDA Toolkit 12.
ii  cuda-toolkit-config-common           12.4.99-1                               all          Common config package for CUDA Toolkit.
ii  cuda-tools-12-2                      12.2.2-1                                amd64        CUDA Tools meta-package
ii  cuda-visual-tools-12-2               12.2.2-1                                amd64        CUDA visual tools
Robadob (Member) commented Mar 29, 2024

pyflamegpu.pyflamegpu.FLAMEGPURuntimeException: (InvalidAgentFunc) /home/ubuntu/FLAMEGPU2-model-template-python/third_party/FLAMEGPU2/src/flamegpu/detail/JitifyCache.cu(422): Error compiling runtime agent function (or function condition) (
'outputdata'): function had compilation errors (see std::cout), in JitifyCache::buildProgram().

This is an error trying to compile the agent function outputdata at runtime.

I'm assuming you've shared stderr; do you have stdout in a separate output file? (There is/was an issue where Google Colab would eat the runtime compilation error messages, but I'm not aware of it occurring outside of Colab.)

Likewise, are you able to share the agent function? This would help me identify what your compilation error could be.

Robadob (Member) commented Mar 29, 2024

Ah sorry, just spotted this is with one of the examples. Give me an hour to look into it.

Robadob (Member) commented Mar 29, 2024

I've just built a clean copy of pyflamegpu from FLAMEGPU2's master branch.

When running boids_spatial3D.py, the runtime compilation of the model that is failing for you succeeds for me.

Are you able to share:

  • Which version of flamegpu you're using? E.g. are you pulling a release tag rather than HEAD of master?
  • The stdout that includes the runtime compilation error output by Jitify?

It did fail to run under a debug build, however.

(venv) C:\Users\Robadob\fgpu2\examples\python_native\boids_spatial3D_wrapped>python boids_spatial3D.py
instanced_default_Tcolor_Tpos_Tdir_Tscale-material_flat_Tcolor: Generic vertex attrib named: _normal2 was not found.
instanced_default_Tcolor_Tpos_Tdir_Tscale-material_flat_Tcolor: Generic vertex attrib named: _normal2 was not found.
Device function 'inputdata' reported 40000 errors.
First error:
flamegpu/runtime/messaging/MessageSpatial3D/MessageSpatial3DDevice.cuh(545)[14,0,0][0,0,0]:
Spatial messaging radius (0.05) is not a factor of environment dimensions (1, 1, 1), this is unsupported for the wrapped iterator, MessageSpatial3D::In::wrap().

Traceback (most recent call last):
  File "C:\Users\Robadob\fgpu2\examples\python_native\boids_spatial3D_wrapped\boids_spatial3D.py", line 430, in <module>
    cudaSimulation.simulate()
  File "C:\Users\Robadob\fgpu2\build\lib\Debug\python\venv\lib\site-packages\pyflamegpu\pyflamegpu.py", line 9255, in simulate
    return _pyflamegpu.CUDASimulation_simulate(self, *args)
pyflamegpu.pyflamegpu.FLAMEGPURuntimeException: (DeviceError) Device function 'inputdata' reported 40000 errors.
First error:
flamegpu/runtime/messaging/MessageSpatial3D/MessageSpatial3DDevice.cuh(545)[14,0,0][0,0,0]:
Spatial messaging radius (0.05) is not a factor of environment dimensions (1, 1, 1), this is unsupported for the wrapped iterator, MessageSpatial3D::In::wrap().

But I think I must have broken this with PR #1160, so that's a separate issue. For the second time today I'll defer to @ptheywood; I know he was looking at moving this check outside of device code (#1182).

society-research (Author) commented Mar 30, 2024

Are you able to share:

  • Which version of flamegpu you're using? E.g. are you pulling a release tag rather than HEAD of master?

I'm using the master branch; I cloned the repository just the day before yesterday.

  • The stdout that includes the runtime compilation error output by Jitify?

I have no idea where stdout is going. I'm not redirecting anything in my shell. Is stdout redirected by default somewhere? If so where?

I'd love to help! Could you let me know how to get the debug output that is present in your call to boids_spatial3D.py? Even with a CMAKE_BUILD_TYPE=Debug build I don't get that debug information that is printed to your terminal. Do you set any environment variable? How can I trace the compilation error to a line in the python code?

Robadob (Member) commented Mar 30, 2024

I've now built it with CUDA 12.2 on Linux (no visualisation though, I've only got access to headless Linux boxes). Same issue post-runtime compilation that I was getting on Windows (with Visualisation) yesterday (already known #1177, with a few suitable workarounds).

(py311) rob@mavericks:~/fgpu2/build/lib/Release/python/venv/bin$ source activate
(venv) (py311) rob@mavericks:~/fgpu2/build/lib/Release/python/venv/bin$ cd ../../../..
(venv) (py311) rob@mavericks:~/fgpu2/build/lib$ cd ../..
(venv) (py311) rob@mavericks:~/fgpu2$ cd examples/python_native/boids_spatial3D_wrapped/
(venv) (py311) rob@mavericks:~/fgpu2/examples/python_native/boids_spatial3D_wrapped$ python boids_spatial3D.py
Traceback (most recent call last):
  File "/home/rob/fgpu2/examples/python_native/boids_spatial3D_wrapped/boids_spatial3D.py", line 430, in <module>
    cudaSimulation.simulate()
  File "/home/rob/fgpu2/build/lib/Release/python/venv/lib/python3.11/site-packages/pyflamegpu/pyflamegpu.py", line 9255, in simulate
    return _pyflamegpu.CUDASimulation_simulate(self, *args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pyflamegpu.pyflamegpu.FLAMEGPURuntimeException: (DeviceError) Device function 'inputdata' reported 40000 errors.
First error:
flamegpu/runtime/messaging/MessageSpatial3D/MessageSpatial3DDevice.cuh(545)[39,0,0][608,0,0]:
Spatial messaging radius (0.05) is not a factor of environment dimensions (1, 1, 1), this is unsupported for the wrapped iterator, MessageSpatial3D::In::wrap().

Likewise, I called the same example from build rather than its own dir, and got much the same output.

I have no idea where stdout is going. I'm not redirecting anything in my shell. Is stdout redirected by default somewhere? If so where?

Runtime compilation errors should go to regular stdout by default; I wrongly assumed you might be running on HPC or similar that splits stdout/stderr into separate files. It's handled by a 3rd party lib, and I can't recall the reason it didn't/doesn't work properly on Google Colab.
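As a debugging aid, one way to rule out the shell is to capture stdout at the file-descriptor level, below Python's `sys.stdout` — native libraries (such as a JIT compiler's logging) often write straight to fd 1 and bypass the Python-level stream. A minimal, FLAMEGPU-agnostic sketch (`capture_native_stdout` is a hypothetical helper, not part of pyflamegpu):

```python
import os
import sys
import tempfile

def capture_native_stdout(fn):
    """Run fn() with file descriptor 1 redirected into a temp file.

    This captures output written by native code directly to fd 1,
    which bypasses Python's sys.stdout object entirely.
    """
    sys.stdout.flush()
    saved_fd = os.dup(1)                   # keep a copy of the real stdout
    try:
        with tempfile.TemporaryFile(mode="w+b") as tmp:
            os.dup2(tmp.fileno(), 1)       # fd 1 now points at the temp file
            try:
                fn()
            finally:
                sys.stdout.flush()
                os.dup2(saved_fd, 1)       # restore the real stdout
            tmp.seek(0)
            return tmp.read().decode()
    finally:
        os.close(saved_fd)

# Simulate a native library writing straight to fd 1 (in the real case
# this would wrap e.g. cudaSimulation.initialise(sys.argv)):
log = capture_native_stdout(lambda: os.write(1, b"simulated JIT log\n"))
```

If the Jitify log shows up in `log` but never in the terminal, something between fd 1 and the terminal is eating it.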

I forced a compilation error in the same example, and this is how it appeared.

(venv) (py311) rob@mavericks:~/fgpu2/examples/python_native/boids_spatial3D_wrapped$ python boids_spatial3D.py
---------------------------------------------------
--- JIT compile log for inputdata_program ---
---------------------------------------------------
inputdata_impl.cu(37): error: too few arguments in function call
              auto separation = vec3Length((agent_x - message_x), (agent_y - message_y));
                                                                                       ^

1 error detected in the compilation of "inputdata_program".

---------------------------------------------------
Traceback (most recent call last):
  File "/home/rob/fgpu2/examples/python_native/boids_spatial3D_wrapped/boids_spatial3D.py", line 388, in <module>
    cudaSimulation.initialise(sys.argv)
  File "/home/rob/fgpu2/build/lib/Release/python/venv/lib/python3.11/site-packages/pyflamegpu/pyflamegpu.py", line 9089, in initialise
    return _pyflamegpu.Simulation_initialise(self, argc)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pyflamegpu.pyflamegpu.FLAMEGPURuntimeException: (InvalidAgentFunc) /home/rob/fgpu2/src/flamegpu/detail/JitifyCache.cu(422): Error compiling runtime agent function (or function condition) ('inputdata'): function had compilation errors (see std::cout), in JitifyCache::buildProgram().

Technically it could be a non-compilation exception being thrown by Jitify, hence no compilation log, but I'm not sure what that would be.

At src/flamegpu/detail/JitifyCache:420-425 you will find the try/catch that is eating this exception.

    } catch (std::runtime_error const&) {
        // jitify does not have a method for getting compile logs so rely on JITIFY_PRINT_LOG defined in cmake
        THROW exception::InvalidAgentFunc("Error compiling runtime agent function (or function condition) ('%s'): function had compilation errors (see std::cout), "
            "in JitifyCache::buildProgram().",
            func_name.c_str());
    }

If you replace that with the below statement, recompile pyflamegpu and run the example again you may get more useful information out.

    } catch (std::runtime_error const& e) {
        printf("%s\n", e.what());  // pass e.what() as an argument, not as the format string
        throw;
    }

Given I can't reproduce it locally, it's difficult for me to suggest much else at this time (and I suspect similar of my colleagues).

ptheywood (Member) commented

It's handled by a 3rd party lib, and I can't recall the reason it didn't/doesn't work properly on Google collab.

The version of jupyter/ipykernel on Google Colab has a bug that consumes stderr. This was fixed in ipykernel in 2021, and I opened an issue with Colab about this in 2021 (googlecolab/colabtools#2230), but Colab is still running ipykernel 5.5.6.

FLAMEGPU/FLAMEGPU2-tutorial-python#10


I've attempted to reproduce your issue as well under Linux with visualisation, but as @Robadob found, the example compiles successfully for me before hitting the known runtime issue with wrapped communication, with commit b0ec5f3 (current master), nvcc 12.2.140, gcc 11.4.0.

cmake .. -DCMAKE_CUDA_ARCHITECTURES=86 -DFLAMEGPU_BUILD_PYTHON=ON -DFLAMEGPU_VISUALISATION=ON
cmake --build . --target pyflamegpu -j 8 
source lib/Release/python/venv/bin/activate
$ python3 ../examples/python_native/boids_spatial3D_wrapped/boids_spatial3D.py 
instanced_default_Tcolor_Tpos_Tdir_Tscale-material_flat_Tcolor: Generic vertex attrib named: _normal2 was not found.
instanced_default_Tcolor_Tpos_Tdir_Tscale-material_flat_Tcolor: Generic vertex attrib named: _normal2 was not found.
Traceback (most recent call last):
  File "/home/ptheywood/code/flamegpu/FLAMEGPU2/build-12-2-vis/../examples/python_native/boids_spatial3D_wrapped/boids_spatial3D.py", line 430, in <module>
    cudaSimulation.simulate()
  File "/home/ptheywood/code/flamegpu/FLAMEGPU2/build-12-2-vis/lib/Release/python/venv/lib/python3.10/site-packages/pyflamegpu/pyflamegpu.py", line 9255, in simulate
    return _pyflamegpu.CUDASimulation_simulate(self, *args)
pyflamegpu.pyflamegpu.FLAMEGPURuntimeException: (DeviceError) Device function 'inputdata' reported 40000 errors.
First error:
flamegpu/runtime/messaging/MessageSpatial3D/MessageSpatial3DDevice.cuh(545)[12,0,0][864,0,0]:
Spatial messaging radius (0.05) is not a factor of environment dimensions (1, 1, 1), this is unsupported for the wrapped iterator, MessageSpatial3D::In::wrap().

Unfortunately this does not help us narrow down the issues you are having.

Given you appear to have built pyflamegpu from source successfully, you must have completed the CUDA post-installation steps too (so LD_LIBRARY_PATH wouldn't be the issue).

You also don't appear to have multiple CUDA installations (at least not via apt/dpkg); my next suggestion would have been to check the value of the CUDA_HOME or CUDA_PATH environment variable at runtime.


Are you just running this in a bash terminal, or from within an editor's terminal or similar? Unless you are running via an old version of jupyter, I'm not aware of any reason why the stdout would not be getting printed.

society-research (Author) commented

Are you just running this in a bash terminal, or from within an editor's terminal or similar? Unless you are running via an old version of jupyter I'm not aware of any reasons why the stdout would not be getting printed.

Yes, I'm running in a plain zsh; nothing fancy around this shell.

Same issue post-runtime compilation that I was getting on Windows (with Visualisation) yesterday (already known #1177, with a few suitable workarounds).

I had to switch to another GPU hosting service; now I'm getting the same issue as mentioned in #1177 with both Release and Debug builds, and can no longer reproduce the vanishing stdout.

So from my side this issue here is closed, since I can no longer reproduce it, thanks both of you for your quick support! 🙏

What are the workarounds for that issue? Just disable SEATBELTS?

ptheywood (Member) commented

Yes, I'm running in a plain zsh nothing fancy around this shell.

It should be fine then as far as I'm aware.

So from my side this issue here is closed, since I can no longer reproduce it, thanks both of you for your quick support! 🙏

No problem, I'll close this issue for now but feel free to re-open it if you re-encounter the original problem.

What are the workarounds for that issue? Just disable SEATBELTS?

Yes, a build disabling seatbelts via FLAMEGPU_SEATBELTS=OFF should disable the radius factor check (but error messages will be less helpful in general, unfortunately, although model runtimes will improve due to fewer checks).

Unfortunately, due to other commitments I'm not sure when I'll have time to fully resolve #1177 (via #1182).
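For reference, a seatbelts-off rebuild might look like the following (a sketch reusing the configure flags from earlier in this thread; adjust the architecture, paths, and job count to your setup):

```shell
# Reconfigure from the build directory with device-side error checking
# (seatbelts) disabled, then rebuild the Python bindings.
cmake .. -DCMAKE_CUDA_ARCHITECTURES=86 -DFLAMEGPU_BUILD_PYTHON=ON -DFLAMEGPU_VISUALISATION=ON -DFLAMEGPU_SEATBELTS=OFF
cmake --build . --target pyflamegpu -j 8
```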

Robadob (Member) commented Apr 7, 2024

What are the workarounds for that issue? Just disable SEATBELTS?

In addition to disabling seatbelts (FLAMEGPU_SEATBELTS=OFF), you can also switch to using non-wrapped spatial messages.

Or simply comment out the check (as you are compiling yourself); let me know if you would like the line numbers.
Adjusting the environment size slightly would also probably work, and might be a sensible temporary patch on our part.
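A plausible reason a radius of 0.05 trips the check even though 0.05 × 20 = 1 mathematically: 0.05 has no exact binary floating-point representation, so an fmod-style "is the radius a factor of the environment width?" test sees a remainder of almost a whole radius instead of zero, while radii that are exact binary fractions (e.g. 1/16 = 0.0625) divide cleanly. A quick illustration (the exact check FLAMEGPU performs may differ):

```python
import math

# 0.05 stored as a double is slightly above 0.05, so 19 radii fit into
# a width of 1.0 with a remainder of nearly a full radius, not 20 exactly.
remainder = math.fmod(1.0, 0.05)

# An exact binary fraction (0.0625 == 1/16) divides the width with
# a remainder of exactly zero.
exact_remainder = math.fmod(1.0, 0.0625)
```

So nudging the radius to a nearby binary fraction, or scaling the environment dimensions to be an exact multiple of the radius in floating point, is another way to sidestep the check without rebuilding.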
