NVC performance with Vunit #1036

Closed
Blebowski opened this issue Jun 25, 2024 · 16 comments

@Blebowski

Hi,

I am trying to port my project:
https://github.com/Blebowski/CTU-CAN-FD

that I run with VUnit, from GHDL to NVC. I managed to get everything working, and I can run the regression
with the same results in GHDL and in NVC.

The issue I face is that the NVC run-time is almost double the GHDL run-time (when executed via VUnit; I have
not tried to profile "raw" commands typed into the command line).

The analysis time of NVC is much shorter, but since analysis takes only a fraction of the overall regression, GHDL
wins on total run-time.

I would be curious to find out why this occurs. AFAICT, NVC is the faster simulator.

My design contains large constant arrays of large records that presumably take a long time to elaborate.
When I run a single test with -v, I see that VUnit always elaborates the design with NVC. Therefore,
each test requires a separate elaboration, since NVC does not support setting generics at run time.

Does VUnit elaborate each test separately in GHDL too? Or does VUnit, in some smart manner, pass the generics
such as the test name, seed or other numeric parameters only to the GHDL simulation?
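
For context, my VUnit run script sets per-test generics roughly like this (a simplified sketch with hypothetical library, file and testbench names, not my actual run.py). Each add_config gets its own generic values, which is why I assume each test needs a separate elaboration whenever the simulator cannot override generics at run time:

# Minimal VUnit run.py sketch (hypothetical names).
# Every add_config becomes a separate simulation with its own generics.
from vunit import VUnit

vu = VUnit.from_argv()
lib = vu.add_library("lib")
lib.add_source_files("tb/*.vhd")  # hypothetical path

tb = lib.test_bench("tb_example")  # hypothetical testbench name
for seed in range(3):
    tb.add_config(name="seed_%d" % seed, generics=dict(seed=seed))

vu.main()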

@LarsAsplund
Collaborator

I would suggest that you first try without VUnit, to see whether the differences are related to VUnit, to the design you have, or to differences between the simulators.

I can't think of any obvious reason for why VUnit would have this impact on performance. There are no special tricks related to how we pass generics. In both cases, we use the -g option.

Elaboration can take some time. I've noticed with GHDL that some of its advantages in terms of startup performance go away if many contexts are included in a testbench.

Generally speaking, I've experienced significant performance differences between simulators but it varies a lot with the design. X can be twice as fast as Y on one design and then Y is twice as fast as X for another.

For many (fast) unit tests, the raw simulator performance is irrelevant. It is the startup time of the simulator that dominates.

It should also be noted that the simulator interface we have for NVC is a contribution from @nickg so I would assume that there are no major performance-killing issues in that code.

@nickg
Contributor

nickg commented Jun 25, 2024

@Blebowski can you provide some instructions for running the testbench? I found a run.py in that repository but it looks like it needs some additional arguments.

@Blebowski
Author

Hi,
I will prepare a reproducer that compares these two.

@Blebowski
Author

Hi,

the steps to run the comparison:

git clone https://gitlab.fel.cvut.cz/canbus/ctucanfd_ip_core.git ctu_can_fd_benchmark
cd ctu_can_fd_benchmark
git checkout port-regression-to-nvc
export CTU_TB_TOP_TARGET="tb_ctu_can_fd_rtl_vunit" 
cd test
VUNIT_SIMULATOR=ghdl ./run.py tb_rtl_test_fast_asic -p`nproc`

To run with NVC, just change the value of the VUNIT_SIMULATOR variable.

On my PC, nproc reports 20 cores. You can see that the individual run-times of
each simulation are much longer in NVC than in GHDL.

I first thought this was due to VUnit reusing a single elaboration for GHDL and
passing the generics to the elaborated binary (the GHDL docs claim this is supported).

When I use only a single core at a time (no -p), I get better performance in NVC.
The overall run-time in that case is of course terrible, since all the simulations
are executed one by one.

My results are:

ghdl.txt
nvc_jit.txt
nvc_no_jit.txt
nvc_no_parallel_runs.txt

I use an NVC build from a couple of days ago. My GHDL and VUnit installations are from autumn
of last year; I hope this does not cause the issue.

Could this be caused by a mutex that NVC holds on the compiled libraries during
elaboration? So that if multiple elaborations run at the same time due to -p,
only a single elaboration can read the code compiled into the libraries at a time?

@nickg
Contributor

nickg commented Jun 26, 2024

Can you try setting the environment variable NVC_MAX_THREADS=1 with -p$(nproc)? NVC will create up to 8 background threads for JIT compilation, which is probably too many here. Another combination to try might be NVC_MAX_THREADS=2 with -p$(($(nproc) / 2)).
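
If it is more convenient, the variable can also be set from inside run.py before VUnit starts, for example (a rough sketch, assuming the usual VUnit entry point; the library and file names are placeholders):

# Sketch: cap NVC's background JIT threads by exporting the variable
# from run.py; the simulator child processes inherit the environment.
import os
os.environ["NVC_MAX_THREADS"] = "1"

from vunit import VUnit

vu = VUnit.from_argv()
lib = vu.add_library("lib")         # placeholder library name
lib.add_source_files("src/*.vhd")   # placeholder path
vu.main()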

Could this be caused by mutex that is held by NVC on compiled libraries during the
elaboration ?

Yes, that might be an issue too. NVC uses a reader/writer lock on the library, which means writing to the library requires that there are no concurrent readers (but a library can be read concurrently by multiple processes). However, I think VUnit passes --no-save when elaborating, which should avoid this. Anyway, thanks for the instructions, I'll try it at the weekend.

@Blebowski
Author

Hi,
setting the variable helps; the run-times are better. With JIT I now get to 15000 seconds of total run-time instead of 18000. Without JIT, I get to 14000-something.

nvc_max_threads_1.txt

@nickg
Contributor

nickg commented Jun 29, 2024

The reference_data_set_* packages take a long time to compile. I think the difference with GHDL is that the GCC/LLVM backends compile these packages once, whereas NVC generates code for them each time it runs a simulation (GHDL mcode should have similar behaviour). I've made some improvements to speed this up; can you try with the latest master branch? I'll also do something to make NVC aware of how many concurrent jobs VUnit is running so it can scale its worker threads accordingly.

@nickg
Contributor

nickg commented Jun 30, 2024

Please also try with the VUnit changes in #1037.

@Blebowski
Author

Hi @nickg,

I will give it a try next weekend (currently vacationing) and let you know.

I confirm that GHDL is using the LLVM backend.

It makes sense that the reference data set packages compile slowly; they are huge constant arrays.
GHDL takes much longer than NVC to analyse them, which fits what you are saying: if GHDL emits
the code for these packages during analysis, while NVC does so during elaboration, that is logical.

@Blebowski
Author

Blebowski commented Jul 6, 2024

Hi,

@nickg, I have tested with the latest NVC and the VUnit commit you referenced. The results are much better:
nvc_jit_after_fix.txt

Now the overall run-time of the regression with -p 20 is just one and a half minutes longer than with GHDL.
Clearly, this is caused by the longer elaborations and the code being emitted for the constants from the reference_data_set_* packages.
Short tests (e.g. device_id) take less time in GHDL, while long tests (e.g. data_set_*) take less time in NVC,
showing that NVC simulation is indeed faster.

Also, it makes sense that GHDL emits code for reference_data_set_* only once; that is likely why
it takes longer to analyse those packages in GHDL.

The reference_data_set_* packages contain only long arrays of records (some golden CAN frame data).
I originally had this data in text files read by the TB, but I converted them to packages to make TB
bring-up simpler in other frameworks or simulators (no file paths need to be provided, no relative/absolute
path differences, etc.).

These long arrays of constants are only used in the data_set_* tests: when a process gets triggered,
one of these arrays is assigned to a variable, iterated through, and sent to the DUT:
reference_test_agent.vhd

Do you think it would be possible, with --jit, to emit the code for such long constant arrays only once the constant
gets accessed? Then the code for these would be generated only in the data_set_* tests.
Sure, once the constants affect the hierarchy, or the width of a signal (which always affects the run-time model),
it would not be possible. But in my TB this is not the case; it is a plain copy from a constant into a variable
of the same type, selected by test name.

@Blebowski
Author

Either way, I will close this issue, and thank you for your help.

This has actually unblocked me on porting my project to NVC as well, and finally trying out the
coverage feature :)

@nickg
Contributor

nickg commented Jul 6, 2024

Have you tried using a smaller number of concurrent jobs like -p 10? In my testing with current NVC/VUnit master it's faster than ghdl-gcc on all -p values up to 16 on a machine with 24 logical cores. I can't test more because there's not enough memory. Have you checked it isn't swapping with -p 20? One unfortunate issue with the LLVM JIT is that it uses a lot of memory: I saw each process was using about 1.5 GB.

@Blebowski
Author

Blebowski commented Jul 6, 2024

Hi,

my CPU is a 12th Gen Intel with big+little cores (6 performance + 8 efficient). The performance cores have hyper-threading, so the remaining 6 logical cores come from there. I have 64 GB of RAM, and even with -p20 I have about 20 GB left.

I re-ran with various -p values. The shortest "Elapsed time" is with 14 cores. Could that be explained by there being only 14 physical cores? My guess would be yes.

As the number of cores grows, the total time spent by the simulations also grows. The differences are somewhat flaky though,
so at least 5 iterations of each would be good. If I have some more time I will try to write a script to profile it and produce some charts; a rough sketch of such a script follows the attached logs below.

nvc_jit_after_fix_p10.txt
nvc_jit_after_fix_p12.txt
nvc_jit_after_fix_p14.txt
nvc_jit_after_fix_p16.txt
nvc_jit_after_fix_p18.txt
nvc_jit_after_fix_p20.txt
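
The profiling script I have in mind would be roughly this (an untested sketch; it just re-runs the regression several times per -p value from the test directory and averages the elapsed wall-clock time):

# Untested sketch: average the regression's wall-clock time over several
# iterations for each -p value, to smooth out the noisy differences.
# Assumes it is run from the test directory and that all tests pass
# (run.py exits with 0), otherwise check=True raises an exception.
import os
import subprocess
import time

P_VALUES = [10, 12, 14, 16, 18, 20]
ITERATIONS = 5

for p in P_VALUES:
    times = []
    for _ in range(ITERATIONS):
        start = time.monotonic()
        subprocess.run(
            ["./run.py", "tb_rtl_test_fast_asic", "-p%d" % p],
            env={**os.environ, "VUNIT_SIMULATOR": "nvc"},
            check=True,
        )
        times.append(time.monotonic() - start)
    print("-p%d: %.1f s average over %d runs"
          % (p, sum(times) / len(times), ITERATIONS))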

@Blebowski
Author

Blebowski commented Jul 6, 2024

The run-time of a test likely depends on the kind of core used to execute it. So device_id running for around 4 seconds with -p10 and -p12, and 12-14 seconds at higher -p values, can be caused by that. Beyond that, the cache can have an influence, but I don't know how to measure it.

Either way, -p14 is the best in my case and comparable to GHDL.
Actually better, because GHDL takes about 10 seconds to analyse each reference_data_set_*.vhd,
so the overall regression run-time is higher.

@nickg
Contributor

nickg commented Jul 7, 2024

Are you using a build with --enable-debug? There's at least a 2x slow-down for elaboration with that on due to extra checking.

The shortest "Elapsed time" is with 14 cores. Could that be explained by only 14 physical cores ? My guess would be yes.

Each pair of hyperthreads shares an L1/L2 cache, so if one thread is running it can use all of the cache, whereas if both are running each is only guaranteed half of it. So you should see a higher rate of cache misses when both are active, and VHDL simulations tend to be quite memory bound (i.e. they don't tend to stress the compute resources of the CPU).

Do you think it would be possible to emit the code for such long constant arrays only once the constant gets accessed with --jit ?

At some point I want to implement a cache for the JIT so that it can re-use machine code if the source code hasn't changed. But it's quite complex to get right so I probably won't do it soon.

@Blebowski
Author

No, I configure without --enable-debug.

At some point I want to implement a cache for the JIT so that it can re-use machine code if the source code hasn't changed. But it's quite complex to get right so I probably won't do it soon.

I am looking forward to it.
