NVC performance with Vunit #1036

Closed
Blebowski opened this issue Jun 25, 2024 · 16 comments

@Blebowski

Hi,

I am trying to port my project:
https://github.com/Blebowski/CTU-CAN-FD

that I run with VUnit, from GHDL to NVC. I managed to get everything working, and I can run the regression
with the same results in GHDL and in NVC.

The issue I face is that the NVC run-time is almost double the GHDL run-time (when executed via VUnit; I have
not tried to profile "raw" commands typed into the command line).

The analysis time of NVC is much shorter, but since analysis takes only a fraction of the overall regression, GHDL
wins on total run-time.

I would be curious to find out why this occurs. AFAICT, NVC is the faster simulator.

My design contains large constant arrays of large records that presumably take a long time to elaborate.
When I run a single test with -v, I see that VUnit always elaborates the design with NVC. Therefore,
each test requires a separate elaboration, since NVC does not support setting generics at run time.

Does VUnit elaborate each test separately in GHDL too? Or does VUnit, in some smart manner, pass the generics
such as the test name, seed or other numeric parameters only to the GHDL simulation?
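
For context, my VUnit run script sets per-test generics roughly like this (a simplified sketch with hypothetical library, file and testbench names, not my actual run.py). Each add_config gets its own generic values, which is why I assume each test needs a separate elaboration whenever the simulator cannot override generics at run time:

# Minimal VUnit run.py sketch (hypothetical names).
# Every add_config becomes a separate simulation with its own generics.
from vunit import VUnit

vu = VUnit.from_argv()
lib = vu.add_library("lib")
lib.add_source_files("tb/*.vhd")  # hypothetical path

tb = lib.test_bench("tb_example")  # hypothetical testbench name
for seed in range(3):
    tb.add_config(name="seed_%d" % seed, generics=dict(seed=seed))

vu.main()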

@LarsAsplund
Collaborator

I would suggest that you first try without VUnit, to see whether the differences are related to VUnit, to the design you have, or to differences between the simulators.

I can't think of any obvious reason for why VUnit would have this impact on performance. There are no special tricks related to how we pass generics. In both cases, we use the -g option.

Elaboration can take some time. I've noticed with GHDL that some of its advantages in terms of startup performance go away if many contexts are included in a testbench.

Generally speaking, I've experienced significant performance differences between simulators but it varies a lot with the design. X can be twice as fast as Y on one design and then Y is twice as fast as X for another.

For many (fast) unit tests, the raw simulator performance is irrelevant. It is the startup time of the simulator that dominates.

It should also be noted that the simulator interface we have for NVC is a contribution from @nickg so I would assume that there are no major performance-killing issues in that code.

@nickg
Contributor

nickg commented Jun 25, 2024

@Blebowski can you provide some instructions for running the testbench? I found a run.py in that repository but it looks like it needs some additional arguments.

@Blebowski
Author

Hi,
I will prepare a reproducer that compares these two.

@Blebowski
Author

Hi,

the steps to run the comparison:

git clone https://gitlab.fel.cvut.cz/canbus/ctucanfd_ip_core.git ctu_can_fd_benchmark
cd ctu_can_fd_benchmark
git checkout port-regression-to-nvc
export CTU_TB_TOP_TARGET="tb_ctu_can_fd_rtl_vunit" 
cd test
VUNIT_SIMULATOR=ghdl ./run.py tb_rtl_test_fast_asic -p`nproc`

To run with NVC, just change the value of the VUNIT_SIMULATOR variable.

On my PC, nproc reports 20 cores. You can see that the individual run-times of
each simulation are much longer in NVC than in GHDL.

I first thought this was due to VUnit reusing a single elaboration for GHDL and
passing the generics to the elaborated binary (the GHDL docs claim this is supported).

When I use only a single core at a time (no -p), I get better performance in NVC.
The overall run-time in that case is of course terrible, since all the simulations
are executed one by one.

My results are:

ghdl.txt
nvc_jit.txt
nvc_no_jit.txt
nvc_no_parallel_runs.txt

I use an NVC build from a couple of days ago. My GHDL and VUnit installations are from autumn
of last year; I hope this does not cause the issue.

Could this be caused by a mutex that NVC holds on the compiled libraries during
elaboration? So that if multiple elaborations run at the same time due to -p,
only a single elaboration can read the code compiled into the libraries at a time?

@nickg
Contributor

nickg commented Jun 26, 2024

Can you try setting the environment variable NVC_MAX_THREADS=1 with -p$(nproc)? NVC will create up to 8 background threads for JIT compilation, which is probably too many here. Another combination to try might be NVC_MAX_THREADS=2 with -p$(($(nproc) / 2)).
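
If it is more convenient, the variable can also be set from inside run.py before VUnit starts, for example (a rough sketch, assuming the usual VUnit entry point; the library and file names are placeholders):

# Sketch: cap NVC's background JIT threads by exporting the variable
# from run.py; the simulator child processes inherit the environment.
import os
os.environ["NVC_MAX_THREADS"] = "1"

from vunit import VUnit

vu = VUnit.from_argv()
lib = vu.add_library("lib")         # placeholder library name
lib.add_source_files("src/*.vhd")   # placeholder path
vu.main()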

Could this be caused by mutex that is held by NVC on compiled libraries during the
elaboration ?

Yes, that might be an issue too. NVC uses a reader/writer lock on the library, which means writing to the library requires that there are no concurrent readers (but a library can be read concurrently by multiple processes). However, I think VUnit passes --no-save when elaborating, which should avoid this. Anyway, thanks for the instructions, I'll try it at the weekend.

@Blebowski
Author

Hi,
setting the variable helps; the run-times are better. With JIT I now get to 15000 seconds of total run-time instead of 18000. Without JIT, I get to 14000-something.

nvc_max_threads_1.txt

@nickg
Contributor

nickg commented Jun 29, 2024

The reference_data_set_* packages take a long time to compile. I think the difference with GHDL is that the GCC/LLVM backends compile these packages once, whereas NVC generates code for them each time it runs a simulation (GHDL mcode should have similar behaviour). I've made some improvements to speed this up; can you try with the latest master branch? I'll also do something to make NVC aware of how many concurrent jobs VUnit is running so it can scale its worker threads accordingly.

@nickg
Contributor

nickg commented Jun 30, 2024

Please also try with the VUnit changes in #1037.

@Blebowski
Author

Hi @nickg,

I will give it a try next weekend (currently vacationing) and let you know.

I confirm that GHDL is using the LLVM backend.

It makes sense that the reference data set packages compile slowly; they are huge constant arrays.
GHDL takes much longer than NVC to analyse them, which fits what you are saying: if GHDL emits
the code for these packages during analysis, while NVC does so during elaboration, that is logical.

@Blebowski
Author

Blebowski commented Jul 6, 2024

Hi,

@nickg, I have tested with the latest NVC and the VUnit commit you referenced. The results are much better:
nvc_jit_after_fix.txt

Now the overall run-time of the regression with -p 20 is just one and a half minutes longer than with GHDL.
Clearly, this is caused by the longer elaborations and the code being emitted for the constants from the reference_data_set_* packages.
Short tests (e.g. device_id) take less time in GHDL, while long tests (e.g. data_set_*) take less time in NVC,
showing that NVC simulation is indeed faster.

Also, it makes sense that GHDL emits code for reference_data_set_* only once; that is likely why
it takes longer to analyse those packages in GHDL.

The reference_data_set_* packages contain only long arrays of records (some golden CAN frame data).
I originally had this data in text files read by the TB, but I converted them to packages to make TB
bring-up simpler in other frameworks or simulators (no file paths need to be provided, no relative/absolute
path differences, etc.).

These long arrays of constants are only used in the data_set_* tests: when a process gets triggered,
one of these arrays is assigned to a variable, iterated through, and sent to the DUT:
reference_test_agent.vhd

Do you think it would be possible, with --jit, to emit the code for such long constant arrays only once the constant
gets accessed? Then the code for these would be generated only in the data_set_* tests.
Sure, once the constants affect the hierarchy, or the width of a signal (which always affects the run-time model),
it would not be possible. But in my TB this is not the case; it is a plain copy from a constant into a variable
of the same type, selected by test name.

@Blebowski
Author

Either way, I will close this issue, and thank you for your help.

This has actually unblocked me on porting my project to NVC as well, and finally trying out the
coverage feature :)

@nickg
Contributor

nickg commented Jul 6, 2024

Have you tried using a smaller number of concurrent jobs like -p 10? In my testing with current NVC/VUnit master it's faster than ghdl-gcc on all -p values up to 16 on a machine with 24 logical cores. I can't test more because there's not enough memory. Have you checked it isn't swapping with -p 20? One unfortunate issue with the LLVM JIT is that it uses a lot of memory: I saw each process was using about 1.5 GB.

@Blebowski
Author

Blebowski commented Jul 6, 2024

Hi,

my CPU is a 12th Gen Intel with big+little cores (6 performance + 8 efficient). The performance cores have hyper-threading, so the remaining 6 logical cores come from there. I have 64 GB of RAM, and even with -p20 I have about 20 GB left.

I re-ran with various -p values. The shortest "Elapsed time" is with 14 cores. Could that be explained by there being only 14 physical cores? My guess would be yes.

As the number of cores grows, the total time spent by the simulations also grows. The differences are somewhat flaky though,
so at least 5 iterations of each would be good. If I have some more time I will try to write a script to profile it and produce some charts; a rough sketch of such a script follows the attached logs below.

nvc_jit_after_fix_p10.txt
nvc_jit_after_fix_p12.txt
nvc_jit_after_fix_p14.txt
nvc_jit_after_fix_p16.txt
nvc_jit_after_fix_p18.txt
nvc_jit_after_fix_p20.txt
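
The profiling script I have in mind would be roughly this (an untested sketch; it just re-runs the regression several times per -p value from the test directory and averages the elapsed wall-clock time):

# Untested sketch: average the regression's wall-clock time over several
# iterations for each -p value, to smooth out the noisy differences.
# Assumes it is run from the test directory and that all tests pass
# (run.py exits with 0), otherwise check=True raises an exception.
import os
import subprocess
import time

P_VALUES = [10, 12, 14, 16, 18, 20]
ITERATIONS = 5

for p in P_VALUES:
    times = []
    for _ in range(ITERATIONS):
        start = time.monotonic()
        subprocess.run(
            ["./run.py", "tb_rtl_test_fast_asic", "-p%d" % p],
            env={**os.environ, "VUNIT_SIMULATOR": "nvc"},
            check=True,
        )
        times.append(time.monotonic() - start)
    print("-p%d: %.1f s average over %d runs"
          % (p, sum(times) / len(times), ITERATIONS))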

@Blebowski
Author

Blebowski commented Jul 6, 2024

The run-time of a test likely depends on the kind of core used to execute it. So device_id running for around 4 seconds with -p10 and -p12, and 12-14 seconds at higher -p values, can be caused by that. Beyond that, the cache can have an influence, but I don't know how to measure it.

Either way, -p14 is the best in my case and comparable to GHDL.
Actually better, because GHDL takes about 10 seconds to analyse each reference_data_set_*.vhd,
so the overall regression run-time is higher.

@nickg
Contributor

nickg commented Jul 7, 2024

Are you using a build with --enable-debug? There's at least a 2x slow-down for elaboration with that on due to extra checking.

The shortest "Elapsed time" is with 14 cores. Could that be explained by only 14 physical cores ? My guess would be yes.

Each pair of hyperthreads shares an L1/L2 cache, so if one thread is running it can use all of the cache, whereas if both are running each is only guaranteed half of it. So you should see a higher rate of cache misses when both are active, and VHDL simulations tend to be quite memory bound (i.e. they don't tend to stress the compute resources of the CPU).

Do you think it would be possible to emit the code for such long constant arrays only once the constant gets accessed with --jit ?

At some point I want to implement a cache for the JIT so that it can re-use machine code if the source code hasn't changed. But it's quite complex to get right so I probably won't do it soon.

@Blebowski
Author

No, I configure without --enable-debug.

At some point I want to implement a cache for the JIT so that it can re-use machine code if the source code hasn't changed. But it's quite complex to get right so I probably won't do it soon.

I am looking forward to it.
