add utility to perform timings and some performance improvements #1237

Merged
merged 9 commits into from
Sep 9, 2023

KrisThielemans
Collaborator

Allow standardised timings. Could do with other tests of course. @gschramm @markus-jehl @NicoleJurjew want to have a look?

@gschramm
Contributor

Thanks Kris. This looks super interesting. Unfortunately, I am currently busy relocating to Leuven and moving to a new house. @markus-jehl could you have a first look?

@KrisThielemans
Collaborator Author

WARNING: This branch will be subject to rebases etc and force-pushed occasionally to keep history clean.

TimedObject is not thread-safe, and timing results were incorrect.
Currently just remove the calls.

work-around UCL#1238
The loop to construct xstart/end etc is now multi-threaded (although a little bit uglier!).

Testing shows a speed-up of about a factor of 2-3. Using too many threads is counterproductive, so
I limited the number of threads to 8 (not necessarily optimal!).
Timers were stopped too early due to nested calls. This is now
checked by asserts (by adding HighResWallClockTimer), allowing me
to catch these problems.
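
As an illustration of the kind of check described above, here is a minimal Python sketch of a timer that asserts when it is started twice or stopped while idle. The class and function names are hypothetical; this is not the TimedObject or HighResWallClockTimer code from this PR, only the general idea.

```python
import time


class WallClockTimer:
    """Hypothetical stand-in for a wall-clock timer (not STIR's class)."""

    def __init__(self):
        self._started_at = None  # None means "not running"
        self.elapsed = 0.0       # accumulated seconds

    def start(self):
        # A second start() without a stop() in between signals a nested call.
        assert self._started_at is None, "timer already running (nested call?)"
        self._started_at = time.perf_counter()

    def stop(self):
        assert self._started_at is not None, "timer stopped while not running"
        self.elapsed += time.perf_counter() - self._started_at
        self._started_at = None


def inner(timer):
    timer.start()
    timer.stop()


def outer(timer):
    timer.start()
    inner(timer)  # nested start on the same timer: trips the assert
    timer.stop()
```

Calling inner on its own works, while outer fails immediately at the nested start instead of silently reporting a time that is too short.
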
@KrisThielemans
Collaborator Author

Example timings that I'm currently getting on my desktop (AMD Ryzen 9 5900 12-Core Processor, 3001 MHz, 12 cores, 24 logical processors; 32GB RAM; GeForce RTX 3070; WSL2 with gcc 11.4.0 and nvcc 12.2) for a set-up similar to that of @gschramm (https://arxiv.org/pdf/2212.12519v1.pdf), i.e. DMI 4-ring span=1 but only 8 views, 215x215x71 image

Test                                                    CPU (ms)                 Wall-clock (ms)
copy_image                                                 1.250                           1.235
PMRT_projector_setup                                     141.125                           6.051
PMRT_forward_file_first                                13050.000                         731.934
PMRT_forward_file                                      12590.000                         535.147
PMRT_forward_memory                                    12740.000                         536.942
PMRT_back_file_first                                   15590.000                         924.100
PMRT_back_file                                         16700.000                         771.430
PMRT_back_memory                                       16790.000                         765.870
PP_projector_setup                                       810.000                         116.063
PP_forward_file_first                                    970.000                        1094.286
PP_forward_file                                          970.000                          68.710
PP_forward_memory                                        958.750                          68.178
PP_back_file_first                                      1200.000                         490.020
PP_back_file                                            1317.500                         256.327
PP_back_memory                                          1330.000                         264.932

with first column CPU and 2nd wall-clock time, both in ms.
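
To make the relation between the two columns concrete, here is a small sketch that divides CPU time by wall-clock time for two rows copied from the 8-view table above. It assumes that the CPU column is summed over all threads (as process CPU time usually is), so the ratio is a rough estimate of how many cores were busy on average. The function name is made up for this example.

```python
# (CPU ms, wall-clock ms) copied from the 8-view table above.
timings = {
    "PMRT_forward_file": (12590.000, 535.147),
    "PP_forward_file": (970.000, 68.710),
}


def cpu_to_wall_ratio(cpu_ms, wall_ms):
    """Average number of busy cores implied by the two columns."""
    return cpu_ms / wall_ms


for name, (cpu_ms, wall_ms) in timings.items():
    print(f"{name}: {cpu_to_wall_ratio(cpu_ms, wall_ms):.1f}")
```

For PMRT_forward_file the ratio is between 23 and 24, i.e. close to the 24 logical processors of the machine.
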

For comparison with all 272 views

Test                                                    CPU (ms)                 Wall-clock (ms)
copy_image                                                 1.250                           1.508
PMRT_projector_setup                                     139.000                           6.327
PMRT_forward_file_first                               423650.000                       42344.702
PMRT_forward_file                                     469500.000                       20058.765
PMRT_forward_memory                                   480970.000                       20464.665
PMRT_back_file_first                                  557800.000                       23641.086
PMRT_back_file                                        559140.000                       23672.374
PMRT_back_memory                                      559940.000                       23645.234
PP_projector_setup                                     20290.000                        9662.081
PP_forward_file_first                                  38420.000                        3550.294
PP_forward_file                                        18885.000                        1765.791
PP_forward_memory                                      12270.000                        1436.956
PP_back_file_first                                     14760.000                        2693.606
PP_back_file                                           14645.000                        2621.722
PP_back_memory                                         13975.000                        2574.534

Currently, #1236 doesn't make a lot of difference (PP_forward_file_first is slower, PP_back_file_first is faster. No idea why).

Template files (had to rename as .txt for GitHub upload)
DMI4.hs.txt
DMI4.hv.txt
DMI4_8v.hs.txt, i.e. 8 views only

Running OSEM is still slow with subsets due to GPU projector set-up. That needs some thought.

@KrisThielemans
Collaborator Author

KrisThielemans commented Aug 29, 2023

One factor slowing down the parallelproj projections is the call to truncate_rim to restrict to cylindrical FOV, see here. @gschramm do we still need that?

In any case, loops in truncate_rim should be rewritten to only loop x over voxels outside the radius, as opposed to all.
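
A rough Python sketch of that idea on a single 2D slice, assuming a circle centred on the middle of the slice. The function names and the list-of-lists image are hypothetical and do not reflect the actual truncate_rim code; the point is only that each row needs to touch the voxels outside the chord, not every voxel.

```python
import math


def truncate_rim_all(image, radius):
    """Reference version: test every voxel against the radius."""
    n = len(image)
    c = (n - 1) / 2.0
    for y in range(n):
        for x in range(n):
            if (x - c) ** 2 + (y - c) ** 2 > radius ** 2:
                image[y][x] = 0.0


def truncate_rim_outside_only(image, radius):
    """Per row, find the chord inside the circle and zero only the rest."""
    n = len(image)
    c = (n - 1) / 2.0
    for y in range(n):
        d2 = radius ** 2 - (y - c) ** 2
        if d2 < 0:
            first_in, last_in = n, n - 1  # whole row lies outside the circle
        else:
            half = math.sqrt(d2)
            first_in = max(0, math.ceil(c - half))
            last_in = min(n - 1, math.floor(c + half))
        for x in range(0, first_in):      # voxels left of the chord
            image[y][x] = 0.0
        for x in range(last_in + 1, n):   # voxels right of the chord
            image[y][x] = 0.0
```

Both versions should give the same result as long as no voxel centre sits exactly on the radius, where floating-point rounding could make them differ.
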

@gschramm
Contributor

Very interesting comparison. Thanks a lot Kris! How do I interpret PP_forward_file_first, PP_forward_file, and PP_forward_memory?

@gschramm
Contributor

> One factor slowing down the parallelproj projections is the call to truncate_rim to restrict to cylindrical FOV, see here. @gschramm do we still need that?
>
> In any case, loops in truncate_rim should be rewritten to only loop x over voxels outside the radius, as opposed to all.

I don't remember 100% why we added that. The projectors themselves shouldn't care about the FOV.

@KrisThielemans
Collaborator Author

> How do I interpret PP_forward_file_first, PP_forward_file, and PP_forward_memory?

Sorry. "first" means the operation is run once straight after construction + set_up, as the underlying object will change. It is then repeated for a number of runs, and the average timing of those runs is reported. "file" means the result is written to file; "memory" means it is not. Looks like I have a fast SSD... (The difference between the first and subsequent runs surprised me. I didn't check why, or whether it's a bug in my timings!)
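
The scheme can be sketched roughly as follows in Python. This is not the timing utility added in this PR; time_once, benchmark and make_lazy_projection are names invented for the illustration, and the "projection" is a toy function whose first call does some one-off work.

```python
import time


def time_once(func):
    """Return (cpu_ms, wall_ms) for a single call of func."""
    cpu0, wall0 = time.process_time(), time.perf_counter()
    func()
    cpu1, wall1 = time.process_time(), time.perf_counter()
    return (cpu1 - cpu0) * 1000.0, (wall1 - wall0) * 1000.0


def benchmark(func, runs=5):
    """Time the first call separately, then average `runs` further calls."""
    first = time_once(func)
    repeats = [time_once(func) for _ in range(runs)]
    mean_cpu = sum(r[0] for r in repeats) / runs
    mean_wall = sum(r[1] for r in repeats) / runs
    return {"first": first, "mean": (mean_cpu, mean_wall)}


def make_lazy_projection():
    """Hypothetical operation whose first call does extra one-off work."""
    cache = {}

    def project():
        if "table" not in cache:
            cache["table"] = [i * i for i in range(200_000)]  # one-off set-up
        return sum(cache["table"][:1000])

    return project


result = benchmark(make_lazy_projection())
print(result)
```

Reporting the first call separately keeps any one-off cost (caching, allocation) out of the averaged figure.
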

(Note that it is set_up that computes the end-points. They are then stored in a std::vector.)

@KrisThielemans
Collaborator Author

One good thing to add would be an OSEM update to the timings. This should be done, but it might be different from what @gschramm reports, as we use the "additive term" normally (I guess I could run without).

@gschramm
Contributor

Hi Kris,
I just checked with a minimal forward and back projection of a single LOR and parallelproj v1.5.
I don't see a reason why the limitation to the cylindrical FOV is needed.
The fwd and back projection between the LOR start and end point works as expected, even if the
start / end points are within the image (I remember someone reported an issue related to that).

Georg

@KrisThielemans
Collaborator Author

Running without truncate_rim actually gives very little difference. I also saw a faster set_up for the "full" case, which I guess means the computer was busy doing some other stuff in the previous run.

PP_projector_setup    24660.000                        4562.182

This is of course always going to be tricky. (Not sure if people ever report a "minimum wall clock" time to avoid this.)
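
For what it's worth, taking the minimum over several repeats is a common way to reduce the influence of other load on the machine; the Python timeit documentation recommends looking at the min() of the repeats for that reason. A minimal sketch, with a toy work function standing in for the projector set-up:

```python
import timeit


def work():
    """Toy workload standing in for the operation being timed."""
    return sum(i * i for i in range(50_000))


# Five independent repeats of one call each; wall-clock seconds per repeat.
samples = timeit.repeat(work, repeat=5, number=1)
best_ms = min(samples) * 1000.0
mean_ms = sum(samples) / len(samples) * 1000.0
print(f"min {best_ms:.3f} ms, mean {mean_ms:.3f} ms")
```

The minimum is less sensitive to a busy machine than the mean, at the cost of hiding genuine run-to-run variation.
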

@KrisThielemans KrisThielemans changed the title from "add utility to perform timings" to "add utility to perform timings and some performance improvements" Aug 30, 2023
@markus-jehl
Contributor

Here are the timings on my machine (Intel Xeon CPU E5-2699 v3@2.30GHz; 18 cores; 256GB RAM; NVIDIA Quadro M4000; WSL2 with clang 14.0.0-1ubuntu1 and nvcc V12.0.140) for different templates. Unfortunately I still haven't found a solution for the extremely slow caching of the system matrix that happens in the first projection (most likely caused by WSL2/Docker memory allocation), and don't have a GPU on the native Ubuntu system to compare timings there. Interestingly, though, it doesn't appear to be as bad for the DMI geometry!

DMI4_8v:

Test                                                    CPU (ms)                 Wall-clock (ms)
PMRT_projector_setup                                    1235.000                          38.053
PMRT_forward_file_first                                45360.000                        2621.903
PMRT_forward_file                                      54150.000                        1663.641
PMRT_forward_memory                                    54820.000                        1568.988
PMRT_back_file_first                                   82790.000                        6372.559
PMRT_back_file                                         91310.000                        2984.327
PMRT_back_memory                                       93980.000                        3041.237
PP_projector_setup                                      2730.000                         329.540
PP_forward_file_first                                  30450.000                        1422.549
PP_forward_file                                        14214.000                         841.846
PP_forward_memory                                      11193.000                         781.549
PP_back_file_first                                     24780.000                        6539.762
PP_back_file                                           25674.000                        3342.135
PP_back_memory                                         25368.000                        3125.126

DMI4:

Test                                                    CPU (ms)                 Wall-clock (ms)
PMRT_projector_setup                                    1816.700                          54.978
PMRT_forward_file_first                              1306680.000                       82427.485
PMRT_forward_file                                    2175610.000                       63507.069
PMRT_forward_memory                                  1736600.000                       48904.101
PMRT_back_file_first                                 2843670.000                       83610.838
PMRT_back_file                                       2849450.000                       80436.218
PMRT_back_memory                                     2861700.000                       80760.121
PP_projector_setup                                     50840.000                        7097.105
PP_forward_file_first                                 120530.000                       15386.625
PP_forward_file                                       131839.000                       15608.557
PP_forward_memory                                      59788.000                       13018.664
PP_back_file_first                                     55170.000                       29732.630
PP_back_file                                          141154.000                       28716.604
PP_back_memory                                         96673.000                       27483.875

NeuroLF:

Test                                                    CPU (ms)                 Wall-clock (ms)
PMRT_projector_setup                                    1412.600                          44.311
PMRT_forward_file_first                              2860280.000                     1407126.362
PMRT_forward_file                                     386400.000                       11259.394
PMRT_forward_memory                                   399230.000                       11213.618
PMRT_back_file_first                                  632680.000                       19305.444
PMRT_back_file                                        604560.000                       17057.971
PMRT_back_memory                                      606780.000                       17079.259
PP_projector_setup                                    218550.000                       27038.826
PP_forward_file_first                                 114650.000                        6001.208
PP_forward_file                                       108945.000                        5852.599
PP_forward_memory                                      42513.000                        3514.213
PP_back_file_first                                     26340.000                        6817.084
PP_back_file                                           56988.000                        5708.406
PP_back_memory                                         58234.000                        5847.578

@KrisThielemans
Collaborator Author

Thanks @markus-jehl. Seems that my system is about twice as fast as yours, also for parallelproj (could be that its performance is dominated by the CPU as well). Quite weird about your NeuroLF PMRT "first run" timings. Maybe you could compare memory usage.

Aside from timing other things, I think we'll need some client code to be able to make some nice plots for different systems etc, as this will soon get unmanageable.
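
A possible first step for such client code, sketched in Python: parse the three-column text output shown in this thread into a dictionary that a plotting script could consume. It assumes each data row is "name CPU wall-clock" separated by whitespace; parse_timings is a hypothetical helper, not part of this PR.

```python
def parse_timings(text):
    """Parse 'name cpu_ms wall_ms' lines into {name: (cpu_ms, wall_ms)}."""
    results = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank lines and anything that is not a data row
        name, cpu, wall = parts
        try:
            results[name] = (float(cpu), float(wall))
        except ValueError:
            continue  # three words, but not two numbers
    return results


sample = """\
PP_projector_setup                                       810.000                         116.063
PP_forward_file                                          970.000                          68.710
"""
table = parse_timings(sample)
print(table)
```

With one such dictionary per machine, comparing systems becomes a matter of looping over the shared keys.
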

@KrisThielemans
Collaborator Author

This seems clean enough to merge now. We can always add some more later.

I've added a log-likelihood run (set-up: currently essentially computation of sensitivity; "grad_no_sens" essentially the MLEM computation back(data/forw(image))) and a few options.
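
To spell out the two quantities on a toy problem, here is a small Python sketch with a 3-bin, 2-voxel system matrix. It assumes the sensitivity is the back-projection of ones (no normalisation or attenuation factors) and ignores the additive term; forward and back are simple stand-ins written for this example, not STIR's projectors.

```python
def forward(A, image):
    """forw(image): one value per measurement bin."""
    return [sum(a * v for a, v in zip(row, image)) for row in A]


def back(A, data):
    """back(data): one value per voxel (transpose of forward)."""
    n_vox = len(A[0])
    return [sum(A[i][j] * data[i] for i in range(len(A))) for j in range(n_vox)]


A = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]  # toy system matrix (assumption)
image = [2.0, 1.0]
data = [4.0, 3.0, 2.0]

sensitivity = back(A, [1.0] * len(A))  # back(ones)
ratio = [d / f for d, f in zip(data, forward(A, image))]
grad_no_sens = back(A, ratio)          # back(data / forw(image))
print(sensitivity, grad_no_sens)
```

Each evaluation of grad_no_sens costs one forward and one back projection, which is why it is a natural unit for timing.
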

@KrisThielemans KrisThielemans merged commit 3159ebc into UCL:master Sep 9, 2023
6 checks passed
@KrisThielemans KrisThielemans deleted the timings branch September 9, 2023 23:35
@KrisThielemans KrisThielemans added this to the v5.2 milestone Oct 28, 2023

3 participants