Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Profiling #69

Open
4 tasks done
fiedl opened this issue May 26, 2018 · 18 comments
Open
4 tasks done

Profiling #69

fiedl opened this issue May 26, 2018 · 18 comments

Comments

@fiedl
Copy link
Owner

fiedl commented May 26, 2018

Which part of the new code does cost how much computation time?

Some considerations

  • Profiling within the kernel code and outputting the results is very imperformant itself and thus can only be done for a small number of photons, whereas the real performance issues arise when there are many photons as in flasher simulations (Is flasher study performance bad? #68).
  • In the GPU kernel, one can't just ask for the current time. But one can use the GPU clock. The clock counter are not nanoseconds, but are just GPU cycle ticks. Nevertheless, this gives an understanding of the relative amount of computation spent at the several parts of the code.

What needs to be done

  • Profiling the new medium propagation code: Layers, hole ice, sorting, medium loop.
  • What portion of the total propagation is spent in the new code?
  • How does that compare to standard clsim?
  • Are there differences when using GPU vs. CPU?

Results

bildschirmfoto 2018-05-27 um 01 19 02

image

bildschirmfoto 2018-08-02 um 17 34 45

@fiedl
Copy link
Owner Author

fiedl commented May 26, 2018

Profiling the propagation kernel on the cpu

Preparation: fiedl/clsim@7d7414d

[2018-05-26 13:26:30] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ ./run.rb --number-of-photons=1e4 --number-of-runs=1 --number-of-parallel-runs=1 --cpu

[2018-05-26 13:55:23] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ grep PROFILING tmp/clsim.log > tmp/profiling.txt

@fiedl
Copy link
Owner Author

fiedl commented May 26, 2018

The individual steps of the medium propagation code make a small portion of the propagation simulation time.

[2018-05-26 13:56:45] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ ../lib/plot_profiling_results.py tmp/profiling.txt

bildschirmfoto 2018-05-26 um 14 59 26

bildschirmfoto 2018-05-26 um 14 59 31

bildschirmfoto 2018-05-26 um 15 00 43

fiedl added a commit that referenced this issue May 26, 2018
fiedl added a commit that referenced this issue May 26, 2018
@fiedl
Copy link
Owner Author

fiedl commented May 26, 2018

How does this compare to standard clsim?

Standard clsim handles the medium propagation differently. The original source can be found here: https://github.com/fiedl/clsim/blob/icesim-v05-00-07/resources/kernels/propagation_kernel.c.cl

I've copied the original source to clsim/resources/kernels/lib/propagation_through_media/standard_clsim.c.

The method can be switched from the propagation_kernel.c.cl by applying the desired method and commenting out the other one:

// propagation_kernel.c.cl
// ...

floating_t distancePropagated = 0;
floating_t distanceToAbsorption = 0;

// apply_propagation_through_different_media(
//   photonPosAndTime,
//   photonDirAndWlen,
//   #ifdef HOLE_ICE
//     numberOfCylinders,
//     cylinderPositionsAndRadii,
//     cylinderScatteringLengths,
//     cylinderAbsorptionLengths,
//   #endif
//   &sca_step_left,
//   &abs_lens_left,
//   &distancePropagated,
//   &distanceToAbsorption
// );

apply_propagation_through_different_media_with_standard_clsim(
  photonPosAndTime,
  photonDirAndWlen,
  &sca_step_left,
  &abs_lens_left,
  &distancePropagated,
  &distanceToAbsorption
);

@fiedl
Copy link
Owner Author

fiedl commented May 26, 2018

Run the profiling on the cpu with standard clsim:

[2018-05-26 15:44:19] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ time ./run.rb --number-of-photons=1e4 --number-of-runs=1 --number-of-parallel-runs=1 --cpu

# Ctrl-C
9794,71s user 7443,22s system 99% cpu 4:48:27,72 total

I cancelled the job, because it ran forever. Also, it might not be comparable to the previous run, because this run has less profiling outputs, i.e. takes a shorter amount of time for profiling.

Nevertheless, try to get some histograms out of that run:

[2018-05-26 20:37:28] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ grep PROFILING tmp/clsim.log |grep -v INFO > tmp/profiling.txt
▶ ../lib/plot_profiling_results.py tmp/profiling.txt

bildschirmfoto 2018-05-26 um 20 41 42

bildschirmfoto 2018-05-26 um 20 41 52

bildschirmfoto 2018-05-26 um 20 42 54

@fiedl
Copy link
Owner Author

fiedl commented May 26, 2018

In order to make it comparable, profile hole-ice-2018 again, but without intermediate profiling steps. Profile just the whole scattering step and the portion of it that goes into the new hole-ice code.

[2018-05-26 21:23:59] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ time ./run.rb --number-of-photons=1e3 --number-of-runs=1 --number-of-parallel-runs=1 --cpu > tmp/output.log

236,82s user 248,03s system 193% cpu 4:10,93 total

Some statistics:

[2018-05-26 21:28:34] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ grep "kernel statistics" tmp/clsim.log
PROFILINFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 1.69622e+08 nanoseconds/photon, 1024 photons (util: 95%) (Apple Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz)  (I3CLSimStepToPhotonConverterOpenCL.cxx:1212 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event &, const boost::posix_time::ptime &, uint64_t, bool, const std::string &, const std::string &, uint64_t))

[2018-05-26 21:33:31] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ grep PROFILING tmp/clsim.log |grep -v INFO > tmp/profiling.txt
▶ ../lib/plot_profiling_results.py tmp/profiling.txt

bildschirmfoto 2018-05-26 um 22 17 29

bildschirmfoto 2018-05-26 um 22 17 48

@fiedl
Copy link
Owner Author

fiedl commented May 26, 2018

And, to compare it properly, the same statistics with standard clsim:

[2018-05-26 21:35:21] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ rm tmp/*time ./run.rb --number-of-photons=1e3 --number-of-runs=1 --number-of-parallel-runs=1 --cpu > tmp/output.log

194,35s user 212,68s system 184% cpu 3:40,22 total

[2018-05-26 21:41:04] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ grep PROFILING tmp/clsim.log |grep -v INFO > tmp/profiling.txt
▶ ../lib/plot_profiling_results.py tmp/profiling.txt

bildschirmfoto 2018-05-26 um 22 06 56

bildschirmfoto 2018-05-26 um 22 07 11

@fiedl
Copy link
Owner Author

fiedl commented May 26, 2018

  • add mean and standard deviation to these plots.
  • calculate how much the profiling itself costs and subtract this

@fiedl
Copy link
Owner Author

fiedl commented May 26, 2018

In order to get an idea how much the profiling itself costs, I'm adding a second profiling output at the same position in the code:

        clock_t t2 = clock();
        clock_t t1 = clock();
        apply_propagation_through_different_media(
          photonPosAndTime,
          photonDirAndWlen,
          #ifdef HOLE_ICE
            numberOfCylinders,
            cylinderPositionsAndRadii,
            cylinderScatteringLengths,
            cylinderAbsorptionLengths,
          #endif
          &sca_step_left,
          &abs_lens_left,
          &distancePropagated,
          &distanceToAbsorption
        );
        printf("PROFILING apply_propagation_through_different_media %lu\n", clock() - t1);
        printf("PROFILING apply_propagation_through_different_media_2 %lu\n", clock() - t2);

Without this extra output, the clock count has been 2.50 +- 2.61 (#69 (comment)).

With this extra output, the clock count is:

  • 3.16 +- 3.65 for the inner clock counter
  • 32.59 +- 32.70 for the outer clock counter
  • 56.74 +- 44.80 for the whole scattering step clock counter

This is really strange. Thus, I can't really tell from this how much the performance measurement itself costs. Repeating the same measurement does confirm the strange result.

@fiedl
Copy link
Owner Author

fiedl commented May 26, 2018

Another attempt to measure the performance cost of the profiling itself: Drop the inner clock counter and measure just the overall scattering step.

        //clock_t t1 = clock();
        apply_propagation_through_different_media(
          photonPosAndTime,
          photonDirAndWlen,
          #ifdef HOLE_ICE
            numberOfCylinders,
            cylinderPositionsAndRadii,
            cylinderScatteringLengths,
            cylinderAbsorptionLengths,
          #endif
          &sca_step_left,
          &abs_lens_left,
          &distancePropagated,
          &distanceToAbsorption
        );
        //printf("PROFILING apply_propagation_through_different_media %lu\n", clock() - t1);

When both the t0 counter and the t1 counter are active, the outer measurement is 30.02 +- 29.37.

When only the t0 counter is active, the outer measurement is 3.00 +- 3.79.

These results suggest that the dominant (most expensive) operation is the printf that outputs the performance measurement result, which costs about 25 ticks whereas the scattering step only costs about 5 ticks.

@fiedl
Copy link
Owner Author

fiedl commented May 26, 2018

Repeat the same measurement of inner and outer clock counter, but print the result outside the counter.

        // start profiling the simulation step here
        clock_t t0 = clock();

        floating_t sca_step_left = -my_log(RNG_CALL_UNIFORM_OC);
        floating_t distancePropagated = 0;
        floating_t distanceToAbsorption = 0;
        
        clock_t t1 = clock();
        clock_t t2 = clock();
        apply_propagation_through_different_media(
          photonPosAndTime,
          photonDirAndWlen,
          #ifdef HOLE_ICE
            numberOfCylinders,
            cylinderPositionsAndRadii,
            cylinderScatteringLengths,
            cylinderAbsorptionLengths,
          #endif
          &sca_step_left,
          &abs_lens_left,
          &distancePropagated,
          &distanceToAbsorption
        );
        clock_t t3 = clock();
        clock_t t4 = clock();

        // ...

        clock_t t5 = clock();
        printf("PROFILING propagation_kernel_simulation_step %lu\n", t5 - t0);
        printf("PROFILING apply_propagation_through_different_media %lu\n", t3 - t2);
        printf("PROFILING apply_propagation_through_different_media_outer %lu\n", t4 - t1);

bildschirmfoto 2018-05-26 um 23 20 35

This suggests that the performance measurement itself costs about 4.2 ticks. Thus, the overall scattering step is 3.5 ticks, including the the hole-ice-2018 code, which is 2.8 ticks. If this is correct, the hole-ice-2018 code costs 80% of the total scattering step.

@fiedl
Copy link
Owner Author

fiedl commented May 26, 2018

Repeating the same for standard clsim:

        clock_t t1 = clock();
        clock_t t2 = clock();
        apply_propagation_through_different_media_with_standard_clsim(
          photonPosAndTime,
          photonDirAndWlen,
          &sca_step_left,
          &abs_lens_left,
          &distancePropagated,
          &distanceToAbsorption
        );
        clock_t t3 = clock();
        clock_t t4 = clock();

bildschirmfoto 2018-05-26 um 23 32 16

With standard clsim, the performance measurement again is about 4.1 ticks. The overall scattering step is 3.3 ticks, the standard-clsim media propagation code is 2.5 ticks, which is 76% of the total scattering step.

@fiedl
Copy link
Owner Author

fiedl commented May 26, 2018

Comparing the hole-ice-2018 algorithm to the standard-clsim algorithm:

When subtracting the performance measurement, the total scattering step costs:

  • hole-ice-2018: 3.5 ticks
  • standard clsim: 3.3 ticks

Thus, the hole-ice-2018 code costs about 6% more cpu time than standard clsim.

@fiedl
Copy link
Owner Author

fiedl commented May 26, 2018

GPU profiling

hole-ice-2018 code

[2018-05-26 23:45:26] fiedl@wgs16 /afs/ifh.de/group/amanda/scratch/fiedl/clsim
▶ git pull

[2018-05-26 23:45:57] fiedl@wgs16 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study
▶ git pull

[2018-05-26 23:45:47] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FiringRange
▶ rm tmp/*time ./run.rb --number-of-photons=1e3 --number-of-runs=1 --number-of-parallel-runs=1 > tmp/output.log
5.14s user 3.93s system 12% cpu 1:15.24 total

[2018-05-26 23:49:15] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FiringRange
▶ grep PROFILING tmp/clsim.log |grep -v INFO > tmp/profiling.txt
▶ grep "kernel statistics" tmp/clsim.log
INFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 1.54137e+06 nanoseconds/photon, 1005 photons (util: 4%) (NVIDIA CUDA Tesla K80)  (I3CLSimStepToPhotonConverterOpenCL.cxx:1212 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event &, const boost::posix_time::ptime &, uint64_t, bool, const std::string &, const std::string &, uint64_t))

[2018-05-26 23:50:31] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FiringRange
▶ mv tmp $SCRATCH/hole-ice-study/results/profiling/hole-ice-2018_gpu

[2018-05-27 00:00:50] fiedl@fiedl-mbp ~/hole-ice-study/scripts/lib master ⚡
▶ python plot_profiling_results.py ~/hole-ice-study/results/profiling/hole-ice-2018_gpu/profiling.txt

bildschirmfoto 2018-05-27 um 00 03 23

The performance measurement costs 117 ticks. The medium propagation costs 38022 ticks, the total scattering step costs 41060 ticks. The medium propagation makes 92.6% of the total scattering step gpu time.

fiedl added a commit that referenced this issue May 26, 2018
these results are dominated by the cost of the `printf` rather than the actual work.

See: #69 (comment)
fiedl added a commit to fiedl/clsim that referenced this issue May 26, 2018
@fiedl
Copy link
Owner Author

fiedl commented May 26, 2018

GPU profiling for standard clsim

[2018-05-26 23:53:19] fiedl@wgs16 /afs/ifh.de/group/amanda/scratch/fiedl/clsim
▶ git pull

[2018-05-26 23:51:34] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FiringRange
▶ mkdir tmp
▶ time ./run.rb --number-of-photons=1e3 --number-of-runs=1 --number-of-parallel-runs=1 > tmp/output.log

5.58s user 3.90s system 8% cpu 1:49.88 total

[2018-05-26 23:57:04] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FiringRange
▶ grep PROFILING tmp/clsim.log |grep -v INFO > tmp/profiling.txt
▶ grep "kernel statistics" tmp/clsim.log
INFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 1.04102e+06 nanoseconds/photon, 1005 photons (util: 2%) (NVIDIA CUDA Tesla K80)  (I3CLSimStepToPhotonConverterOpenCL.cxx:1212 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event &, const boost::posix_time::ptime &, uint64_t, bool, const std::string &, const std::string &, uint64_t))

[2018-05-26 23:57:19] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FiringRange
▶ mv tmp $SCRATCH/hole-ice-study/results/profiling/standard-clsim_gpu

[2018-05-27 00:03:38] fiedl@fiedl-mbp ~/hole-ice-study/scripts/lib master ⚡
▶ python plot_profiling_results.py ~/hole-ice-study/results/profiling/standard-clsim_gpu/profiling.txt

bildschirmfoto 2018-05-27 um 00 08 19

Performance measurement itself costs 114 ticks. The standard-clsim media propagation costs 1773 ticks, the total scattering step costs 5571 ticks. The media propagation makes 32% of the total scattering step when running standard clsim on the gpu.

On the gpu, the hole-ice-2018 code costs 7.37 as much time as standard clsim.

@fiedl
Copy link
Owner Author

fiedl commented May 26, 2018

Run detailed gpu profiling for the hole-ice-2018 code.

[2018-05-27 00:26:25] fiedl@wgs16 /afs/ifh.de/group/amanda/scratch/fiedl/clsim
▶ git pull

[2018-05-27 00:27:41] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FiringRange
▶ mkdir tmp
▶ time ./run.rb --number-of-photons=1e3 --number-of-runs=1 --number-of-parallel-runs=1 > tmp/output.log

1.95s user 1.44s system 3% cpu 1:31.76 total

[2018-05-27 00:29:47] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FiringRange
▶ grep PROFILING tmp/clsim.log |grep -v INFO > tmp/profiling.txt
▶ grep "kernel statistics" tmp/clsim.log
INFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 1.96359e+06 nanoseconds/photon, 1005 photons (util: 10%) (NVIDIA CUDA Tesla K80)  (I3CLSimStepToPhotonConverterOpenCL.cxx:1212 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event &, const boost::posix_time::ptime &, uint64_t, bool, const std::string &, const std::string &, uint64_t))

[2018-05-27 00:30:17] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FiringRange
▶ mv tmp $SCRATCH/hole-ice-study/results/profiling/hole-ice-2018_detail_gpu

[2018-05-27 00:38:12] fiedl@fiedl-mbp ~/hole-ice-study/scripts/lib master ⚡
▶ python plot_profiling_results.py ~/hole-ice-study/results/profiling/hole-ice-2018_detail_gpu/profiling.txt

bildschirmfoto 2018-05-27 um 00 37 44

This is pretty unexpected. The start code that has nothing to do with the hole ice is most expensive. The relevant parts cost 2740 ticks. (Standard clsim medium propagation costs 1773 ticks.)

@fiedl
Copy link
Owner Author

fiedl commented May 26, 2018

Why is the start code so expensive? Profile it in more detail:

  clock_t t0 = clock();
  int number_of_medium_changes = 0;
  clock_t t01 = clock();
  floating_t distances_to_medium_changes[MEDIUM_LAYERS] = {0.0};
  clock_t t02 = clock();
  int currentPhotonLayer = min(max(findLayerForGivenZPos(photonPosAndTime.z), 0), MEDIUM_LAYERS-1);
  clock_t t03 = clock();
  floating_t local_scattering_lengths[MEDIUM_LAYERS] = {getScatteringLength(currentPhotonLayer, photonDirAndWlen.w)};
  clock_t t04 = clock();
  floating_t local_absorption_lengths[MEDIUM_LAYERS] = {getAbsorptionLength(currentPhotonLayer, photonDirAndWlen.w)};
  clock_t t05 = clock();
  const floating_t photonRange = *sca_step_left * local_scattering_lengths[0];
  clock_t t1 = clock();

  // ...

  printf("PROFILING start %lu\n", t1 - t0);
  printf("PROFILING start_01 %lu\n", t01 - t0);
  printf("PROFILING start_02 %lu\n", t02 - t01);
  printf("PROFILING start_03 %lu\n", t03 - t02);
  printf("PROFILING start_04 %lu\n", t04 - t03);
  printf("PROFILING start_05 %lu\n", t05 - t04);
  printf("PROFILING start_1 %lu\n", t1 - t05);
  printf("PROFILING add_ice_layers %lu\n", t2 - t1);
  printf("PROFILING add_hole_ice %lu\n", t3 - t2);
  printf("PROFILING sort %lu\n", t4 - t3);
  printf("PROFILING loop_over_media %lu\n", t5 - t4);
  printf("PROFILING total %lu\n", t5 - t0);
[2018-05-27 00:52:53] fiedl@fiedl-mbp ~/hole-ice-study/scripts/lib master ⚡
▶ python plot_profiling_results.py ~/hole-ice-study/results/profiling/hole-ice-2018_start_gpu/profiling.txt

bildschirmfoto 2018-05-27 um 00 53 46

@fiedl
Copy link
Owner Author

fiedl commented May 26, 2018

From the detailed profiling of the start code, one can see that these three commands are the most expensive ones in the hole-ice-2018 code:

  floating_t distances_to_medium_changes[MEDIUM_LAYERS] = {0.0};
  floating_t local_scattering_lengths[MEDIUM_LAYERS] = {getScatteringLength(currentPhotonLayer, photonDirAndWlen.w)};
  floating_t local_absorption_lengths[MEDIUM_LAYERS] = {getAbsorptionLength(currentPhotonLayer, photonDirAndWlen.w)};

As the first and the second have almost the same performance cost, the costly part is not getScatteringLength but the allocation of the MEDIUM_LAYERS-sized array.

@fiedl
Copy link
Owner Author

fiedl commented May 26, 2018

bildschirmfoto 2018-05-27 um 01 18 14

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant