-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Profiling #69
Comments
Profiling the propagation kernel on the cpuPreparation: fiedl/clsim@7d7414d [2018-05-26 13:26:30] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ ./run.rb --number-of-photons=1e4 --number-of-runs=1 --number-of-parallel-runs=1 --cpu
[2018-05-26 13:55:23] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ grep PROFILING tmp/clsim.log > tmp/profiling.txt |
How does this compare to standard clsim?Standard clsim handles the medium propagation differently. The original source can be found here: https://github.com/fiedl/clsim/blob/icesim-v05-00-07/resources/kernels/propagation_kernel.c.cl I've copied the original source to The method can be switched from the // propagation_kernel.c.cl
// ...
floating_t distancePropagated = 0;
floating_t distanceToAbsorption = 0;
// apply_propagation_through_different_media(
// photonPosAndTime,
// photonDirAndWlen,
// #ifdef HOLE_ICE
// numberOfCylinders,
// cylinderPositionsAndRadii,
// cylinderScatteringLengths,
// cylinderAbsorptionLengths,
// #endif
// &sca_step_left,
// &abs_lens_left,
// &distancePropagated,
// &distanceToAbsorption
// );
apply_propagation_through_different_media_with_standard_clsim(
photonPosAndTime,
photonDirAndWlen,
&sca_step_left,
&abs_lens_left,
&distancePropagated,
&distanceToAbsorption
); |
Run the profiling on the cpu with standard clsim: [2018-05-26 15:44:19] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ time ./run.rb --number-of-photons=1e4 --number-of-runs=1 --number-of-parallel-runs=1 --cpu
# Ctrl-C
9794,71s user 7443,22s system 99% cpu 4:48:27,72 total I cancelled the job, because it ran forever. Also, it might not be comparable to the previous run, because this run has less profiling outputs, i.e. takes a shorter amount of time for profiling. Nevertheless, try to get some histograms out of that run: [2018-05-26 20:37:28] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ grep PROFILING tmp/clsim.log |grep -v INFO > tmp/profiling.txt
▶ ../lib/plot_profiling_results.py tmp/profiling.txt |
In order to make it comparable, profile hole-ice-2018 again, but without intermediate profiling steps. Profile just the whole scattering step and the portion of it that goes into the new hole-ice code. [2018-05-26 21:23:59] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ time ./run.rb --number-of-photons=1e3 --number-of-runs=1 --number-of-parallel-runs=1 --cpu > tmp/output.log
236,82s user 248,03s system 193% cpu 4:10,93 total Some statistics: [2018-05-26 21:28:34] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ grep "kernel statistics" tmp/clsim.log
PROFILINFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 1.69622e+08 nanoseconds/photon, 1024 photons (util: 95%) (Apple Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz) (I3CLSimStepToPhotonConverterOpenCL.cxx:1212 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event &, const boost::posix_time::ptime &, uint64_t, bool, const std::string &, const std::string &, uint64_t))
[2018-05-26 21:33:31] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ grep PROFILING tmp/clsim.log |grep -v INFO > tmp/profiling.txt
▶ ../lib/plot_profiling_results.py tmp/profiling.txt |
And, to compare it properly, the same statistics with standard clsim: [2018-05-26 21:35:21] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ rm tmp/*
▶ time ./run.rb --number-of-photons=1e3 --number-of-runs=1 --number-of-parallel-runs=1 --cpu > tmp/output.log
194,35s user 212,68s system 184% cpu 3:40,22 total
[2018-05-26 21:41:04] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ grep PROFILING tmp/clsim.log |grep -v INFO > tmp/profiling.txt
▶ ../lib/plot_profiling_results.py tmp/profiling.txt |
|
In order to get an idea how much the profiling itself costs, I'm adding a second profiling output at the same position in the code: clock_t t2 = clock();
clock_t t1 = clock();
apply_propagation_through_different_media(
photonPosAndTime,
photonDirAndWlen,
#ifdef HOLE_ICE
numberOfCylinders,
cylinderPositionsAndRadii,
cylinderScatteringLengths,
cylinderAbsorptionLengths,
#endif
&sca_step_left,
&abs_lens_left,
&distancePropagated,
&distanceToAbsorption
);
printf("PROFILING apply_propagation_through_different_media %lu\n", clock() - t1);
printf("PROFILING apply_propagation_through_different_media_2 %lu\n", clock() - t2); Without this extra output, the clock count has been 2.50 +- 2.61 (#69 (comment)). With this extra output, the clock count is:
This is really strange. Thus, I can't really tell from this how much the performance measurement itself costs. Repeating the same measurement does confirm the strange result. |
Another attempt to measure the performance cost of the profiling itself: Drop the inner clock counter and measure just the overall scattering step. //clock_t t1 = clock();
apply_propagation_through_different_media(
photonPosAndTime,
photonDirAndWlen,
#ifdef HOLE_ICE
numberOfCylinders,
cylinderPositionsAndRadii,
cylinderScatteringLengths,
cylinderAbsorptionLengths,
#endif
&sca_step_left,
&abs_lens_left,
&distancePropagated,
&distanceToAbsorption
);
//printf("PROFILING apply_propagation_through_different_media %lu\n", clock() - t1); When both the t0 counter and the t1 counter are active, the outer measurement is 30.02 +- 29.37. When only the t0 counter is active, the outer measurement is 3.00 +- 3.79. These results suggest that the dominant (most expensive) operation is the |
Repeat the same measurement of inner and outer clock counter, but print the result outside the counter. // start profiling the simulation step here
clock_t t0 = clock();
floating_t sca_step_left = -my_log(RNG_CALL_UNIFORM_OC);
floating_t distancePropagated = 0;
floating_t distanceToAbsorption = 0;
clock_t t1 = clock();
clock_t t2 = clock();
apply_propagation_through_different_media(
photonPosAndTime,
photonDirAndWlen,
#ifdef HOLE_ICE
numberOfCylinders,
cylinderPositionsAndRadii,
cylinderScatteringLengths,
cylinderAbsorptionLengths,
#endif
&sca_step_left,
&abs_lens_left,
&distancePropagated,
&distanceToAbsorption
);
clock_t t3 = clock();
clock_t t4 = clock();
// ...
clock_t t5 = clock();
printf("PROFILING propagation_kernel_simulation_step %lu\n", t5 - t0);
printf("PROFILING apply_propagation_through_different_media %lu\n", t3 - t2);
printf("PROFILING apply_propagation_through_different_media_outer %lu\n", t4 - t1); This suggests that the performance measurement itself costs about 4.2 ticks. Thus, the overall scattering step is 3.5 ticks, including the the hole-ice-2018 code, which is 2.8 ticks. If this is correct, the hole-ice-2018 code costs 80% of the total scattering step. |
Repeating the same for standard clsim: clock_t t1 = clock();
clock_t t2 = clock();
apply_propagation_through_different_media_with_standard_clsim(
photonPosAndTime,
photonDirAndWlen,
&sca_step_left,
&abs_lens_left,
&distancePropagated,
&distanceToAbsorption
);
clock_t t3 = clock();
clock_t t4 = clock(); With standard clsim, the performance measurement again is about 4.1 ticks. The overall scattering step is 3.3 ticks, the standard-clsim media propagation code is 2.5 ticks, which is 76% of the total scattering step. |
Comparing the hole-ice-2018 algorithm to the standard-clsim algorithm: When subtracting the performance measurement, the total scattering step costs:
Thus, the hole-ice-2018 code costs about 6% more cpu time than standard clsim. |
GPU profilinghole-ice-2018 code[2018-05-26 23:45:26] fiedl@wgs16 /afs/ifh.de/group/amanda/scratch/fiedl/clsim
▶ git pull
[2018-05-26 23:45:57] fiedl@wgs16 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study
▶ git pull
[2018-05-26 23:45:47] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FiringRange
▶ rm tmp/*
▶ time ./run.rb --number-of-photons=1e3 --number-of-runs=1 --number-of-parallel-runs=1 > tmp/output.log
5.14s user 3.93s system 12% cpu 1:15.24 total
[2018-05-26 23:49:15] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FiringRange
▶ grep PROFILING tmp/clsim.log |grep -v INFO > tmp/profiling.txt
▶ grep "kernel statistics" tmp/clsim.log
INFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 1.54137e+06 nanoseconds/photon, 1005 photons (util: 4%) (NVIDIA CUDA Tesla K80) (I3CLSimStepToPhotonConverterOpenCL.cxx:1212 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event &, const boost::posix_time::ptime &, uint64_t, bool, const std::string &, const std::string &, uint64_t))
[2018-05-26 23:50:31] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FiringRange
▶ mv tmp $SCRATCH/hole-ice-study/results/profiling/hole-ice-2018_gpu
[2018-05-27 00:00:50] fiedl@fiedl-mbp ~/hole-ice-study/scripts/lib master ⚡
▶ python plot_profiling_results.py ~/hole-ice-study/results/profiling/hole-ice-2018_gpu/profiling.txt The performance measurement costs 117 ticks. The medium propagation costs 38022 ticks, the total scattering step costs 41060 ticks. The medium propagation makes 92.6% of the total scattering step gpu time. |
these results are dominated by the cost of the `printf` rather than the actual work. See: #69 (comment)
GPU profiling for standard clsim[2018-05-26 23:53:19] fiedl@wgs16 /afs/ifh.de/group/amanda/scratch/fiedl/clsim
▶ git pull
[2018-05-26 23:51:34] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FiringRange
▶ mkdir tmp
▶ time ./run.rb --number-of-photons=1e3 --number-of-runs=1 --number-of-parallel-runs=1 > tmp/output.log
5.58s user 3.90s system 8% cpu 1:49.88 total
[2018-05-26 23:57:04] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FiringRange
▶ grep PROFILING tmp/clsim.log |grep -v INFO > tmp/profiling.txt
▶ grep "kernel statistics" tmp/clsim.log
INFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 1.04102e+06 nanoseconds/photon, 1005 photons (util: 2%) (NVIDIA CUDA Tesla K80) (I3CLSimStepToPhotonConverterOpenCL.cxx:1212 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event &, const boost::posix_time::ptime &, uint64_t, bool, const std::string &, const std::string &, uint64_t))
[2018-05-26 23:57:19] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FiringRange
▶ mv tmp $SCRATCH/hole-ice-study/results/profiling/standard-clsim_gpu
[2018-05-27 00:03:38] fiedl@fiedl-mbp ~/hole-ice-study/scripts/lib master ⚡
▶ python plot_profiling_results.py ~/hole-ice-study/results/profiling/standard-clsim_gpu/profiling.txt Performance measurement itself costs 114 ticks. The standard-clsim media propagation costs 1773 ticks, the total scattering step costs 5571 ticks. The media propagation makes 32% of the total scattering step when running standard clsim on the gpu. On the gpu, the hole-ice-2018 code costs 7.37 as much time as standard clsim. |
Run detailed gpu profiling for the hole-ice-2018 code. [2018-05-27 00:26:25] fiedl@wgs16 /afs/ifh.de/group/amanda/scratch/fiedl/clsim
▶ git pull
[2018-05-27 00:27:41] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FiringRange
▶ mkdir tmp
▶ time ./run.rb --number-of-photons=1e3 --number-of-runs=1 --number-of-parallel-runs=1 > tmp/output.log
1.95s user 1.44s system 3% cpu 1:31.76 total
[2018-05-27 00:29:47] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FiringRange
▶ grep PROFILING tmp/clsim.log |grep -v INFO > tmp/profiling.txt
▶ grep "kernel statistics" tmp/clsim.log
INFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 1.96359e+06 nanoseconds/photon, 1005 photons (util: 10%) (NVIDIA CUDA Tesla K80) (I3CLSimStepToPhotonConverterOpenCL.cxx:1212 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event &, const boost::posix_time::ptime &, uint64_t, bool, const std::string &, const std::string &, uint64_t))
[2018-05-27 00:30:17] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FiringRange
▶ mv tmp $SCRATCH/hole-ice-study/results/profiling/hole-ice-2018_detail_gpu
[2018-05-27 00:38:12] fiedl@fiedl-mbp ~/hole-ice-study/scripts/lib master ⚡
▶ python plot_profiling_results.py ~/hole-ice-study/results/profiling/hole-ice-2018_detail_gpu/profiling.txt This is pretty unexpected. The start code that has nothing to do with the hole ice is most expensive. The relevant parts cost 2740 ticks. (Standard clsim medium propagation costs 1773 ticks.) |
Why is the start code so expensive? Profile it in more detail: clock_t t0 = clock();
int number_of_medium_changes = 0;
clock_t t01 = clock();
floating_t distances_to_medium_changes[MEDIUM_LAYERS] = {0.0};
clock_t t02 = clock();
int currentPhotonLayer = min(max(findLayerForGivenZPos(photonPosAndTime.z), 0), MEDIUM_LAYERS-1);
clock_t t03 = clock();
floating_t local_scattering_lengths[MEDIUM_LAYERS] = {getScatteringLength(currentPhotonLayer, photonDirAndWlen.w)};
clock_t t04 = clock();
floating_t local_absorption_lengths[MEDIUM_LAYERS] = {getAbsorptionLength(currentPhotonLayer, photonDirAndWlen.w)};
clock_t t05 = clock();
const floating_t photonRange = *sca_step_left * local_scattering_lengths[0];
clock_t t1 = clock();
// ...
printf("PROFILING start %lu\n", t1 - t0);
printf("PROFILING start_01 %lu\n", t01 - t0);
printf("PROFILING start_02 %lu\n", t02 - t01);
printf("PROFILING start_03 %lu\n", t03 - t02);
printf("PROFILING start_04 %lu\n", t04 - t03);
printf("PROFILING start_05 %lu\n", t05 - t04);
printf("PROFILING start_1 %lu\n", t1 - t05);
printf("PROFILING add_ice_layers %lu\n", t2 - t1);
printf("PROFILING add_hole_ice %lu\n", t3 - t2);
printf("PROFILING sort %lu\n", t4 - t3);
printf("PROFILING loop_over_media %lu\n", t5 - t4);
printf("PROFILING total %lu\n", t5 - t0);
|
From the detailed profiling of the start code, one can see that these three commands are the most expensive ones in the hole-ice-2018 code: floating_t distances_to_medium_changes[MEDIUM_LAYERS] = {0.0};
floating_t local_scattering_lengths[MEDIUM_LAYERS] = {getScatteringLength(currentPhotonLayer, photonDirAndWlen.w)};
floating_t local_absorption_lengths[MEDIUM_LAYERS] = {getAbsorptionLength(currentPhotonLayer, photonDirAndWlen.w)}; As the first and the second have almost the same performance cost, the costly part is not |
Which part of the new code does cost how much computation time?
Some considerations
What needs to be done
Results
The text was updated successfully, but these errors were encountered: