
loop tiling for step_curl using cache-oblivious recursive algorithm #1655

Merged 8 commits into NanoComp:master on Aug 11, 2021

Conversation

oskooi (Collaborator) commented Jul 5, 2021

Initial (and incomplete) attempt to extend the revamped API for STEP_CURL and STEP_UPDATE_EDHB introduced in #1653 to support loop tiling using a cache-oblivious recursive algorithm. The focus is to first verify the correctness of the overall approach.

This PR introduces a new member std::vector<grid_volume> gvs to the fields_chunk class which stores an array of the subdomains or "tiles" of each chunk, generated using a recursive-bisection algorithm implemented in split_into_tiles. The nice thing is that we can reuse the existing grid_volume class functions find_best_split and split_at_fraction from the cell-partitioning feature inside split_into_tiles. The tiles are generated once as part of the fields_chunk constructor, although in general they would need to be redefined every time chunk_connections_valid = false. A loop over the tiles is then defined around STEP_CURL inside fields_chunk::step_db (a similar approach would apply to STEP_UPDATE_EDHB inside fields_chunk::update_eh). For now, only Cartesian coordinates are supported; we can deal with cylindrical coordinates separately after this is working.

There is a hard-coded value base_nowned_min inside split_into_tiles which determines the size of the smallest subdomain in the recursion. In my preliminary benchmarking results, adjusting this value from 100 to 1000 changed the average time-stepping rate by ~20% for a test case involving an OLED simulation with 4 MPI processes on a single machine. I will post more results shortly. Unfortunately, this PR is causing some of the unit tests to fail, which probably needs to be resolved first.

stevengj (Collaborator) commented Jul 5, 2021

Should I merge #1653 first? That PR should be harmless in any case.

oskooi force-pushed the cache_oblivious_step_curl branch from 2263ac8 to 304a402 on July 5, 2021, 17:20
stevengj (Collaborator) commented Jul 7, 2021

See lectures 14 and 15 in these OCW lectures. This is about accessing the memory in the right order (improving "temporal locality"), and once you do that the cache hardware should do the rest.

(The base case of the recursion has nothing to do with cache — it is "recursion coarsening" to amortize the function-call overhead. You want to shrink the base case as much as possible until you start seeing performance degrade, which will eventually happen due to function-call overhead.)

Note also that this is only about improving serial (non-parallel) performance.

oskooi (Collaborator, Author) commented Jul 9, 2021

Even with the fix to the first argument of the STEP_CURL call inside fields_chunk::step_db, some of the tests are still failing. The strange thing is that changing the size of the smallest subdomain via the base_nowned_min parameter of split_into_tiles (e.g., from 1000 to 20000) causes different tests to fail. The tiles do span the entire chunk grid_volume (this is verified by the function check_tiles), which should mean that all the grid points are being updated as part of the call to fields::step_db, but the failing tests suggest otherwise.

oskooi (Collaborator, Author) commented Jul 9, 2021

For some reason, the CI tests all seem to be passing even though they are still failing on my local build (as I reported previously).

oskooi (Collaborator, Author) commented Jul 13, 2021

Based on some experiments, this PR seems to work only if the minimum tile volume base_nowned_min (i.e., the number of grid points in the smallest subdomain) is above ~12000, which is the default value in this PR. For values below 12000, two of the Python unit tests (test_cyl_ellipsoid.py, test_simulation.py) fail due to the fields blowing up. The tiles do indeed span the entire chunk, which means all the grid points are being updated. It is not clear why there is a qualitative change in behavior below this minimum tile size.

oskooi (Collaborator, Author) commented Jul 21, 2021

I have tried just splitting the chunk in two along its longest direction but it still doesn't seem to work (i.e., some of the unit tests are failing).

master

      STEP_CURL(the_f, cc, f_p, f_m, stride_p, stride_m,
                gv, gv.little_owned_corner0(cc), gv.big_corner(),
                Courant, dsig, s->sig[dsig],
                s->kap[dsig], s->siginv[dsig], f_u[cc][cmp], dsigu, s->sig[dsigu], s->kap[dsigu],
                s->siginv[dsigu], dt, s->conductivity[cc][d_c], s->condinv[cc][d_c],
                f_cond[cc][cmp]);

proposed change

      direction longest_axis = NO_DIRECTION;
      int num_in_longest_axis = 0;
      LOOP_OVER_DIRECTIONS(gv.dim, d) {
        if (gv.num_direction(d) > num_in_longest_axis) {
          longest_axis = d;
          num_in_longest_axis = gv.num_direction(d);
        }
      }
      int split_pt = gv.num_direction(longest_axis) / 2;

      grid_volume side_low;
      side_low.dim = gv.dim;
      side_low.a = gv.a;
      LOOP_OVER_DIRECTIONS(gv.dim, d) {
        side_low.set_num_direction(d, gv.num_direction(d));
      }
      side_low.set_origin(gv.little_owned_corner0(cc));
      side_low.set_num_direction(longest_axis, split_pt);

      grid_volume side_high;
      side_high.dim = gv.dim;
      side_high.a = gv.a;
      LOOP_OVER_DIRECTIONS(gv.dim, d) {
        side_high.set_num_direction(d, gv.num_direction(d));
      }
      side_high.set_origin(gv.little_owned_corner0(cc));
      side_high.shift_origin(longest_axis, split_pt * 2);
      side_high.set_num_direction(longest_axis, side_high.num_direction(longest_axis) - split_pt);

      ivec is = gv.little_owned_corner0(cc);
      ivec ie = min(side_low.big_corner() + gv.iyee_shift(cc), side_high.little_owned_corner0(cc));
      STEP_CURL(the_f, cc, f_p, f_m, stride_p, stride_m,
                gv, is, ie,
                Courant, dsig, s->sig[dsig],
                s->kap[dsig], s->siginv[dsig], f_u[cc][cmp], dsigu, s->sig[dsigu], s->kap[dsigu],
                s->siginv[dsigu], dt, s->conductivity[cc][d_c], s->condinv[cc][d_c],
                f_cond[cc][cmp]);

      is = max(side_low.big_corner() + gv.iyee_shift(cc), side_high.little_owned_corner0(cc));
      ie = gv.big_corner();
      STEP_CURL(the_f, cc, f_p, f_m, stride_p, stride_m,
                gv, is, ie,
                Courant, dsig, s->sig[dsig],
                s->kap[dsig], s->siginv[dsig], f_u[cc][cmp], dsigu, s->sig[dsigu], s->kap[dsigu],
                s->siginv[dsigu], dt, s->conductivity[cc][d_c], s->condinv[cc][d_c],
                f_cond[cc][cmp]);

stevengj (Collaborator):

We really should have a copy constructor for grid_volume
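For illustration, a minimal sketch of how the side_low setup in the snippet above would collapse with such a copy constructor (reusing gv, cc, longest_axis, and split_pt from that snippet, and assuming grid_volume becomes copyable this way):

      grid_volume side_low(gv);                            // copy dim, a, and per-direction sizes from gv
      side_low.set_origin(gv.little_owned_corner0(cc));    // then adjust only the origin
      side_low.set_num_direction(longest_axis, split_pt);  // and the extent along the split axis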

stevengj (Collaborator):

Would be good to check that the side_low's ie is < the side_high's is along longest_axis (they should differ by 1 or 2, I think, depending on the component cc).
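One way to automate that check is an assertion built from the same corner expressions as in the proposed change above (a sketch only; assumes <cassert> and the variables from that snippet):

      ivec ie_low = min(side_low.big_corner() + gv.iyee_shift(cc), side_high.little_owned_corner0(cc));
      ivec is_high = max(side_low.big_corner() + gv.iyee_shift(cc), side_high.little_owned_corner0(cc));
      // the two STEP_CURL regions must not overlap along the split axis
      assert(ie_low.in_direction(longest_axis) < is_high.in_direction(longest_axis));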

oskooi (Collaborator, Author) commented Jul 22, 2021

Outputting the is and ie for side_low and side_high shows that they are consistent, but the tests are still failing.

      master_printf("split direction: %s, %d\n",direction_name(longest_axis),split_pt);
      ivec is = gv.little_owned_corner0(cc);
      ivec ie = side_low.big_corner();
      master_printf("side_low (%s):, is=(%d,%d,%d), ie=(%d,%d,%d)\n",
                    component_name(cc),
                    is.x(),is.y(),is.z(),ie.x(),ie.y(),ie.z());
      STEP_CURL(the_f, cc, f_p, f_m, stride_p, stride_m,
                gv, is, ie,
                Courant, dsig, s->sig[dsig],
                s->kap[dsig], s->siginv[dsig], f_u[cc][cmp], dsigu, s->sig[dsigu], s->kap[dsigu],
                s->siginv[dsigu], dt, s->conductivity[cc][d_c], s->condinv[cc][d_c],
                f_cond[cc][cmp]);

      is = side_high.little_owned_corner0(cc);
      ie = gv.big_corner();
      master_printf("side_high (%s):, is=(%d,%d,%d), ie=(%d,%d,%d)\n",
                    component_name(cc),
                    is.x(),is.y(),is.z(),ie.x(),ie.y(),ie.z());
      STEP_CURL(the_f, cc, f_p, f_m, stride_p, stride_m,
                gv, is, ie,
                Courant, dsig, s->sig[dsig],
                s->kap[dsig], s->siginv[dsig], f_u[cc][cmp], dsigu, s->sig[dsigu], s->kap[dsigu],
                s->siginv[dsigu], dt, s->conductivity[cc][d_c], s->condinv[cc][d_c],
                f_cond[cc][cmp]);

output for a 3d test problem

split direction: x, 55
side_low (bx):, is=(-108,-109,-121), ie=(2,111,-1)
side_high (bx):, is=(4,-108,-120), ie=(110,110,-2)
split direction: x, 55
side_low (by):, is=(-109,-108,-121), ie=(1,112,-1)
side_high (by):, is=(2,-106,-120), ie=(110,110,-2)
split direction: x, 55
side_low (bz):, is=(-109,-109,-120), ie=(1,111,0)
side_high (bz):, is=(2,-108,-118), ie=(110,110,-2)
split direction: x, 55
side_low (dx):, is=(-109,-108,-120), ie=(1,112,0)
side_high (dx):, is=(2,-106,-118), ie=(110,110,-2)
split direction: x, 55
side_low (dy):, is=(-108,-109,-120), ie=(2,111,0)
side_high (dy):, is=(4,-108,-118), ie=(110,110,-2)
split direction: x, 55
side_low (dz):, is=(-108,-108,-121), ie=(2,112,-1)
side_high (dz):, is=(4,-106,-120), ie=(110,110,-2)

oskooi force-pushed the cache_oblivious_step_curl branch from 5609ce8 to f386f55 on July 27, 2021, 01:10
oskooi changed the title from "WIP: loop tiling for step_curl using cache-oblivious recursive algorithm" to "loop tiling for step_curl using cache-oblivious recursive algorithm" on Jul 27, 2021
oskooi (Collaborator, Author) commented Jul 27, 2021

After rebasing onto the master branch following #1703, this feature seems to be working now: the chunks are divided into tiles and produce the same results as the untiled case. Some experimenting is required to determine a reasonable value for the base case of the recursive bisection on this line; the current value was chosen somewhat arbitrarily to verify that the feature works, not necessarily to provide any performance enhancement.

oskooi (Collaborator, Author) commented Jul 27, 2021

Here are preliminary benchmarking results for a single-threaded 3d OLED simulation consisting of two chunks (PML and non-PML regions). The plot shows the time spent timestepping vs. the number of tiles (total for the two chunks). It seems that the untiled case is always faster than any of the tiled cases. Meep is compiled with GCC and single-precision fields.

[figure: oled_tiling]

stevengj (Collaborator) commented Jul 27, 2021

Some comments:

  1. To see a benefit from tiling a single component, you need a planar cross-section of the chunk to exceed a cache size, at least the L1 cache for a given 3-component vector field. i.e. (diameter*resolution)^2 * 3 * 4bytes/float > cache, roughly.
  2. Since you are only optimizing step_curl, you can strip out any extraneous computations in the benchmark. No DFTs, no Lorentzian dispersive materials (just vacuum is fine). Just run for 1 timestep to initialize things and then benchmark a few timesteps after that — it shouldn't matter for speed whether the fields are nonzero.
  3. A bigger benefit should occur if you swap the loops so that the loop over tiles goes outside the loop over components. In this case you should see a benefit if the whole chunk exceeds the cache: roughly (diameter*resolution)^3 * 3 * 4bytes/float > cache. (I would test swapping the loops after fixing the current benchmarks though, just so that we see the effects of swapping the loops separately.)

Also, I wouldn't divide the chunk into cubes. To maximize spatial locality (keeping the consecutive-memory axis intact), since we use row-major order I would preferentially divide along x, then y, then z, i.e. something like:

if (ntot < threshold)
    don't divide
else if (x size > 1)
    divide along x
else if (y size > 1)
    divide along y
else
    divide along z

This might not be optimal, because you pay a price in temporal locality, but you can experiment with different thresholds on the different dimensions (i.e. change > 1 to > n) to explore the tradeoffs.
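As a self-contained sketch of that strategy (illustrative only: a bare struct of extents stands in for Meep's grid_volume, and the threshold and box size below are placeholders):

// illustrative sketch: a bare box of extents stands in for Meep's grid_volume
#include <cstdio>
#include <vector>

struct box { int nx, ny, nz; }; // tile extents in grid points

void split_tiles_xyz(box t, int threshold, std::vector<box> &out) {
  if (t.nx * t.ny * t.nz <= threshold) { // base case: tile is small enough
    out.push_back(t);
    return;
  }
  box lo = t, hi = t;
  if (t.nx > 1) {        // prefer cutting along x (the slowest-varying axis) ...
    lo.nx = t.nx / 2; hi.nx = t.nx - lo.nx;
  } else if (t.ny > 1) { // ... then y ...
    lo.ny = t.ny / 2; hi.ny = t.ny - lo.ny;
  } else {               // ... and only then z (the contiguous axis)
    lo.nz = t.nz / 2; hi.nz = t.nz - lo.nz;
  }
  split_tiles_xyz(lo, threshold, out);
  split_tiles_xyz(hi, threshold, out);
}

int main() {
  std::vector<box> tiles;
  split_tiles_xyz({880, 880, 176}, 12000, tiles); // placeholder chunk size and threshold
  std::printf("%zu tiles\n", tiles.size());
  return 0;
}

Changing the > 1 tests to > n (or reordering them) is then a one-line experiment for exploring the locality tradeoffs mentioned above.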

codecov-commenter commented Jul 28, 2021

Codecov Report

Merging #1655 (5a4d858) into master (3f36968) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1655      +/-   ##
==========================================
+ Coverage   73.34%   73.36%   +0.01%     
==========================================
  Files          13       13              
  Lines        4528     4531       +3     
==========================================
+ Hits         3321     3324       +3     
  Misses       1207     1207              
Impacted Files                            Coverage Δ
python/simulation.py                      76.26% <100.00%> (+0.01%) ⬆️
python/adjoint/optimization_problem.py    69.58% <0.00%> (+0.23%) ⬆️

      } else {
        best_split_point = nz() / 2;
        best_split_direction = Z;
      }
Review comment (Collaborator):

It would be worth also trying to split along the longest axis.

oskooi (Collaborator, Author) commented Jul 31, 2021

I reran the benchmarking test on my local i7-7700K machine, which has a much smaller cache than the machine used previously. I simplified the benchmarking test by removing extraneous computations following the suggestions above, and ran the benchmark five separate times for each of six tile configurations. I timed just the portion of fields::step() spent in step_db(B_stuff) and step_db(D_stuff). The results show that the min/mean time decreases with the number of tiles until reaching a minimum and then increases monotonically, as would be expected. However, the relative improvement in runtime for the optimal tiling configuration is insignificant: just ~2-3% relative to the no-tiling configuration. These results were generated with the loop over tiles outside the loop over components.

To see whether a more significant improvement is possible, we can try (1) increasing the problem size and (2) exploring other tiling configurations beyond recursive bisection along x, y, and z (e.g., splitting along the longest axis).

$ lscpu
Architecture:         x86_64
Thread(s) per core:   1
Model name:           Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
L1d cache:            32K
L1i cache:            32K
L2 cache:             256K
L3 cache:             8192K

[figure: oled_tiles_benchmark]

oskooi (Collaborator, Author) commented Jul 31, 2021

Update: I forgot to include the results for the no-tiling case in the previous figure. (In the previous figure, the smallest configuration shown was 3520 tiles.) With these updated results, the relative improvement is actually 18.2% (minimum runtime of 1.51403 s for no tiling vs. 1.23869 s for 7040 tiles), which is clearly significant. It therefore seems that this cache-oblivious loop-tiling approach does improve the runtime performance.

[figure: oled_tiles_benchmark]

stevengj (Collaborator):
Remember @matteo-frigo's suggestion — if you run on only a single core of your machine, you might be getting an unusually large memory bandwidth and hence not see benefits from cache optimization. Try running on all of your cores at once — simply do

meep.divide_parallel_processes(meep.count_processors())

to run the identical (serial) simulation on all processors, and use mpirun to launch on all cores.

oskooi (Collaborator, Author) commented Aug 1, 2021

edit: these results should be ignored due to inaccuracies in the benchmarking tests. see the results in the next comment.

I tried running the same benchmark on all four single-threaded cores simultaneously using divide_parallel_processes, and the results are actually not that different from running on just a single core. Processors 1 and 3 have slightly larger (minimum) times for the no-tiling case and also show less relative improvement at the optimal tiling configuration, whereas processors 2 and 4 show results similar to the single-core case above:

processor 1 relative improvement: 10.41%
processor 2 relative improvement: 18.23%
processor 3 relative improvement:  1.53%
processor 4 relative improvement: 16.86%

The error bars of the averaged data are noticeably larger than in the single-core case, but in general there still seems to be a benefit from using loop tiling.

[figure: oled_tiles_benchmark_procs]

oskooi (Collaborator, Author) commented Aug 1, 2021

Update: there were some inaccuracies in the results posted earlier which should therefore be ignored.

I reran the benchmarking test for 1, 2, 3, and 4 processors. The results show that the relative improvement in the time for step_db increases with the number of processors. (The noisiness of the data also seems to increase with the number of processors.) I think this validates @matteo-frigo's suggestion that the benefits of cache optimization increase as the memory bandwidth available per process decreases.

1 processor

processor 1: relative improvement of 17.26% using 7040 tiles

[figure: oled_tiles_benchmark_procs1]

2 Processors

processor 1: relative improvement of 43.43% using 56320 tiles
processor 2: relative improvement of 41.87% using 14080 tiles

[figure: oled_tiles_benchmark_procs2]

3 processors

processor 1: relative improvement of 62.51% using 14080 tiles
processor 2: relative improvement of 39.81% using 3520 tiles
processor 3: relative improvement of 50.10% using 3520 tiles

[figure: oled_tiles_benchmark_procs3]

4 processors

processor 1: relative improvement of 68.03% using 3520 tiles
processor 2: relative improvement of 63.72% using 14080 tiles
processor 3: relative improvement of 62.57% using 14080 tiles
processor 4: relative improvement of 53.73% using 3520 tiles

[figure: oled_tiles_benchmark_procs4b]

oskooi force-pushed the cache_oblivious_step_curl branch from cef5468 to 5a4d858 on August 1, 2021, 20:54
oskooi (Collaborator, Author) commented Aug 1, 2021

Added an undocumented parameter loop_tile_base to the Simulation constructor for specifying the base case of the recursive-bisection algorithm in split_into_tiles. The default is 0, which indicates no tiling. This API enables the user to experiment with different values at runtime rather than having to rebuild from source every time a hard-coded value is changed.

matteo-frigo:

@oskooi You are still using the i7-7700, right? And this is a 2D or (2+epsilon)D problem, right?

oskooi (Collaborator, Author) commented Aug 3, 2021

You are still using the i7-7700, right? And this is a 2D or (2+epsilon)D problem, right?

Yes, these results are generated using the i7-7700 and a 3d simulation. The cell size is 11 x 11 x 2.2 with a resolution of 80 pixels per unit length. The total size of the E fields (Ex,Ey,Ez) used for updating B, or equivalently the H fields (Hx,Hy,Hz) used for updating D, during the step-curl operation (shown below) is ~1.64 GB = [(3 fields x 4 bytes/field) x (11 x 11 x 2.2) x 80^3] / 1e9.
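Plugging the same numbers into the rough planar cross-section criterion from the earlier comment (using the i7-7700K cache sizes listed above): a single 11 x 11 cross section of a 3-component field is (11 x 80) x (11 x 80) pixels x 3 components x 4 bytes ≈ 9.3 MB, which already exceeds the 8 MB L3 cache (let alone the 32 KB L1), so tiling even a single component should be able to help on this machine.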

[figure: loop_tiling_step_curl]

stevengj merged commit f5bd3b2 into NanoComp:master on Aug 11, 2021
stevengj (Collaborator):

The threshold corresponding to about 15000 tiles seems good.

stevengj (Collaborator):

@matteo-frigo, you mentioned that at one point (prior to your later stencil work), you tried a simple recursive tiling strategy similar to the one we are using here, and got some speedup. Do you have a reference for that?

oskooi deleted the cache_oblivious_step_curl branch on August 11, 2021, 03:10
matteo-frigo:

@stevengj No reference, sorry.

I actually tried it as part of the ICS05 paper that you link, not prior to it. In the ICS05 paper we glossed over the base case because ICS05 is mostly a theory paper, and performing divide and conquer in the DeltaT=1 case only buys you a constant factor independent of the cache size, so it's kind of an "uninteresting" "detail". For the same reason, in Figure 6, we didn't complicate the code by attempting to cut the largest dimension first. (The cache-oblivious matrix multiplication in FOCS99 cuts the largest dimension first, so that's kind of the "obvious" thing to try.)

If you are thinking of writing up the idea and giving proper credit, I don't really need or care about the credit, and I doubt that Volker cares either. I am happy that the idea works. Analyzing the exact conditions under which cutting the largest dimension is a good idea sounds like an interesting undergraduate/master project (how much are you saving depending on the shape of the stencil, how many rows/planes have to fit in cache for this to work, etc.). We never went that far.

bencbartlett pushed a commit to bencbartlett/meep that referenced this pull request on Sep 9, 2021: loop tiling for step_curl using cache-oblivious recursive algorithm (NanoComp#1655)

* loop tiling for step_curl using cache-oblivious recursive algorithm

* minor fixes

* check that the sum of all the tiles adds up to the chunk volume

* switch from subdomain to gv when calling STEP_CURL

* replace manual computation of pixels in grid_volume with private function call

* lower mininum tile volume

* move loop over tiles outside of loop over components and use new tile_split function

* add undocumented parameter for specifying base case of recursion to Simulation constructor
mawc2019 pushed a commit to mawc2019/meep that referenced this pull request on Nov 3, 2021: loop tiling for step_curl using cache-oblivious recursive algorithm (NanoComp#1655)
kannesito:

Is this (see the papers below) the direction you are going in with the cache-oblivious recursive algorithm?
https://ieeexplore.ieee.org/abstract/document/7481533
https://ieeexplore.ieee.org/abstract/document/7228423
