loop tiling for step_curl using cache-oblivious recursive algorithm #1655
Conversation
Should I merge #1653 first? That PR should be harmless in any case.
See lectures 14 and 15 in these OCW lectures. This is about accessing the memory in the right order (improving "temporal locality"), and once you do that the cache hardware should do the rest. (The base case of the recursion has nothing to do with cache — it is "recursion coarsening" to amortize the function-call overhead. You want to shrink the base case as much as possible until you start seeing performance degrade, which will eventually happen due to function-call overhead.) Note also that this is only about improving serial (non-parallel) performance.
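(A minimal, self-contained sketch of the pattern being described, not Meep code; the function name and the base-case constant are illustrative. In Meep the locality payoff comes from the curl stencil's access pattern rather than from the pointwise placeholder update used here.)

```cpp
#include <cstddef>
#include <vector>

// Illustrative only: visit a 2D block of a row-major array by recursive bisection of
// the longer side.  The recursion improves temporal locality for stencil-like updates;
// BASE_POINTS is "recursion coarsening" -- fall back to a plain loop once a block is
// small enough, to amortize the function-call overhead.
static const std::size_t BASE_POINTS = 4096; // tune downward until call overhead appears

void update_block(std::vector<double> &f, std::size_t ny,
                  std::size_t x0, std::size_t x1, std::size_t y0, std::size_t y1) {
  if ((x1 - x0) * (y1 - y0) <= BASE_POINTS) {
    for (std::size_t x = x0; x < x1; ++x)
      for (std::size_t y = y0; y < y1; ++y)
        f[x * ny + y] *= 0.5; // stand-in for the real field update
    return;
  }
  if (x1 - x0 >= y1 - y0) { // bisect the longer dimension
    std::size_t xm = x0 + (x1 - x0) / 2;
    update_block(f, ny, x0, xm, y0, y1);
    update_block(f, ny, xm, x1, y0, y1);
  } else {
    std::size_t ym = y0 + (y1 - y0) / 2;
    update_block(f, ny, x0, x1, y0, ym);
    update_block(f, ny, x0, x1, ym, y1);
  }
}
```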
Even with the fix inside
For some reason, the CI tests all seem to be passing even though they are still failing on my local build (as I reported previously).
Based on some experiments, this PR seems to only work if the minimum tile volume
I have tried just splitting the chunk in two along its longest direction, but it still doesn't seem to work (i.e., some of the unit tests are failing).

master:

```cpp
STEP_CURL(the_f, cc, f_p, f_m, stride_p, stride_m,
          gv, gv.little_owned_corner0(cc), gv.big_corner(),
          Courant, dsig, s->sig[dsig],
          s->kap[dsig], s->siginv[dsig], f_u[cc][cmp], dsigu, s->sig[dsigu], s->kap[dsigu],
          s->siginv[dsigu], dt, s->conductivity[cc][d_c], s->condinv[cc][d_c],
          f_cond[cc][cmp]);
```

proposed change:

```cpp
// find the chunk's longest axis
direction longest_axis = NO_DIRECTION;
int num_in_longest_axis = 0;
LOOP_OVER_DIRECTIONS(gv.dim, d) {
  if (gv.num_direction(d) > num_in_longest_axis) {
    longest_axis = d;
    num_in_longest_axis = gv.num_direction(d);
  }
}
int split_pt = gv.num_direction(longest_axis) / 2;

// low half of the bisection
grid_volume side_low;
side_low.dim = gv.dim;
side_low.a = gv.a;
LOOP_OVER_DIRECTIONS(gv.dim, d) {
  side_low.set_num_direction(d, gv.num_direction(d));
}
side_low.set_origin(gv.little_owned_corner0(cc));
side_low.set_num_direction(longest_axis, split_pt);

// high half of the bisection
grid_volume side_high;
side_high.dim = gv.dim;
side_high.a = gv.a;
LOOP_OVER_DIRECTIONS(gv.dim, d) {
  side_high.set_num_direction(d, gv.num_direction(d));
}
side_high.set_origin(gv.little_owned_corner0(cc));
side_high.shift_origin(longest_axis, split_pt * 2);
side_high.set_num_direction(longest_axis, side_high.num_direction(longest_axis) - split_pt);

// update the low half
ivec is = gv.little_owned_corner0(cc);
ivec ie = min(side_low.big_corner() + gv.iyee_shift(cc), side_high.little_owned_corner0(cc));
STEP_CURL(the_f, cc, f_p, f_m, stride_p, stride_m,
          gv, is, ie,
          Courant, dsig, s->sig[dsig],
          s->kap[dsig], s->siginv[dsig], f_u[cc][cmp], dsigu, s->sig[dsigu], s->kap[dsigu],
          s->siginv[dsigu], dt, s->conductivity[cc][d_c], s->condinv[cc][d_c],
          f_cond[cc][cmp]);

// update the high half
is = max(side_low.big_corner() + gv.iyee_shift(cc), side_high.little_owned_corner0(cc));
ie = gv.big_corner();
STEP_CURL(the_f, cc, f_p, f_m, stride_p, stride_m,
          gv, is, ie,
          Courant, dsig, s->sig[dsig],
          s->kap[dsig], s->siginv[dsig], f_u[cc][cmp], dsigu, s->sig[dsigu], s->kap[dsigu],
          s->siginv[dsigu], dt, s->conductivity[cc][d_c], s->condinv[cc][d_c],
          f_cond[cc][cmp]);
```
We really should have a copy constructor for
Would be good to check that the sum of all the tiles adds up to the chunk volume.
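(A hedged sketch of the kind of consistency check suggested here, not the PR's actual code; it is written against plain counts so that it stands alone. In the PR, the per-tile counts would presumably come from something like `grid_volume::ntot()` over the `gvs` vector built by `split_into_tiles`.)

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Verify that the tiles of a chunk account for the chunk's grid points exactly.
void check_tiles_cover_chunk(const std::vector<std::size_t> &tile_points,
                             std::size_t chunk_points) {
  std::size_t total = 0;
  for (std::size_t n : tile_points)
    total += n;
  assert(total == chunk_points && "tiles do not add up to the chunk volume");
}
```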
Outputting the split direction and the `is`/`ie` corners of each half for a 3d test problem:

```cpp
master_printf("split direction: %s, %d\n", direction_name(longest_axis), split_pt);

ivec is = gv.little_owned_corner0(cc);
ivec ie = side_low.big_corner();
master_printf("side_low (%s): is=(%d,%d,%d), ie=(%d,%d,%d)\n",
              component_name(cc),
              is.x(), is.y(), is.z(), ie.x(), ie.y(), ie.z());
STEP_CURL(the_f, cc, f_p, f_m, stride_p, stride_m,
          gv, is, ie,
          Courant, dsig, s->sig[dsig],
          s->kap[dsig], s->siginv[dsig], f_u[cc][cmp], dsigu, s->sig[dsigu], s->kap[dsigu],
          s->siginv[dsigu], dt, s->conductivity[cc][d_c], s->condinv[cc][d_c],
          f_cond[cc][cmp]);

is = side_high.little_owned_corner0(cc);
ie = gv.big_corner();
master_printf("side_high (%s): is=(%d,%d,%d), ie=(%d,%d,%d)\n",
              component_name(cc),
              is.x(), is.y(), is.z(), ie.x(), ie.y(), ie.z());
STEP_CURL(the_f, cc, f_p, f_m, stride_p, stride_m,
          gv, is, ie,
          Courant, dsig, s->sig[dsig],
          s->kap[dsig], s->siginv[dsig], f_u[cc][cmp], dsigu, s->sig[dsigu], s->kap[dsigu],
          s->siginv[dsigu], dt, s->conductivity[cc][d_c], s->condinv[cc][d_c],
          f_cond[cc][cmp]);
```
After rebasing from the master branch following #1703, this feature seems to be working now: the chunks are divided into tiles and produce the same results as the untiled case. Some experimenting is required to determine a reasonable value for the base case of the recursive bisection on this line, which has been chosen somewhat arbitrarily to test that the feature is working, not necessarily to provide any performance enhancement.
Here are preliminary benchmarking results for a single-threaded 3d OLED simulation consisting of two chunks (PML and non-PML regions). The plot shows the time spent timestepping vs. the number of tiles (total for the two chunks). It seems that the untiled case is always faster than any of the tiled cases. Meep is compiled with GCC and single-precision fields.
Some comments:
Also, I wouldn't divide the chunk into cubes. To maximize spatial locality (along the consecutive-memory axis), and since we use row-major order, I would preferentially divide along x, then y, then z, i.e. something like the sketch below:
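(Illustrative sketch only, not code from this PR; the threshold names `tile_nx_min`, `tile_ny_min`, and `tile_nz_min` are placeholders.)

```cpp
// Cut along x while it is still "large", then y, then z, so that each tile keeps
// long contiguous runs along the fastest-varying axis.
direction choose_split_direction(const grid_volume &gv,
                                 int tile_nx_min, int tile_ny_min, int tile_nz_min) {
  if (gv.nx() > tile_nx_min) return X;
  if (gv.ny() > tile_ny_min) return Y;
  if (gv.nz() > tile_nz_min) return Z;
  return NO_DIRECTION; // small enough along every axis: stop recursing
}
```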
This might not be optimal, because you pay a price in temporal locality, but you can experiment with different thresholds on the different dimensions.
Codecov Report

```
@@            Coverage Diff             @@
##           master    #1655      +/-   ##
==========================================
+ Coverage   73.34%   73.36%   +0.01%
==========================================
  Files          13       13
  Lines        4528     4531       +3
==========================================
+ Hits         3321     3324       +3
  Misses       1207     1207
```
```cpp
} else {
  best_split_point = nz() / 2;
  best_split_direction = Z;
}
```
It would be worth also trying to split along the longest axis.
I reran the benchmarking test on my local i7-7700K machine, which has a much smaller cache than the machine used previously. I simplified the benchmarking test by removing extraneous computations following the suggestions above. I ran this benchmark five separate times for each of six tile configurations and timed just the time-stepping portion. To verify whether it is possible to demonstrate a more significant improvement, we can try (1) increasing the problem size and (2) exploring other tiling configurations beyond recursive bisection along x, y, and z (i.e., splitting along the longest axis).
Update: I forgot to include the results for the no-tiling case in the previous figure. (The previous figure shows the smallest tiling configuration with 3520 tiles.) With these updated results, the relative improvement is actually 18.2% (minimum runtime of 1.51403 s for no tiling vs. 1.23869 s for 7040 tiles), which is significant. It therefore seems that this cache-oblivious loop tiling approach is indeed improving the runtime performance.
Remember @matteo-frigo's suggestion: if you run on only a single core of your machine, you might be getting an unusually large memory bandwidth and hence not see benefits from cache optimization. Try running on all of your cores at once; simply do `meep.divide_parallel_processes(meep.count_processors())` to run the identical (serial) simulation on all processors, and use
edit: these results should be ignored due to inaccuracies in the benchmarking tests. See the results in the next comment.

I tried running the same benchmark on all four single-threaded cores simultaneously using the suggestion above. The error bars of the averaged data are noticeably larger than in the single-core case, but in general there still seems to be a benefit from using loop tiling.
Update: there were some inaccuracies in the results posted earlier which should therefore be ignored. I reran the benchmarking test for 1, 2, 3, and 4 processors. The results show the relative improvement in the runtime for each case:

(Figures: benchmark results for 1, 2, 3, and 4 processors.)
Added an undocumented parameter to the Simulation constructor for specifying the base case of the recursion.
@oskooi You are still using the i7-7700, right? And this is a 2D or (2+epsilon)D problem, right?
Yes, these results are generated using the i7-7700 and a 3d simulation. The cell size is 11 x 11 x 2.2 with a resolution of 80 pixels per unit length. The total size of the E (Ex, Ey, Ez) and H (Hx, Hy, Hz) fields used for updating B and D, respectively, during the step-curl operation (shown below) is ~1.64 GB = [(3 fields x 4 bytes/field) x (11 x 11 x 2.2) x 80^3] / 1e9.
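Spelling that estimate out (single-precision fields, so 4 bytes per field value per grid point):

$$
(3\ \text{fields}) \times (4\ \text{bytes/field}) \times \underbrace{(11 \times 11 \times 2.2)}_{\text{cell volume}} \times \underbrace{80^3}_{\text{points per unit volume}} \approx 1.64 \times 10^{9}\ \text{bytes}.
$$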
The threshold corresponding to about 15000 tiles seems good.
@matteo-frigo, you mentioned that at one point (prior to your later stencil work), you tried a simple recursive tiling strategy similar to the one we are using here, and got some speedup. Do you have a reference for that?
@stevengj No reference, sorry. I actually tried it as part of the ICS05 paper that you link, not prior to it. In the ICS05 paper we glossed over the base case because ICS05 is mostly a theory paper, and performing divide and conquer in the DeltaT=1 case only buys you a constant factor independent of the cache size, so it's kind of an "uninteresting" "detail". For the same reason, in Figure 6, we didn't complicate the code by attempting to cut the largest dimension first. (The cache-oblivious matrix multiplication in FOCS99 cuts the largest dimension first, so that's kind of the "obvious" thing to try.) If you are thinking of writing up the idea and giving proper credit, I don't really need or care about the credit, and I doubt that Volker cares either. I am happy that the idea works. Analyzing the exact conditions under which cutting the largest dimension is a good idea sounds like an interesting undergraduate/master's project (how much are you saving depending on the shape of the stencil, how many rows/planes have to fit in cache for this to work, etc.). We never went that far.
…anoComp#1655)

* loop tiling for step_curl using cache-oblivious recursive algorithm
* minor fixes
* check that the sum of all the tiles adds up to the chunk volume
* switch from subdomain to gv when calling STEP_CURL
* replace manual computation of pixels in grid_volume with private function call
* lower minimum tile volume
* move loop over tiles outside of loop over components and use new tile_split function
* add undocumented parameter for specifying base case of recursion to Simulation constructor
Is this (see papers) the direction you are going in with the cache-oblivious recursive algorithm?
Initial (and incomplete) attempt to extend the revamped API for `STEP_CURL` and `STEP_UPDATE_EDHB` introduced in #1653 to support loop tiling using a cache-oblivious recursive algorithm. The focus is to first verify the correctness of the overall approach.

This PR introduces a new member `std::vector<grid_volume> gvs` to the `fields_chunk` class which stores an array of the subdomains or "tiles" of each chunk, generated using a recursive-bisection algorithm implemented in `split_into_tiles`. The nice thing is that we can reuse the existing `grid_volume` class functions `find_best_split` and `split_at_fraction` (from the cell-partitioning feature) inside `split_into_tiles`. The tiles are generated once as part of the `fields_chunk` constructor, although in general they would need to be redefined every time `chunk_connections_valid = false`. A loop over the tiles is then defined around `STEP_CURL` inside `fields_chunk::step_db` (a similar approach would apply to `STEP_UPDATE_EDHB` inside `fields_chunk::update_eh`). For now, only Cartesian coordinates are supported; we can deal with cylindrical coordinates separately after this is working.

There is a hard-coded value `base_nowned_min` inside `split_into_tiles` which determines the size of the smallest domain in the recursion. In my preliminary benchmarking results, adjusting this value between `100` and `1000` changed the average time-stepping rate by ~20% for a test case involving an OLED simulation with 4 MPI processes on a single machine. I will post more results shortly.

Unfortunately, this PR is causing some of the unit tests to fail, which probably needs to be resolved first.
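For readers who want a mental model of the approach described above, here is a rough sketch, not the code in this PR: `split_tile_in_two` is a hypothetical stand-in for the `find_best_split`/`split_at_fraction`/`tile_split` helpers mentioned in this thread, and `base_ntot_min` (tested against `ntot()`) plays the role of `base_nowned_min`.

```cpp
// Rough sketch only -- not this PR's implementation.
void split_into_tiles(const grid_volume &gv, std::vector<grid_volume> *tiles,
                      size_t base_ntot_min) {
  if (gv.ntot() <= base_ntot_min) { // base case of the recursive bisection
    tiles->push_back(gv);
    return;
  }
  grid_volume low, high;
  split_tile_in_two(gv, &low, &high); // hypothetical helper: bisect along x, then y, then z
  split_into_tiles(low, tiles, base_ntot_min);
  split_into_tiles(high, tiles, base_ntot_min);
}
```

Inside `fields_chunk::step_db`, the idea is then to loop over the stored tiles (`gvs`) and pass each tile's corners, rather than the whole chunk's corners, to `STEP_CURL`.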