Avoid comm/compute overlap for short-lived tasks #12964
Conversation
As we've discussed previously, this still seems like an awfully indirect way to do what we want, which is basically to throttle task creation as a side effect of yields. I'd rather we do this by adding a throttle to Qthreads. But I'm resigned to my fate here. That said, is there a strong reason not to do this directly in
I agree this is weird, but I think it's pretty simple for now and puts us in a better state. I'll open up a new issue for a better long-term fix. For a qthreads throttle -- I'm not quite sure what that should look like just yet. For the case of oversubscribed ra-atomics we do want to task yield so we can get overlap and it's possible not all tasks will be running so a
Hmm, yeah that's an interesting/good idea. Let me think about that.
👍
Moving this logic to a yield wrapper is captured in https://github.com/Cray/chapel-private/issues/209. Offline we agreed to go with this to start and make improvements later.
Add a test that creates thousands of remote tasks (more task stacks than there is physical memory for if they all ran concurrently). I believe this is a test we want to work under qthreads, since there are no user-induced task yields. It currently fails under both ugni and gasnet because of task yielding in the comm layers.
They should be relatively quick (and are currently used for end-count manipulation, which can artificially increase the lifetime of tasks), so avoid yielding.
Avoid comm/compute overlap for FMA in ugni for tasks that haven't issued at least 100 FMA operations. This is a little bit odd, but it prevents us from doing comm/compute overlap for tasks that haven't done much comm. This allows us to run an arbitrary number of tasks with no user-induced yields, but still get comm/compute overlap for RA. Here we use task-local storage to track the amount of comm. Accessing task-local storage is pretty cheap (~half the cost of a processor atomic), so this adds virtually no overhead.
Force-pushed from 13ed38b to f504fc6
Avoid more comm/compute overlap in ugni [reviewed by @gbtitus] In #12964 we eliminated comm/compute overlap for FMA operations when a task hadn't done ~100 FMA operations. Here we extend that to RDMA and chained FMA operations as well. This is motivated by aggregation work where we do on-stmts followed by large RDMA PUTs/GETs, but we really don't want too many tasks started at once since that increases memory pressure. This also adds an option to completely disable comm/compute overlap in ugni, but it's not enabled by default since it has a 2x performance hit for SSCA.
Adjust remote cache to better handle task yields in comm events

Before this PR, the `--cache-remote` cache used a "lock" to ensure that only a single task entered the cache API functions at a time. This PR adjusts it to instead "lock" at the cache-entry level. (Here "lock" means checking a variable indicating the entry is in use, looping and yielding until it is 0, and then setting it. This is similar to how a lock works, but in this case it only manages access from other tasks as a result of `chpl_task_yield`.)

Details:
* Added the `entryReservedByTask` field to the cache entries. Adjusted `flush_entry`, `cache_get`, and `cache_put` to take this "lock".
* Passed `task_local` to many functions since it is used to know if the current task holds the "lock" on an entry. This lets the code detect whether the lock is already taken by the current task, which is primarily used in assertion checking right now.
* Factored per-page operations out of put/get/invalidate.
* Added "lock" code around per-page operations -- put/get/invalidate as well as evictions take the lock and release it when completed.
* Adjusted the code waiting for nonblocking comm operations (`do_wait_for`) to better tolerate task yields.
* Took care when locking another entry while an entry is already held, since that kind of thing can cause deadlock. The one place this occurred was in the `cache_get` code, which made sure there was a freed page while waiting for the nonblocking GET. Since this is only a performance hint, I adjusted it to simply give up if the page needing to be evicted is locked.
* Automatic readahead when doing GETs does not lock multiple entries because it only happens on a cache hit -- so it unlocks the entry with the data before starting the prefetching (which will lock each page involved, one at a time).
* Added an `EXTRA_YIELD` setting for debugging which does yielding around comm events. This allowed me to debug the issues with yielding using a local gasnet configuration.
* Added more debug printing/tracing that I used during debugging.

Additionally, I noticed that the strided get/put calls were invalidating the entire cache, but we now have functions to invalidate just a region. At the same time, we have strided "common helper" functions that enumerate the contiguous regions in runtime/include/chpl-comm-strd-xfer.h. So, I updated the strided get/put support in the cache to only invalidate the portions updated by the operation, instead of the entire cache. To do so, I needed to add another variant of the code in chpl-comm-strd-xfer.h, since the other functions there call `comm_get`/`comm_put` and this one needs to invalidate. The `STRIDED_INVALIDATE_ALL` setting can be changed to check the performance of this change independently.

Related to PRs #15266 and #16020 and issues Cray/chapel-private#585 and Cray/chapel-private#879. PR #12964 talks a bit about why comm/compute overlap changes are important in comm performance.

This PR improves performance for several Arkouda operations when running with `--cache-remote`. It improves performance because the previous coarse-grained locking strategy added yields without really making progress on communication in oversubscribed settings: tasks would be created to try to do comms but would not be able to make progress.

Reviewed by @gbtitus - thanks!

- [x] test/runtime/configMatters/comm/cache-remote and test/optimizations/cache-remote pass with `VERIFY` and `EXTRA_YIELDS` enabled (except for `ra_prefetch.chpl`, which times out because the problem size is too large for this configuration but works with a smaller problem size)
- [x] full local gasnet testing
- [x] full local gasnet testing with `--cache-remote`
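The entry-level "lock" described above can be sketched as a check-loop-yield-set sequence. This is a minimal illustration, not the actual `--cache-remote` implementation: the field name `entryReservedByTask` comes from the PR, but `entry_lock`, `entry_trylock`, and `entry_unlock` are hypothetical helpers, and `chpl_task_yield()` is stubbed out here (real code would call the tasking layer's yield).

```c
#include <stdint.h>

// Stub standing in for the tasking layer's yield (illustration only).
static void chpl_task_yield(void) { /* switch to another ready task */ }

typedef struct cache_entry_s {
  int64_t entryReservedByTask;  // 0 when free, else the reserving task id
} cache_entry_t;

// With cooperative tasking, task switches happen only at yields, so a
// plain check-then-set between yields is not a data race.
static void entry_lock(cache_entry_t* entry, int64_t task_id) {
  while (entry->entryReservedByTask != 0)
    chpl_task_yield();  // another task holds it; let that task finish
  entry->entryReservedByTask = task_id;
}

// Nonblocking variant for places where waiting could deadlock (e.g.
// choosing a page to evict while already holding another entry):
// give up instead of yielding, since eviction is only a performance hint.
static int entry_trylock(cache_entry_t* entry, int64_t task_id) {
  if (entry->entryReservedByTask != 0)
    return 0;
  entry->entryReservedByTask = task_id;
  return 1;
}

static void entry_unlock(cache_entry_t* entry) {
  entry->entryReservedByTask = 0;
}
```

The try-lock variant mirrors the deadlock-avoidance point in the list above: a task that already holds one entry never blocks on a second.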
Avoid comm/compute overlap for short-lived (minimal comm) tasks.
Comm/compute overlap can artificially increase the lifetime of tasks
because we yield, put a task at the end of the queue, and will have to
cycle through all currently running/queued tasks before getting back to
it. This can lead to OOM situations when you have a lot of short-lived
tasks with no user-induced yielding. This can occur pretty easily today
because things like reductions will create numLocales tasks on locale 0.
These tasks are short-lived and don't have any user-induced yields, so
the OOM is pretty extreme.
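The queue-cycling effect can be seen in a toy model, under simplifying assumptions: one worker, a FIFO run queue, and tasks that yield exactly once (as a comm/compute-overlap yield would). Because a yield requeues the task at the tail, every task is started, and gets a stack allocated, before any task finishes, so the peak number of live stacks equals the total task count. All names here are illustrative, not runtime code.

```c
#include <string.h>

#define NTASKS 1000

int peak_live_stacks_with_yields(void) {
  int started[NTASKS];
  memset(started, 0, sizeof started);
  int queue[2 * NTASKS];  // each task is enqueued at most twice
  int head = 0, tail = 0;
  for (int i = 0; i < NTASKS; i++) queue[tail++] = i;

  int live = 0, peak = 0;
  while (head < tail) {
    int t = queue[head++];
    if (!started[t]) {
      started[t] = 1;        // first run: a task stack is allocated
      if (++live > peak) peak = live;
      queue[tail++] = t;     // the yield requeues it behind everyone else
    } else {
      live--;                // second run: the task finishes, stack freed
    }
  }
  return peak;  // NTASKS with yields; without them, tasks run to completion
}
```

With yields the model reaches 1000 simultaneously-live stacks; without them, a single worker never has more than one. Scale that by real task-stack sizes and locale counts and the OOM follows.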
Here we eliminate comm/compute overlap for unregistered puts/gets and
fast-ons in gasnet. We avoid doing comm/compute overlap for FMA (short
gets/puts and AMOs) under ugni unless a task has issued at least 100 FMA
operations. This allows us to get comm/compute overlap in cases where it
really matters for performance (many FMA requests and large BTE comm)
without increasing the lifetime of short-lived tasks. We use
task-local-storage to track the number of FMA requests issued. This is
fast (~half the cost of an atomic operation) so the overhead is
negligible and still allows us to get good performance for
oversubscribed RA-atomics.
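The threshold check can be sketched roughly as follows. This is a simplified model with hypothetical names (`comm_task_local_t`, `fma_should_yield`, `FMA_YIELD_THRESHOLD`): the real runtime keys this off ugni FMA operations, stores the counter in Chapel's task-local storage, and yields via the tasking layer; here a plain struct stands in and the function just reports the decision.

```c
#include <stdbool.h>
#include <stdint.h>

#define FMA_YIELD_THRESHOLD 100  // ~100 small comm ops before overlapping

typedef struct {
  uint64_t fma_count;  // per-task count of FMA (short get/put/AMO) ops
} comm_task_local_t;

// Called once per FMA operation. Short-lived tasks that have done little
// comm keep running to completion (no artificial lifetime extension);
// comm-heavy tasks (e.g. RA) yield so other tasks can overlap
// computation with the network.
static bool fma_should_yield(comm_task_local_t* tl) {
  tl->fma_count++;
  return tl->fma_count > FMA_YIELD_THRESHOLD;
}
```

Since the counter lives in task-local storage, the per-operation cost is just an increment and a compare, consistent with the "~half the cost of an atomic" measurement above.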
This isn't a perfect solution and primarily avoids task yields for
endcount manipulation, but I think this is better than what we had
before. A better solution is probably to decrease the size of our task
stacks (and/or avoid registering task stacks and keep them on
non-hugepages so we could limit the amount of physical memory used for
them.)
This should allow us to remove all our execenvs that reduce the task
stack size for multi-locale and scalability testing.
Closes #12874