Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Task ringbuffer #112

Merged
merged 8 commits into from
Jul 18, 2022
Merged

Task ringbuffer #112

merged 8 commits into from
Jul 18, 2022

Conversation

PeterTh
Copy link
Contributor

@PeterTh PeterTh commented Apr 1, 2022

This is a performance refactoring replacing the unordered_map in the task_manager with a ringbuffer.

Far more importantly than the difference in data structure (though that is also measurable) this completely eliminates any locking contention for accessing tasks by id.

It also introduces a new microbenchmark for task handling, in 2 variants, one with pure task handling, and one in a more realistic scenario with another thread accessing tasks:

benchmark name                       samples       iterations    estimated
                                     mean          low mean      high mean
                                     std dev       low std dev   high std dev
------------------------------------------------------------------------------- Master
generating and deleting tasks                  100             1    590.461 ms 
                                        5.74591 ms    5.67021 ms    5.80049 ms 
                                        324.677 us    251.264 us    397.061 us 
generating and deleting tasks with                                             
access thread                                  100             1     3.78352 s 
                                        37.0397 ms    35.3406 ms    38.6955 ms 
                                        8.59095 ms    7.65626 ms    9.70456 ms 
------------------------------------------------------------------------------- This Branch
generating and deleting tasks                  100             1    464.589 ms 
                                        4.56318 ms    4.50633 ms    4.60211 ms 
                                        237.555 us    174.134 us    308.134 us 
generating and deleting tasks with                                             
access thread                                  100             1    878.879 ms 
                                        8.39569 ms    8.29195 ms    8.48651 ms 
                                        495.973 us    419.964 us    580.127 us 

Copy link
Contributor

@fknorr fknorr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is exciting! My major concern with the implementation is, however, that an allocation-free implementation like this is closely tied to stalling the main thread on submission, but that goal is only partially addressed by spin-waiting inside reserve_new_tid (We explicitly want to throttle submission to avoid overwhelming MPI as discussed earlier).

Moving task deletion to the scheduler thread (inside notify...) unnecessarily introduces a second mutator to the ring buffer. We should really merge #86 first, which gives us the epoch_monitor that allows the task manager to block on a horizon being executed (and becoming the new effective epoch) if it would otherwise overrun the ring buffer. reserve_new_tid would then return an optional, and wait_for_available_slot becomes unnecessary.

Also, a couple of smaller things:

  • task_ring_buffer should rigorously document which member functions can be invoked only by the producer thread (main / task manager) and which ones are thread-safe.
  • I don't believe we need to separate next_active_tid from next_task_id if we extend has_task to check for != nullptr on the task pointer. Then we can also make the API of emplace() less restrictive. Edit: On second thought, that is not true, since we need the release-store on next_active_id to make the non-atomic insertion into the unique_ptr visible.
  • Once the task manager thread becomes the only mutator, number_of_deleted_tasks does not need to be atomic any more. Also, all atomic accesses can be either acquire or release, no need for seq_cst anywhere.

include/task_ring_buffer.h Outdated Show resolved Hide resolved
include/task_ring_buffer.h Outdated Show resolved Hide resolved
include/task_ring_buffer.h Outdated Show resolved Hide resolved
include/task_manager.h Outdated Show resolved Hide resolved
include/task_manager.h Outdated Show resolved Hide resolved
include/task_ring_buffer.h Outdated Show resolved Hide resolved
@PeterTh PeterTh self-assigned this Apr 4, 2022
@PeterTh PeterTh force-pushed the task-ringbuffer branch 6 times, most recently from d0df1b2 to fec38b9 Compare May 19, 2022 11:51
@PeterTh
Copy link
Contributor Author

PeterTh commented Jun 17, 2022

This latest version now uses the epoch monitor for synchronization. The performance impact of that compared to spinning is measurable but seems ok:

comp

There are now also unit tests for the borderline scenarios of the ringbuffer (running out of buffer space due to long-running active tasks but eventually recovering; and running into a situation which would deadlock).

@PeterTh PeterTh requested a review from fknorr June 17, 2022 13:23
Copy link
Contributor

@fknorr fknorr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Some general remarks in addition to the line comments:

  • I would perfer to avoid implicitly-seq_cst atomic operations in favor of those with explicit memory ordering. Even though the former is always correct, it keeps the reader in the dark about when atomics are used to synchronize other memory accesses. E.g. most loads in the task manager thread can be relaxed.
  • task_ring_buffer should rigorously document which member functions can be invoked only by the producer thread (main / task manager) and which ones are thread-safe.

include/task_manager.h Outdated Show resolved Hide resolved
include/task_manager.h Outdated Show resolved Hide resolved
include/task_ring_buffer.h Outdated Show resolved Hide resolved
include/task_ring_buffer.h Outdated Show resolved Hide resolved
include/task_ring_buffer.h Outdated Show resolved Hide resolved
include/task_ring_buffer.h Outdated Show resolved Hide resolved
src/task_manager.cc Outdated Show resolved Hide resolved
src/task_manager.cc Outdated Show resolved Hide resolved
src/task_manager.cc Outdated Show resolved Hide resolved
test/task_ring_buffer_tests.cc Outdated Show resolved Hide resolved
src/task_manager.cc Show resolved Hide resolved
@PeterTh PeterTh force-pushed the task-ringbuffer branch 2 times, most recently from f6ee72d to c7def2d Compare July 18, 2022 10:27
Copy link
Member

@psalz psalz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

include/task_manager.h Show resolved Hide resolved
src/task_manager.cc Show resolved Hide resolved
test/task_ring_buffer_tests.cc Show resolved Hide resolved
include/task_manager.h Outdated Show resolved Hide resolved
Copy link
Contributor

@fknorr fknorr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@PeterTh PeterTh merged commit 5139256 into master Jul 18, 2022
@PeterTh PeterTh deleted the task-ringbuffer branch July 18, 2022 14:42
@fknorr fknorr mentioned this pull request Aug 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants