Perf[BMQ,MQB]: callback construction in a reusable buffer #481

Open: wants to merge 1 commit into base: main

Collaborator

@678098 678098 commented Oct 27, 2024

Changes

  • bmqu_managedcallback: a new component introducing a functor holder class that reuses its memory buffer.
  • bmqu_managedcallback_cpp03: autogenerated C++03-compatible version.
  • bmqu_managedcallback.t: added tests and benchmarks.
  • mqbi_dispatcher: get rid of bsl::function usage, replace it with bmqu::ManagedCallback.
  • Remove mqbi::Dispatcher::ProcessorFunctor, pass and execute void functors only.
  • mqbblp, mqbc: remove mqbi::Dispatcher::ProcessorHandle usage from args and callbacks (it was not actually needed).
  • On the fanout30 scenario stable throughput increased by +12.5%.
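
For readers unfamiliar with the technique, here is a minimal sketch of a functor holder that constructs callbacks in place and keeps its buffer between uses. It is written against the standard library only, and every name in it (`CallbackBase`, `ReusableCallback`, `emplace`, `AddFunctor`) is hypothetical; it is not the actual bmqu::ManagedCallback interface.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical stand-in for the base type every stored functor derives from.
struct CallbackBase {
    virtual ~CallbackBase() {}
    virtual void operator()() = 0;
};

// Hypothetical holder: builds functors in place inside a buffer that is kept
// (and reused) across reset() calls.
class ReusableCallback {
    std::vector<char>  d_buffer;      // grows to the largest functor seen
    CallbackBase      *d_callback_p;  // points into d_buffer, null if empty

  public:
    ReusableCallback() : d_callback_p(nullptr) {}
    ~ReusableCallback() { reset(); }

    // The stored object may not be trivially copyable: copying is disabled.
    ReusableCallback(const ReusableCallback&)            = delete;
    ReusableCallback& operator=(const ReusableCallback&) = delete;

    // Destroy the stored functor, if any, but keep the buffer memory.
    void reset()
    {
        if (d_callback_p) {
            d_callback_p->~CallbackBase();
            d_callback_p = nullptr;
        }
    }

    // Construct a FUNCTOR in the buffer, growing it only when too small.
    template <class FUNCTOR, class... ARGS>
    void emplace(ARGS&&... args)
    {
        static_assert(std::is_base_of<CallbackBase, FUNCTOR>::value,
                      "FUNCTOR must derive from CallbackBase");
        static_assert(alignof(FUNCTOR) <= alignof(std::max_align_t),
                      "over-aligned functors are not supported");
        reset();
        if (d_buffer.size() < sizeof(FUNCTOR)) {
            d_buffer.resize(sizeof(FUNCTOR));
        }
        d_callback_p =
            new (d_buffer.data()) FUNCTOR(std::forward<ARGS>(args)...);
    }

    void operator()()
    {
        assert(d_callback_p && "no functor stored");
        (*d_callback_p)();
    }

    bool        empty() const { return d_callback_p == nullptr; }
    const void *bufferAddress() const { return d_buffer.data(); }
};

// Example functor: adds a fixed delta to a counter owned by the caller.
struct AddFunctor : CallbackBase {
    int *d_total_p;
    int  d_delta;
    AddFunctor(int *total, int delta) : d_total_p(total), d_delta(delta) {}
    void operator()() override { *d_total_p += d_delta; }
};
```

Calling `emplace<AddFunctor>(&total, 5)` and then `emplace<AddFunctor>(&total, 7)` on the same holder constructs both functors at the same buffer address, so the second construction needs no allocation.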

Problem

Binding into bsl::function has too much overhead. Even when the bind itself benefits from the small buffer optimization, copying or moving the bound functor into a bsl::function (the mqbi::DispatcherEvent::d_callback field) causes allocations. This happens for every confirm going to the cluster.
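
The allocation cost can be illustrated outside of BDE. The sketch below uses std::function as a stand-in for bsl::function and counts calls to the global operator new while wrapping a functor. `BigFunctor` and `countWrapAllocations` are hypothetical names, and the 128-byte payload is an assumption chosen to exceed the small-object buffer of mainstream standard libraries; the exact threshold is implementation-defined.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <new>

// Measurement only: count every call to the global operator new.
static std::size_t g_newCalls = 0;

void *operator new(std::size_t size)
{
    ++g_newCalls;
    if (void *p = std::malloc(size ? size : 1)) {
        return p;
    }
    throw std::bad_alloc();
}

void operator delete(void *p) noexcept { std::free(p); }
void operator delete(void *p, std::size_t) noexcept { std::free(p); }

// Hypothetical functor, assumed to be too large for the small-object buffer
// of the type-erased wrapper.
struct BigFunctor {
    char d_payload[128];
    void operator()() const {}
};

// Wrap a BigFunctor into 'out' 'iterations' times and return how many
// operator new calls were made along the way.
std::size_t countWrapAllocations(std::function<void()>& out, int iterations)
{
    assert(iterations >= 0);
    const std::size_t before = g_newCalls;
    for (int i = 0; i < iterations; ++i) {
        out = BigFunctor();  // type erasure copies the functor to the heap
        out();
    }
    return g_newCalls - before;
}
```

Under these assumptions every wrap performs at least one heap allocation, which is the per-event cost this PR avoids by constructing the callback in a reusable buffer.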

Profiler

Before (in red - everything related to Bind and conversions to d_callback):
Screenshot 2024-10-27 at 05 39 05

After:
Screenshot 2024-10-27 at 05 11 24

This PR was tested on a priority queue, where it saves ~10% of cluster thread processing time. The impact should be even greater on fanout queues.

Isolated benchmark

A simple benchmark for both approaches:

class ConfirmFunctor {
  private:
    size_t d_num;

  public:
    // CREATORS
    explicit ConfirmFunctor(size_t num, bslma::Allocator *allocator = 0)
    : d_num(num)
    {
        // NOTHING
    }

    ConfirmFunctor(const ConfirmFunctor&) = default;

    ConfirmFunctor(bslmf::MovableRef<ConfirmFunctor> other) BSLS_NOTHROW_SPEC
    : d_num(other.d_num)
    {
        // NOTHING
    }

    void operator()() {
        // Prints only when d_num == 101, so the call has an observable
        // effect.
        if (d_num + 10 == 111) {
            bsl::cout << d_num << bsl::endl;
        }
    }
};

struct DataTester {
    bsl::function<void()> d_f1;

    explicit DataTester(bslmf::MovableRef<ConfirmFunctor> f1)
    : d_f1(bslmf::MovableRefUtil::move(f1))
    {

    }

    void test() {
        d_f1();
    }
};

struct DataTester2 {
    ConfirmFunctor d_f1;

    explicit DataTester2(bslmf::MovableRef<ConfirmFunctor> f1)
    : d_f1(bslmf::MovableRefUtil::move(f1))
    {

    }

    void test() {
        d_f1();
    }
};

static void testFunctors(bslma::Allocator *allocator) {
    bsl::cout << bsl::is_nothrow_move_constructible_v<ConfirmFunctor> << bsl::endl;
    bsl::cout << sizeof(ConfirmFunctor) << bsl::endl;
    {
        bsls::Types::Int64 begin = bsls::TimeUtil::getTimer();
        for (size_t i = 0; i < 100000000; i++) {
            DataTester tester(ConfirmFunctor(i, allocator));
            tester.test();
        }
        bsls::Types::Int64 end = bsls::TimeUtil::getTimer();
        bsl::cout << "dt function: " << bmqu::PrintUtil::prettyTimeInterval(end - begin) << "\n";
    }
    {
        bsls::Types::Int64 begin = bsls::TimeUtil::getTimer();
        for (size_t i = 0; i < 100000000; i++) {
            DataTester2 tester(ConfirmFunctor(i, allocator));
            tester.test();
        }
        bsls::Types::Int64 end = bsls::TimeUtil::getTimer();
        bsl::cout << "dt in-place: " << bmqu::PrintUtil::prettyTimeInterval(end - begin) << "\n";
    }
}

Outputs:

1
8
101
dt function: 1.43 s
101
dt in-place: 33.96 ms

So converting the functor to bsl::function costs roughly 42x more than using it in place in this example (1.43 s vs 33.96 ms).

perf output for this sample code:
Screenshot 2024-10-27 at 05 54 59

@678098 678098 requested a review from a team as a code owner October 27, 2024 05:29
@678098 678098 changed the title [POC]Perf[MQB]: callback allocations in fixed buffer [POC]Perf[MQB]: callback construction in a fixed buffer Oct 27, 2024
@678098 678098 changed the title [POC]Perf[MQB]: callback construction in a fixed buffer Perf[MQB]: callback construction in a fixed buffer Oct 28, 2024
@678098 678098 force-pushed the 241027_callback_opt branch 3 times, most recently from 61e1b87 to db16c77 Compare October 31, 2024 03:46
@678098 678098 force-pushed the 241027_callback_opt branch from db16c77 to 7b8f8b6 Compare December 11, 2024 09:00
Collaborator

@pniedzielski pniedzielski left a comment

This change looks good to me. I think main has moved a little since this was created, so once you're able to rebase, I can do a quick second review and approve.

Collaborator Author

678098 commented Dec 19, 2024

@pniedzielski I need to refine this a bit and add performance measurements.

@678098 678098 force-pushed the 241027_callback_opt branch 4 times, most recently from 62cd35e to 8b62d79 Compare February 19, 2025 11:58
Collaborator Author

678098 commented Feb 19, 2025

Benchmarks

Mac M2

========= Benchmark 1: function call ==========
Build a functor once and call it multiple times
===============================================

bsl::function<...>():
       total: 190.52 ms (10000000 iterations)
    per call: 19 ns

bmqu::ManagedCallback(vector):
       total: 44.82 ms (10000000 iterations)
    per call: 4 ns

bmqu::ManagedCallback(char[]):
       total: 32.82 ms (10000000 iterations)
    per call: 3 ns

========= Benchmark 2: new construct ==========
Build a functor multiple times without calling
===============================================

bsl::function<...>():
       total: 2.66 s (10000000 iterations)
    per call: 266 ns

bmqu::ManagedCallback(vector):
       total: 589.22 ms (10000000 iterations)
    per call: 58 ns

bmqu::ManagedCallback(char[]):
       total: 521.09 ms (10000000 iterations)
    per call: 52 ns

========= Benchmark 3: reuse functor ==========
Reset and call a functor multiple times
===============================================

bsl::function<...>():
       total: 3.65 s (10000000 iterations)
    per call: 364 ns

bmqu::ManagedCallback(vector):
       total: 156.11 ms (10000000 iterations)
    per call: 15 ns

bmqu::ManagedCallback(char[]):
       total: 111.08 ms (10000000 iterations)
    per call: 11 ns

===== Benchmark 4: reuse complex functor ======
Reset and call a complex functor multiple times
===============================================

bsl::function<...>():
       total: 9.09 s (10000000 iterations)
    per call: 908 ns

bmqu::ManagedCallback(vector):
       total: 277.53 ms (10000000 iterations)
    per call: 27 ns

bmqu::ManagedCallback(char[]):
       total: 231.95 ms (10000000 iterations)
    per call: 23 ns

Linux (amd64)

========= Benchmark 1: function call ==========
Build a functor once and call it multiple times
===============================================

bsl::function<...>():
       total: 26.77 ms (10000000 iterations)
    per call: 2 ns

bmqu::ManagedCallback(vector):
       total: 20.94 ms (10000000 iterations)
    per call: 2 ns

bmqu::ManagedCallback(char[]):
       total: 20.06 ms (10000000 iterations)
    per call: 2 ns

========= Benchmark 2: new construct ==========
Build a functor multiple times without calling
===============================================

bsl::function<...>():
       total: 1.78 s (10000000 iterations)
    per call: 177 ns

bmqu::ManagedCallback(vector):
       total: 1.92 s (10000000 iterations)
    per call: 191 ns

bmqu::ManagedCallback(char[]):
       total: 994.72 ms (10000000 iterations)
    per call: 99 ns

========= Benchmark 3: reuse functor ==========
Reset and call a functor multiple times
===============================================

bsl::function<...>():
       total: 2.10 s (10000000 iterations)
    per call: 210 ns

bmqu::ManagedCallback(vector):
       total: 65.19 ms (10000000 iterations)
    per call: 6 ns

bmqu::ManagedCallback(char[]):
       total: 36.80 ms (10000000 iterations)
    per call: 3 ns

===== Benchmark 4: reuse complex functor ======
Reset and call a complex functor multiple times
===============================================

bsl::function<...>():
       total: 2.69 s (10000000 iterations)
    per call: 268 ns

bmqu::ManagedCallback(vector):
       total: 105.49 ms (10000000 iterations)
    per call: 10 ns

bmqu::ManagedCallback(char[]):
       total: 63.89 ms (10000000 iterations)
    per call: 6 ns

On amd64, benchmark 1 shows a ~20-25% speed-up when repeatedly calling a functor through a managed callback (20.94 ms and 20.06 ms vs 26.77 ms).

On amd64, benchmark 2 shows that building new ManagedCallback(vector) objects is ~8% slower than building bsl::function with one argument (1.92 s vs 1.78 s). This happens because we reallocate memory every time. In the real usage scenario we should always reuse the pre-built managed callbacks, so this degradation will not be observable.

Benchmarks 3 & 4 reflect the real usage scenario in mqbi::DispatcherEvent, and in these scenarios ManagedCallback(vector) outperforms bsl::function by 26-35 times on amd64 (by per-call time). ManagedCallback(char[]) is even faster, but its usage is currently more error-prone, because we don't have static asserts to ensure that the callback always fits the internal buffer.
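
As an illustration of the missing safeguard, a fixed-buffer holder could reject oversized functors at compile time once static_assert is available. This is only a sketch: `InplaceFunctor`, `FixedBufferCallback`, `SetValueFunctor` and the capacity parameter are hypothetical names, not part of this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <type_traits>
#include <utility>

// Hypothetical base type for functors stored in the fixed buffer.
struct InplaceFunctor {
    virtual ~InplaceFunctor() {}
    virtual void operator()() = 0;
};

// Hypothetical holder with a fixed char[] buffer of CAPACITY bytes.
template <std::size_t CAPACITY>
class FixedBufferCallback {
    alignas(std::max_align_t) char  d_buffer[CAPACITY];
    InplaceFunctor                 *d_callback_p;  // null when empty

  public:
    FixedBufferCallback() : d_callback_p(nullptr) {}
    ~FixedBufferCallback() { reset(); }

    FixedBufferCallback(const FixedBufferCallback&)            = delete;
    FixedBufferCallback& operator=(const FixedBufferCallback&) = delete;

    // Destroy the stored functor, if any.
    void reset()
    {
        if (d_callback_p) {
            d_callback_p->~InplaceFunctor();
            d_callback_p = nullptr;
        }
    }

    template <class FUNCTOR, class... ARGS>
    void emplace(ARGS&&... args)
    {
        static_assert(std::is_base_of<InplaceFunctor, FUNCTOR>::value,
                      "FUNCTOR must derive from InplaceFunctor");
        // The check the char[] variant currently lacks: a functor that does
        // not fit is rejected when the code is compiled.
        static_assert(sizeof(FUNCTOR) <= CAPACITY,
                      "FUNCTOR does not fit the fixed buffer");
        static_assert(alignof(FUNCTOR) <= alignof(std::max_align_t),
                      "over-aligned functors are not supported");
        reset();
        d_callback_p = new (d_buffer) FUNCTOR(std::forward<ARGS>(args)...);
    }

    void operator()()
    {
        assert(d_callback_p && "no functor stored");
        (*d_callback_p)();
    }

    bool empty() const { return d_callback_p == nullptr; }
};

// Example functor: writes a value through a pointer owned by the caller.
struct SetValueFunctor : InplaceFunctor {
    int *d_out_p;
    int  d_value;
    SetValueFunctor(int *out, int value) : d_out_p(out), d_value(value) {}
    void operator()() override { *d_out_p = d_value; }
};
```

With this shape, `FixedBufferCallback<64>` accepts `SetValueFunctor`, while a capacity smaller than `sizeof(SetValueFunctor)` fails to compile instead of overrunning the buffer at run time.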

The Mac benchmarks are less interesting than the amd64 ones, but they still show a good speed-up in the real scenario.

@678098 678098 force-pushed the 241027_callback_opt branch 2 times, most recently from 4ef7577 to 6cdaa03 Compare February 19, 2025 12:41

@bmq-oss-ci bmq-oss-ci bot left a comment

Build 2481 of commit 6cdaa03 has completed with FAILURE

Collaborator Author

678098 commented Feb 19, 2025

Screenshot 2025-02-19 at 17 00 30

The top 3 runs are with the changes in this PR; the bottom one is the previous highest stable produce rate, highlighted in the benchmarks blog post. On a fanout30 domain, the confirm optimization gives us an extra +1k msgs/s PUT rate in the same scenario (fanout PUSH rate +30k msgs/s).

@678098 678098 force-pushed the 241027_callback_opt branch from 6cdaa03 to c9e0578 Compare February 20, 2025 12:04
/// `CallbackFunctor` type.
/// TODO: replace by static_assert on C++ standard update
BSLS_ASSERT_SAFE(0 == static_cast<CALLBACK_TYPE*>(
reinterpret_cast<CallbackFunctor*>(0)));
Collaborator Author

This produces a compile-time error if we try to pass a template argument that is not derived from CallbackFunctor:

/blazingmq/src/groups/mqb/mqbi/mqbi_dispatcher.h:537:31: error: static_cast from 'CallbackFunctor *' to '(anonymous namespace)::Dummy *', which are not related by inheritance, is not allowed
  537 |         BSLS_ASSERT_SAFE(0 == static_cast<CALLBACK_TYPE*>(
      |                               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
  538 |                                   reinterpret_cast<CallbackFunctor*>(0)));
      |                                   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
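
For reference, the static_assert form mentioned in the TODO could look roughly like the sketch below on C++11 and later. `FunctorBase`, `runChecked`, `CountingFunctor` and `Dummy` are hypothetical stand-ins. Note that `std::is_base_of` also accepts private or ambiguous bases, which the `static_cast` trick rejects, so the two checks are close but not identical.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for the CallbackFunctor base type.
struct FunctorBase {
    virtual ~FunctorBase() {}
    virtual void operator()() = 0;
};

// Invoke 'callback' only if its type derives from FunctorBase; any other
// type is rejected with a readable message at compile time.
template <class CALLBACK_TYPE>
void runChecked(CALLBACK_TYPE& callback)
{
    static_assert(std::is_base_of<FunctorBase, CALLBACK_TYPE>::value,
                  "CALLBACK_TYPE must derive from FunctorBase");
    callback();
}

// Accepted: derives from FunctorBase.
struct CountingFunctor : FunctorBase {
    int d_calls;
    CountingFunctor() : d_calls(0) {}
    void operator()() override { ++d_calls; }
};

// Rejected: unrelated type, so 'runChecked(dummy)' would not compile.
struct Dummy {
    void operator()() {}
};
```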

@678098 678098 force-pushed the 241027_callback_opt branch 2 times, most recently from 6bdb4a6 to 8731c30 Compare February 20, 2025 12:56
@678098 678098 changed the title Perf[MQB]: callback construction in a fixed buffer Perf[MQB]: callback construction in a reusable buffer Feb 20, 2025

@bmq-oss-ci bmq-oss-ci bot left a comment

Build 2489 of commit 8731c30 has completed with FAILURE

@678098 678098 force-pushed the 241027_callback_opt branch from 8731c30 to abea187 Compare February 20, 2025 15:29
private:
// DATA
/// Reusable buffer holding the stored callback.
bsl::vector<char> d_callbackBuffer;
Collaborator Author

Note that if the object stored in this buffer is not trivially copyable, we cannot simply copy ManagedCallback objects. For this reason I removed the copy constructor and copy-assignment operator. An alternative is to store d_callbackBuffer under a shared_ptr, but I don't want to introduce this overhead unless we have a good reason to.
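
The constraint can be expressed with standard type traits. In the sketch below, `BufferHolder`, `TextFunctor`, `IndexFunctor` and `canCopyBytes` are hypothetical names used only for illustration: a functor owning a std::string is not trivially copyable, so duplicating the bytes of the buffer would leave two objects believing they own the same string storage.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical holder owning a raw byte buffer with an object built in it.
// Copying is disabled because the bytes alone do not describe the object.
class BufferHolder {
    std::vector<char> d_buffer;

  public:
    BufferHolder() {}
    BufferHolder(const BufferHolder&)            = delete;
    BufferHolder& operator=(const BufferHolder&) = delete;
};

// Owns heap memory through std::string: not trivially copyable.
struct TextFunctor {
    std::string d_text;
    void operator()() const {}
};

// Plain data only: trivially copyable, a byte-wise copy would be valid.
struct IndexFunctor {
    int d_index;
    void operator()() const {}
};

// True only when copying the raw bytes yields a valid second object.
template <class FUNCTOR>
constexpr bool canCopyBytes()
{
    return std::is_trivially_copyable<FUNCTOR>::value;
}
```

A holder that must accept functors like `TextFunctor` therefore either forbids copying, as here, or shares a single buffer between copies.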

@pniedzielski pniedzielski self-assigned this Feb 20, 2025
@678098 678098 force-pushed the 241027_callback_opt branch from abea187 to e357ec4 Compare February 20, 2025 16:19
@678098 678098 changed the title Perf[MQB]: callback construction in a reusable buffer Perf[BMQ,MQB]: callback construction in a reusable buffer Feb 20, 2025

@bmq-oss-ci bmq-oss-ci bot left a comment

Build 2491 of commit e357ec4 has completed with FAILURE

@678098 678098 force-pushed the 241027_callback_opt branch from e357ec4 to 00f7be9 Compare February 20, 2025 18:24
Signed-off-by: Evgeny Malygin <emalygin@bloomberg.net>
@678098 678098 force-pushed the 241027_callback_opt branch from 00f7be9 to a127128 Compare February 20, 2025 18:34

@bmq-oss-ci bmq-oss-ci bot left a comment

Build 2495 of commit a127128 has completed with FAILURE
