Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[improve][broker] Replaced checkBackloggedCursors with checkBackloggedCursor(single subscription check) upon subscription #19343

Merged

Conversation

heesung-sn
Copy link
Contributor

@heesung-sn heesung-sn commented Jan 28, 2023

Motivation

Related to #19341,

  • When there are many concurrent subscriptions in a topic(10k> ), brokers' performance degrades because checkBackloggedCursors() iterates all subscriptions for each subscribe call.
"pulsar-io-4-2" #77 prio=5 os_prio=0 cpu=17289.90ms elapsed=170.09s tid=0x00007f85b0085800 nid=0xa2 runnable  [0x00007f859d05e000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl.deactivateCursor(ManagedLedgerImpl.java:3526)
        - locked <0x00007f8921b24c58> (a org.apache.bookkeeper.mledger.impl.ManagedCursorContainer)
        at org.apache.bookkeeper.mledger.impl.ManagedCursorImpl.setInactive(ManagedCursorImpl.java:1116)
        at org.apache.pulsar.broker.service.persistent.PersistentTopic.lambda$checkBackloggedCursors$95(PersistentTopic.java:2363)
        at org.apache.pulsar.broker.service.persistent.PersistentTopic$$Lambda$1346/0x00007f8465c23cb0.accept(Unknown Source)
        at org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap$Section.forEach(ConcurrentOpenHashMap.java:544)
        at org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap.forEach(ConcurrentOpenHashMap.java:272)
--
        at java.lang.Thread.run(java.base@11.0.17/Thread.java:829)

"pulsar-io-4-3" #78 prio=5 os_prio=0 cpu=17224.94ms elapsed=170.09s tid=0x00007f85b0087000 nid=0xa3 waiting for monitor entry  [0x00007f859cf5d000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl.deactivateCursor(ManagedLedgerImpl.java:3526)
        - waiting to lock <0x00007f8921b24c58> (a org.apache.bookkeeper.mledger.impl.ManagedCursorContainer)
        at org.apache.bookkeeper.mledger.impl.ManagedCursorImpl.setInactive(ManagedCursorImpl.java:1116)
        at org.apache.pulsar.broker.service.persistent.PersistentTopic.lambda$checkBackloggedCursors$95(PersistentTopic.java:2363)
        at org.apache.pulsar.broker.service.persistent.PersistentTopic$$Lambda$1346/0x00007f8465c23cb0.accept(Unknown Source)
        at org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap$Section.forEach(ConcurrentOpenHashMap.java:544)
        at org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap.forEach(ConcurrentOpenHashMap.java:272)
--
        at java.lang.Thread.run(java.base@11.0.17/Thread.java:829)
  • Running checkBackloggedCursors() in the subscription path is redundant as checkBackloggedCursors() is called in updateStats() every statsUpdateFrequencyInSecs(by default 1min).

Modifications

Replaced checkBackloggedCursors() with checkBackloggedCursor() in subscribe() call. Upon the subscription creation, this will only check the backloggedCursor from the subscription to create instead of checking all the subscriptions.

Verifying this change

  • Make sure that the change passes the CI checks.

This change can be verified as follows:

simulate many concurrent subscriptions

#!/bin/sh

while :
do
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest -time 3 -ss a &
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest  -time 3 -ss b &
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest  -time 3 -ss c &
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest  -time 3 -ss d &
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest  -time 3 -ss e &
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest  -time 3 -ss f &
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest  -time 3 -ss g &
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest  -time 3 -ss h &
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest  -time 3 -ss i &
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest  -time 3 -ss j &
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest  -time 3 -ss k &
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest  -time 3 -ss l &
   bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp Earliest  -time 3 -ss m &

   for job in `jobs -p`
     do
       echo $job
       wait $job
     done
done

before this PR

bin/pulsar-perf produce my-topic -r 10 -m 1000
2023-01-27T18:47:30,430-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:       0 msg ---      0.0 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean:   0.000 ms - med:   0.000 - 95pct:   0.000 - 99pct:   0.000 - 99.9pct:   0.000 - 99.99pct:   0.000 - Max:   0.000
2023-01-27T18:47:40,510-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     119 msg ---     11.8 msg/s ---      0.1 Mbit/s  --- failure      0.0 msg/s --- Latency: mean: 6834.356 ms - med: 6834.815 - 95pct: 12128.831 - 99pct: 12629.887 - 99.9pct: 12728.447 - 99.99pct: 12728.447 - Max: 12728.447
2023-01-27T18:47:50,531-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     119 msg ---      0.0 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean:   0.000 ms - med:   0.000 - 95pct:   0.000 - 99pct:   0.000 - 99.9pct:   0.000 - 99.99pct:   0.000 - Max:   0.000
2023-01-27T18:48:00,555-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     119 msg ---      0.0 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean:   0.000 ms - med:   0.000 - 95pct:   0.000 - 99pct:   0.000 - 99.9pct:   0.000 - 99.99pct:   0.000 - Max:   0.000
2023-01-27T18:48:10,583-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     119 msg ---      0.0 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean:   0.000 ms - med:   0.000 - 95pct:   0.000 - 99pct:   0.000 - 99.9pct:   0.000 - 99.99pct:   0.000 - Max:   0.000
2023-01-27T18:48:20,616-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     119 msg ---      0.0 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean:   0.000 ms - med:   0.000 - 95pct:   0.000 - 99pct:   0.000 - 99.9pct:   0.000 - 99.99pct:   0.000 - Max:   0.000
2023-01-27T18:48:30,656-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     119 msg ---      0.0 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean:   0.000 ms - med:   0.000 - 95pct:   0.000 - 99pct:   0.000 - 99.9pct:   0.000 - 99.99pct:   0.000 - Max:   0.000
2023-01-27T18:48:40,683-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     119 msg ---      0.0 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean:   0.000 ms - med:   0.000 - 95pct:   0.000 - 99pct:   0.000 - 99.9pct:   0.000 - 99.99pct:   0.000 - Max:   0.000
2023-01-27T18:48:50,711-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     119 msg ---      0.0 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean:   0.000 ms - med:   0.000 - 95pct:   0.000 - 99pct:   0.000 - 99.9pct:   0.000 - 99.99pct:   0.000 - Max:   0.000
2023-01-27T18:49:00,734-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     119 msg ---      0.0 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean:   0.000 ms - med:   0.000 - 95pct:   0.000 - 99pct:   0.000 - 99.9pct:   0.000 - 99.99pct:   0.000 - Max:   0.000
2023-01-27T18:49:01,300-0800 [pulsar-perf-producer-exec-1-1] INFO  org.apache.pulsar.testclient.PerformanceProducer - ------------- DONE (reached the maximum number: 1000 of production) --------------
2023-01-27T18:49:06,306-0800 [Thread-0] INFO  org.apache.pulsar.testclient.PerformanceProducer - Aggregated throughput stats --- 119 records sent --- 1.123 msg/s --- 0.009 Mbit/s
2023-01-27T18:49:06,337-0800 [Thread-0] INFO  org.apache.pulsar.testclient.PerformanceProducer - Aggregated latency stats --- Latency: mean: 6834.356 ms - med: 6834.815 - 95pct: 12128.831 - 99pct: 12629.887 - 99.9pct: 12728.447 - 99.99pct: 12728.447 - 99.999pct: 12728.447 - Max: 12728.447

after this PR

bin/pulsar-perf produce my-topic -r 10 -m 1000
2023-01-28T17:10:28,675-0800 [pulsar-perf-producer-exec-1-1] INFO  org.apache.pulsar.testclient.PerformanceProducer - Created 1 producers
2023-01-28T17:10:37,763-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:      30 msg ---      3.0 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean: 3038.212 ms - med: 2437.903 - 95pct: 5232.415 - 99pct: 5752.511 - 99.9pct: 5752.511 - 99.99pct: 5752.511 - Max: 5752.511
2023-01-28T17:10:47,794-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:      65 msg ---      3.5 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean: 9431.738 ms - med: 9554.751 - 95pct: 12154.943 - 99pct: 12368.895 - 99.9pct: 12368.895 - 99.99pct: 12368.895 - Max: 12368.895
2023-01-28T17:10:57,822-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:      95 msg ---      3.0 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean: 15943.402 ms - med: 15274.559 - 95pct: 19353.727 - 99pct: 19418.239 - 99.9pct: 19418.239 - 99.99pct: 19418.239 - Max: 19418.239
2023-01-28T17:11:07,847-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     129 msg ---      3.4 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean: 22438.837 ms - med: 21504.895 - 95pct: 25598.207 - 99pct: 26323.839 - 99.9pct: 26323.839 - 99.99pct: 26323.839 - Max: 26323.839
2023-01-28T17:11:17,869-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     169 msg ---      4.0 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean: 30270.118 ms - med: 30489.343 - 95pct: 32337.919 - 99pct: 32541.439 - 99.9pct: 32541.439 - 99.99pct: 32541.439 - Max: 32541.439
2023-01-28T17:11:27,905-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     219 msg ---      5.0 msg/s ---      0.0 Mbit/s  --- failure      0.0 msg/s --- Latency: mean: 34567.780 ms - med: 34557.951 - 95pct: 35730.943 - 99pct: 36754.431 - 99.9pct: 36754.431 - 99.99pct: 36754.431 - Max: 36754.431
2023-01-28T17:11:37,933-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     290 msg ---      7.1 msg/s ---      0.1 Mbit/s  --- failure      0.0 msg/s --- Latency: mean: 40229.598 ms - med: 40292.095 - 95pct: 41613.823 - 99pct: 41755.391 - 99.9pct: 41793.023 - 99.99pct: 41793.023 - Max: 41793.023
2023-01-28T17:11:47,974-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     407 msg ---     11.7 msg/s ---      0.1 Mbit/s  --- failure      0.0 msg/s --- Latency: mean: 40049.042 ms - med: 40224.767 - 95pct: 41840.383 - 99pct: 41926.143 - 99.9pct: 42023.679 - 99.99pct: 42023.679 - Max: 42023.679
2023-01-28T17:11:58,005-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     634 msg ---     22.6 msg/s ---      0.2 Mbit/s  --- failure      0.0 msg/s --- Latency: mean: 33787.342 ms - med: 34910.975 - 95pct: 38198.783 - 99pct: 38539.263 - 99.9pct: 38679.039 - 99.99pct: 38679.039 - Max: 38679.039
2023-01-28T17:12:08,028-0800 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:     994 msg ---     35.9 msg/s ---      0.3 Mbit/s  --- failure      0.0 msg/s --- Latency: mean: 11431.892 ms - med: 11016.831 - 95pct: 24443.903 - 99pct: 25750.655 - 99.9pct: 26150.911 - 99.99pct: 26150.911 - Max: 26150.911
2023-01-28T17:12:08,578-0800 [pulsar-perf-producer-exec-1-1] INFO  org.apache.pulsar.testclient.PerformanceProducer - ------------- DONE (reached the maximum number: 1000 of production) --------------
2023-01-28T17:12:13,584-0800 [Thread-0] INFO  org.apache.pulsar.testclient.PerformanceProducer - Aggregated throughput stats --- 1000 records sent --- 9.439 msg/s --- 0.074 Mbit/s
2023-01-28T17:12:13,612-0800 [Thread-0] INFO  org.apache.pulsar.testclient.PerformanceProducer - Aggregated latency stats --- Latency: mean: 23949.955 ms - med: 26380.799 - 95pct: 41126.399 - 99pct: 41821.183 - 99.9pct: 41926.143 - 99.99pct: 42023.679 - 99.999pct: 42023.679 - Max: 42023.679

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: heesung-sn#24

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Jan 28, 2023
@Technoboy- Technoboy- added this to the 2.12.0 milestone Jan 28, 2023
@Technoboy- Technoboy- closed this Jan 28, 2023
@Technoboy- Technoboy- reopened this Jan 28, 2023
@codecov-commenter
Copy link

codecov-commenter commented Jan 28, 2023

Codecov Report

Merging #19343 (3cb8c55) into master (644be5f) will increase coverage by 30.08%.
The diff coverage is 87.50%.

Impacted file tree graph

@@              Coverage Diff              @@
##             master   #19343       +/-   ##
=============================================
+ Coverage     32.42%   62.50%   +30.08%     
- Complexity     6347    25420    +19073     
=============================================
  Files          1644     1818      +174     
  Lines        123712   133112     +9400     
  Branches      13486    14644     +1158     
=============================================
+ Hits          40109    83205    +43096     
+ Misses        77694    42231    -35463     
- Partials       5909     7676     +1767     
Flag Coverage Δ
inttests 24.98% <75.00%> (+0.04%) ⬆️
systests 25.52% <87.50%> (-0.14%) ⬇️
unittests 59.66% <87.50%> (+42.12%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...sar/broker/service/persistent/PersistentTopic.java 63.44% <87.50%> (+12.37%) ⬆️
...service/nonpersistent/NonPersistentReplicator.java 58.41% <0.00%> (-1.99%) ⬇️
...in/java/org/apache/pulsar/common/api/AuthData.java 71.42% <0.00%> (ø)
...ava/org/apache/pulsar/client/api/schema/Field.java 80.00% <0.00%> (ø)
.../apache/pulsar/broker/namespace/LookupOptions.java 87.50% <0.00%> (ø)
.../apache/pulsar/common/naming/SystemTopicNames.java 55.55% <0.00%> (ø)
...org/apache/pulsar/common/io/BatchSourceConfig.java 50.00% <0.00%> (ø)
...pache/pulsar/common/configuration/BindAddress.java 22.22% <0.00%> (ø)
...g/apache/pulsar/common/functions/WindowConfig.java 92.85% <0.00%> (ø)
...lsar/client/impl/conf/ReaderConfigurationData.java 81.39% <0.00%> (ø)
... and 1184 more

@mattisonchao
Copy link
Member

Question:
When this logic is removed, if the cursor has no consumers and it is set to an inactive state by updateStats. Will a new subscription activate the cursor?

@mattisonchao mattisonchao self-requested a review January 28, 2023 11:04
Copy link
Contributor

@merlimat merlimat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to call it only for the single subscription, instead of for all of them

@heesung-sn
Copy link
Contributor Author

Also,

If we still want to call this operation, checkBackloggedCursors asynchronously but more frequently, we could run such async operations in a separate thread and run it more frequently, every x milliseconds(out side of updateStats), if there are any requests (with deduplication).

// Example
'''
AsyncOpManager:

var dedupedOps

void schedule(Op op) // dedup/add op to dedupedOps.

void start(){

while(true)
// wakeup every x milliseconds s(configurable) and run the deduped ops

}
'''

@heesung-sn heesung-sn force-pushed the removeCheckBackloggedCursorsFromSubscribe branch from 1f41b31 to 70f1883 Compare January 29, 2023 01:18
@heesung-sn
Copy link
Contributor Author

heesung-sn commented Jan 29, 2023

I think it would be better to call it only for the single subscription, instead of for all of them

Updated the code to check the single subscription.

@heesung-sn heesung-sn force-pushed the removeCheckBackloggedCursorsFromSubscribe branch from 70f1883 to 559194d Compare January 29, 2023 01:26
…dCursor(single subscription check) from subscribe
@heesung-sn heesung-sn force-pushed the removeCheckBackloggedCursorsFromSubscribe branch from 559194d to 3cb8c55 Compare January 29, 2023 01:27
@heesung-sn heesung-sn changed the title [improve][pulsar] Removed checkBackloggedCursors() from subscribe [improve][pulsar] Replaced checkBackloggedCursors with checkBackloggedCursor(single subscription check) from subscribe Jan 29, 2023
Copy link
Member

@mattisonchao mattisonchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mattisonchao
Copy link
Member

@heesung-sn Please change the PR title.

Unknown scope "pulsar" found in pull request title "[improve][pulsar] Replaced checkBackloggedCursors with checkBackloggedCursor(single subscription check) from subscribe". Use one of the available scopes: admin, bk, broker, build, ci, cli, client, doc, fn, io, meta, misc, ml, monitor, offload, proxy, schema, sec, site, sql, storage, test, txn, ws, zk.

@heesung-sn heesung-sn changed the title [improve][pulsar] Replaced checkBackloggedCursors with checkBackloggedCursor(single subscription check) from subscribe [improve][broker] Replaced checkBackloggedCursors with checkBackloggedCursor(single subscription Jan 29, 2023
@heesung-sn
Copy link
Contributor Author

Oops, thanks. Updated the title.

@heesung-sn heesung-sn changed the title [improve][broker] Replaced checkBackloggedCursors with checkBackloggedCursor(single subscription [improve][broker] Replaced checkBackloggedCursors with checkBackloggedCursor(single subscription check) upon subscription Jan 29, 2023
Copy link
Contributor

@merlimat merlimat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member

@tisonkun tisonkun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution @heesung-sn! Merging..

@tisonkun tisonkun merged commit ea6641e into apache:master Jan 29, 2023
merlimat pushed a commit that referenced this pull request Jan 30, 2023
…dCursor(single subscription check) upon subscription (#19343)
merlimat pushed a commit that referenced this pull request Jan 30, 2023
…dCursor(single subscription check) upon subscription (#19343)
nicoloboschi pushed a commit to datastax/pulsar that referenced this pull request Feb 28, 2023
…dCursor(single subscription check) upon subscription (apache#19343)

(cherry picked from commit 6b067a5)
@heesung-sn heesung-sn deleted the removeCheckBackloggedCursorsFromSubscribe branch April 2, 2024 17:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc-not-needed Your PR changes do not impact docs ready-to-test
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants