
KAFKA-17367: Share coordinator impl. Broker side code. [2/N] #17011

Merged: 35 commits into apache:trunk on Sep 10, 2024

Conversation

@smjn (Contributor) commented Aug 27, 2024:

  • Added impl for ShareCoordinatorService and ShareCoordinatorShard
  • Moved group-coordinator: GroupCoordinatorRuntimeMetrics -> coordinator-common: CoordinatorRuntimeMetricsImpl. The new impl class can be inherited and used in both group and share coordinators.
  • Added tests wherever applicable
  • Added plumbing in BrokerServer and BrokerMetadataPublisher to create share coordinator and start/stop it.
  • Added ShareCoordinatorConfig class to house various coordinator related configs.
  • Added code to create share state topic in AutoTopicCreationManager.scala
  • Added share.coordinator.state.topic.replication.factor=1 and share.coordinator.state.topic.min.isr=1 to config/kraft/{broker.properties, controller.properties, server.properties} to make it easier for people to try out.

@mumrah (Contributor) left a comment:

Thanks for the patch @smjn! This is a pretty big PR, so I've just reviewed part of it for now.

Regarding the leader epoch, it seems like we are checking the MetadataImage for some things (like if a topic exists), but using the given leader epoch in the RPC to see if the leader has changed. I wonder if we should be checking the given leader epoch against the MetadataImage.

Where are we handling topic deletions and re-creations? Maybe this is coming in a future PR?

@@ -195,6 +202,7 @@ public KafkaApis build() {
replicaManager,
groupCoordinator,
txnCoordinator,
shareCoordinator,
Contributor:

Need a null check above

Contributor Author (@smjn):

The share coordinator can be null: there is a specific flag on the basis of which the share coordinator is initialized in BrokerServer; otherwise it is null. I can use an Optional perhaps.

@@ -534,7 +534,8 @@ class KafkaServer(
Some(adminManager),
Some(kafkaController),
groupCoordinator,
transactionCoordinator
transactionCoordinator,
null
Contributor:

Use Option instead of null here.

int partition = topicEntry.getKey();
try {
long timeTaken = time.hiResClockMs() - startTime;
WriteShareGroupStateResponseData partitionData = future.get();
Contributor:

This will block indefinitely if the future is not complete. I see we have combined these futures up on L309, but maybe we can replace this get with either getNow(null) or get(0L, TimeUnit.MILLISECONDS) just to ensure we're never blocking here.
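A minimal self-contained sketch of the two suggested alternatives (generic Java, not the PR's code):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class NonBlockingGetSketch {
    public static void main(String[] args) throws Exception {
        CompletableFuture<String> f = CompletableFuture.completedFuture("done");

        // getNow is non-blocking: it returns the fallback if the future is incomplete.
        String v1 = f.getNow(null);

        // A zero-timeout get throws TimeoutException rather than blocking forever.
        String v2 = f.get(0L, TimeUnit.MILLISECONDS);

        System.out.println(v1 + " / " + v2);
    }
}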

Comment on lines 4450 to 4451
shareCoordinator.readState(request.context, readShareGroupStateRequest.data)
.thenAccept(data => requestHelper.sendMaybeThrottle(request, new ReadShareGroupStateResponse(data)))
Contributor:

You probably want CompletableFuture#handle here instead. That will allow you to handle exceptions from the future chain explicitly. See #12403 for some context.
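A generic Java sketch of the handle pattern (the names here are illustrative, not KafkaApis code):

import java.util.concurrent.CompletableFuture;

public class HandleSketch {
    public static void main(String[] args) {
        CompletableFuture<String> readFuture = CompletableFuture.supplyAsync(() -> {
            throw new RuntimeException("coordinator not available");
        });

        // handle runs whether the stage completed normally or exceptionally,
        // so the error path can build a response instead of leaking the exception.
        readFuture.handle((data, ex) -> {
            if (ex != null) {
                return "error response: " + ex.getMessage();
            }
            return "response: " + data;
        }).thenAccept(System.out::println).join();
    }
}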

Comment on lines 516 to 518
private static <P> boolean isEmpty(List<P> list) {
return list == null || list.isEmpty() || list.get(0) == null;
}
Contributor:

In what case would we expect a list with a null first element?

Contributor Author (@smjn):

Fixed

log.debug("ShareCoordinatorService writeState request dump - {}", request);

String groupId = request.groupId();
Map<Uuid, Map<Integer, CompletableFuture<WriteShareGroupStateResponseData>>> futureMap = new HashMap<>();
Contributor:

nit: move this down closer to where it's used

WriteShareGroupStateResponseData partitionData = future.get();
// This is the future returned by runtime.scheduleWriteOperation which returns when the
// operation has completed including
shareCoordinatorMetrics.globalSensors.get(ShareCoordinatorMetrics.SHARE_COORDINATOR_WRITE_LATENCY_SENSOR_NAME)
Contributor:

I'm not sure this is the pattern we want for metrics. In the group coordinator, I see we are calling record(String sensorName) rather than accessing globalSensors directly.

Comment on lines 313 to 316
List<WriteShareGroupStateResponseData.WriteStateResult> writeStateResults = futureMap.keySet().stream()
.map(topicId -> {
List<WriteShareGroupStateResponseData.PartitionResult> partitionResults = futureMap.get(topicId).entrySet().stream()
.map(topicEntry -> {
Contributor:

The nested streams make the control flow here hard to follow. See if you can untangle it a bit. Maybe regular for loops would be more readable.

.setStateEpoch(newStateEpoch)
.setStateBatches(batchesToAdd)
.build()));
snapshotUpdateCount.put(key, 0);
Contributor:

We should not update a timeline structure here. Recall that this method generates the records and the eventual response for the proposed write. We cannot update our in-memory state until the write is committed.

It looks like snapshotUpdateCount is used to determine when we should write a snapshot instead of a delta (L299). We can reset this counter to zero in handleShareSnapshot as we're replaying records
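A hypothetical sketch of that replay-time reset; shareStateMap and snapshotUpdateCount are this PR's fields, while the helpers marked below are assumed:

// Hypothetical: reset the counter while replaying the committed record,
// not while generating it.
private void handleShareSnapshot(ShareSnapshotKey key, ShareSnapshotValue value) {
    SharePartitionKey mapKey = SharePartitionKey.getInstance(
        key.groupId(), key.topicId(), key.partition());   // assumed helper
    shareStateMap.put(mapKey, ShareGroupOffset.fromRecord(value));  // assumed helper
    // A full snapshot supersedes earlier ShareUpdate deltas, so the
    // per-partition update counter restarts from zero here.
    snapshotUpdateCount.put(mapKey, 0);
}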

Comment on lines +428 to +429
// Updating the leader map with the new leader epoch
leaderEpochMap.put(coordinatorKey, leaderEpoch);
Contributor:

Why are we updating the in-memory state during a read? Is this part of the leader epoch fencing?

Contributor Author (@smjn):

Yes, it is. The caller might issue a read without ever making a write request.

Contributor:

Will this logic be replaced by the init RPC once it is added?

In general, we should not be updating our log-based in-memory state on reads. It's probably safe in this case just based on the nature of leader epoch, but it is definitely atypical.

Contributor Author (@smjn):

No, this will stay.
As per the spec, the initialize RPC does not include a leader epoch value which we can reference:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-932%3A+Queues+for+Kafka#KIP932:QueuesforKafka-InitializeShareGroupStateAPI

Contributor:

Ok. Can you add a comment at leaderEpochMap declaration noting that it can be updated on a read request?

Contributor:

Hmm, David has a valid point here. It's not clear why we need to update leaderEpoch on reads. My understanding of the fencing logic is that we just need to prevent an old reader from reading newly updated state. So, updating leaderEpoch on writes is enough. Also note that since this update is not persisted, it will be lost on leader change.
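For reference, a minimal sketch of the write-path fencing being discussed (leaderEpochMap, coordinatorKey and getWriteErrorResponse appear in this PR; the epoch accessor and surrounding shape are assumed):

// Sketch of the write-path check: fence a stale share-partition leader.
Integer lastEpoch = leaderEpochMap.get(coordinatorKey);
if (lastEpoch != null && partitionData.leaderEpoch() < lastEpoch) {  // leaderEpoch() assumed
    return Optional.of(getWriteErrorResponse(
        Errors.FENCED_LEADER_EPOCH, null, topicId, partitionId));
}
// The epoch only becomes durable when the write record is committed and
// replayed; this in-memory update is lost on a coordinator failover.
leaderEpochMap.put(coordinatorKey, partitionData.leaderEpoch());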

@smjn (Contributor Author) commented Aug 27, 2024:

> Thanks for the patch @smjn! This is a pretty big PR, so I've just reviewed part of it for now.
>
> Regarding the leader epoch, it seems like we are checking the MetadataImage for some things (like if a topic exists), but using the given leader epoch in the RPC to see if the leader has changed. I wonder if we should be checking the given leader epoch against the MetadataImage.
>
> Where are we handling topic deletions and re-creations? Maybe this is coming in a future PR?

For EA (early access), only the read and write RPCs will be provided.

@mumrah (Contributor) left a comment:

Thanks for the updates 👍 Some additional comments inline.

}

log.info("Shutting down.");
isActive.set(false);
Contributor:

I think this is redundant due to L231

(partitionId, responseFut) -> {
try {
partitionResults.add(
responseFut.get(5000L, TimeUnit.MILLISECONDS).results().get(0).partitions().get(0)
Contributor:

Can you leave a comment here explaining that the future will already be completed at this point?

Comment on lines 145 to 146
if (config.shareCoordinatorSnapshotUpdateRecordsPerSnapshot() < 0 || config.shareCoordinatorSnapshotUpdateRecordsPerSnapshot() > 500)
throw new IllegalArgumentException("SnapshotUpdateRecordsPerSnapshot must be between 0 and 500.");
Contributor:

Shouldn't this kind of validation happen in the ShareCoordinatorConfig class?
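A sketch of what that might look like (the class name is from this PR; the field name and constructor shape are assumptions based on the check above):

// Hypothetical constructor-time validation inside ShareCoordinatorConfig.
public class ShareCoordinatorConfig {
    private final int snapshotUpdateRecordsPerSnapshot;

    public ShareCoordinatorConfig(int snapshotUpdateRecordsPerSnapshot) {
        if (snapshotUpdateRecordsPerSnapshot < 0 || snapshotUpdateRecordsPerSnapshot > 500) {
            throw new IllegalArgumentException(
                "snapshotUpdateRecordsPerSnapshot must be between 0 and 500.");
        }
        this.snapshotUpdateRecordsPerSnapshot = snapshotUpdateRecordsPerSnapshot;
    }

    public int shareCoordinatorSnapshotUpdateRecordsPerSnapshot() {
        return snapshotUpdateRecordsPerSnapshot;
    }
}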


Comment on lines 483 to 488
null, partitionId, Errors.INVALID_TOPIC_EXCEPTION, Errors.INVALID_TOPIC_EXCEPTION.message()));
}

if (partitionId < 0) {
return Optional.of(ReadShareGroupStateResponse.toErrorResponseData(
topicId, partitionId, Errors.INVALID_PARTITIONS, Errors.INVALID_PARTITIONS.message()));
Contributor:

Those error messages are pretty generic. We should be more specific here.

Contributor Author (@smjn):

I'll go with INVALID_REQUEST and a custom message then.

Comment on lines +207 to +210
handleShareSnapshot((ShareSnapshotKey) key.message(), (ShareSnapshotValue) messageOrNull(value));
break;
case ShareCoordinator.SHARE_UPDATE_RECORD_KEY_VERSION: // ShareUpdate
handleShareUpdate((ShareUpdateKey) key.message(), (ShareUpdateValue) messageOrNull(value));
Contributor:

When can there be a null record value?

Contributor:

In the future when we delete these things, I expect we'll write tombstones as markers even though it's not a compacted topic. This enables us to bookkeep the records on the topic and work out what we can prune.

Contributor Author (@smjn):

Yes, when the Delete RPC is introduced (tombstones).

public ReadShareGroupStateResponseData readState(ReadShareGroupStateRequestData request, Long offset) {
throw new RuntimeException("Not implemented");
log.debug("Read request dump - {}", request);
Contributor:

We already have a way to log requests with log4j (log4j.logger.kafka.request.logger), so we don't need to log requests here. If we want to log some other details here (like "reading share state for partitions: [...]") that would be fine.
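For reference, that knob lives in config/log4j.properties; a sketch of enabling it (the appender name follows Kafka's shipped default config):

# DEBUG logs request summaries; TRACE includes full request/response contents.
log4j.logger.kafka.request.logger=DEBUG, requestAppender
log4j.additivity.kafka.request.logger=false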

Comment on lines 496 to 497
if (metadataImage != null && (metadataImage.topics().getTopic(topicId) == null ||
metadataImage.topics().getPartition(topicId, partitionId) == null)) {
Contributor:

As we discussed offline, this usage of MetadataImage is okay despite the fact that it may differ from the metadata known to the share partition leader.

For the newly created topic scenario, the UNKNOWN_TOPIC_OR_PARTITION is returned which is retriable. For the case of a recently deleted topic, there's no harm in letting the write go through.

@smjn (Contributor Author) commented Aug 29, 2024:

> Will this logic be replaced by the init RPC once it is added?
>
> In general, we should not be updating our log-based in-memory state on reads. It's probably safe in this case just based on the nature of leader epoch, but it is definitely atypical.

No, it will stay.
The initialize RPC does not include any information about the leader epoch which we can store and reference.
We are going by the epoch definition.

@AndrewJSchofield (Contributor) left a comment:

Thanks for the PR. A first round of review comments.

shareCoordinator match {
case None => requestHelper.sendResponseMaybeThrottle(request, requestThrottleMs =>
readShareGroupStateRequest.getErrorResponse(requestThrottleMs,
new ApiException("Share coordinator is not configured.")))
Contributor:

I would say that's an internal server error. Only Kafka code issues this RPC. It only does it when the share coordinator is enabled.

shareCoordinator match {
case None => requestHelper.sendResponseMaybeThrottle(request, requestThrottleMs =>
writeShareRequest.getErrorResponse(requestThrottleMs,
new ApiException("Share coordinator is not configured.")))
Contributor:

And this one.

@@ -150,6 +159,34 @@ public GroupCoordinatorMetrics(MetricsRegistry registry, Metrics metrics) {
Collections.singletonMap(CONSUMER_GROUP_COUNT_STATE_TAG, ConsumerGroupState.DEAD.toString())
);

shareGroupCountMetricName = metrics.metricName(
SHARE_GROUP_COUNT_METRIC_NAME,
Contributor:

These metric names do not match the KIP. I am certainly happy to tweak the KIP, but either way the KIP and the code need to match. Let me know what you want to do here @smjn .

String groupId = request.groupId();
// Send an empty response if groupId is invalid
if (isGroupIdEmpty(groupId)) {
log.error("Group id must be specified and non-empty: {}", request);
Contributor:

Probably INVALID_REQUEST if it's empty.

Contributor Author (@smjn):

It was previously decided to respond with an empty response. We even added a test for that.

long timeTaken = time.hiResClockMs() - startTime;
// This is the future returned by runtime.scheduleWriteOperation which returns when the
// operation has completed including
WriteShareGroupStateResponseData partitionData = responseFut.get(5000L, TimeUnit.MILLISECONDS);
Contributor:

Doesn't this only run when the individual futures are complete? I'm surprised you think you need to wait up to 5 seconds here.

@smjn (Contributor Author), Aug 29, 2024:

That was an arbitrary number. Fixed in the next revision.

WriteShareGroupStateResponseData partitionData = responseFut.get(5000L, TimeUnit.MILLISECONDS);
shareCoordinatorMetrics.record(ShareCoordinatorMetrics.SHARE_COORDINATOR_WRITE_LATENCY_SENSOR_NAME, timeTaken);
partitionResults.addAll(partitionData.results().get(0).partitions());
} catch (Exception e) {
Contributor:

Is there a reason why you're not using GroupCoordinatorService.handleOperationException?

Contributor Author (@smjn):

We can.


@AndrewJSchofield (Contributor) left a comment:

A few more comments.

@@ -35,7 +35,7 @@ import org.apache.kafka.common.acl.AclOperation
import org.apache.kafka.common.acl.AclOperation._
import org.apache.kafka.common.config.ConfigResource
import org.apache.kafka.common.errors._
import org.apache.kafka.common.internals.Topic.{GROUP_METADATA_TOPIC_NAME, TRANSACTION_STATE_TOPIC_NAME, isInternal}
import org.apache.kafka.common.internals.Topic.{GROUP_METADATA_TOPIC_NAME, TRANSACTION_STATE_TOPIC_NAME, SHARE_GROUP_STATE_TOPIC_NAME, isInternal}
Contributor:

A trivial point, but I would make the topic names in the import statement alphabetical.

@@ -38,7 +38,8 @@ import org.apache.kafka.common.security.auth.{KafkaPrincipal, KafkaPrincipalSerd
import org.apache.kafka.common.utils.{SecurityUtils, Utils}
import org.apache.kafka.coordinator.transaction.TransactionLogConfigs
import org.apache.kafka.coordinator.group.{GroupCoordinator, GroupCoordinatorConfig}
import org.apache.kafka.server.config.ServerConfigs
import org.apache.kafka.coordinator.share.ShareCoordinator
import org.apache.kafka.server.config.{ServerConfigs, ShareCoordinatorConfig}
Contributor:

I wonder whether ShareCoordinatorConfig should be in package o.a.k.coordinator.share.

Contributor Author (@smjn):

Yes, there are some other interfaces and classes which need to be moved.
I will raise a separate PR for that to not pollute this one.

).asJava

val config = Map(
GroupCoordinatorConfig.NEW_GROUP_COORDINATOR_ENABLE_CONFIG -> "true",
Contributor:

I believe it is the case that the new group coordinator is enabled by default in trunk now. I expect this test no longer needs to enable it.

public static final int STATE_TOPIC_NUM_PARTITIONS_DEFAULT = 50;
public static final String STATE_TOPIC_NUM_PARTITIONS_DOC = "The number of partitions for the share-group state topic (should not change after deployment).";

public static final String STATE_TOPIC_REPLICATION_FACTOR_CONFIG = "share.coordinator.state.topic.replication.factor";
Contributor:

As mentioned in the KIP, the values of share.coordinator.state.topic.replication.factor=1 and share.coordinator.state.topic.min.isr=1 should be included in the Internal Topic Settings section of the properties files in the directory config/kraft. I suggest you do that in this PR since the configs are being introduced now. This will make it a bit easier for people using a single-broker configuration taken directly from GitHub to try this out.

@mumrah (Contributor) left a comment:

Thanks for the updates @smjn! Few more comments.

topicEntry.forEach(
// map of partition id -> responses from api
(partitionId, responseFut) -> {
long timeTaken = time.hiResClockMs() - startTime;
Contributor:

This is an extremely pedantic nitpick, but we should get the current time once, outside of the loop; otherwise the time delta calculation will be somewhat skewed by the execution time of the loop.
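A sketch of the suggested hoisting (the names follow the surrounding code; the metrics call is the one used elsewhere in this PR):

// Read the clock once, before iterating, so loop execution time does not
// inflate the recorded latency for later partitions.
long timeTakenMs = time.hiResClockMs() - startTime;
topicEntry.forEach((partitionId, responseFut) ->
    shareCoordinatorMetrics.record(
        ShareCoordinatorMetrics.SHARE_COORDINATOR_WRITE_LATENCY_SENSOR_NAME,
        timeTakenMs));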

Comment on lines 439 to 440
futureMap.computeIfAbsent(topicId, k -> new HashMap<>());
futureMap.get(topicId).put(partitionData.partition(), future);
Contributor:

The Map "compute" methods return the value, so you don't need to get it again
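That collapses the two lines into one expression, for example:

// computeIfAbsent returns the existing or newly created inner map,
// so the separate get() is unnecessary.
futureMap.computeIfAbsent(topicId, k -> new HashMap<>())
    .put(partitionData.partition(), future);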


// Transform the combined CompletableFuture<Void> into CompletableFuture<ReadShareGroupStateResponseData>
return combinedFuture.thenApply(v -> {
List<ReadShareGroupStateResponseData.ReadStateResult> readStateResult = new LinkedList<>();
Contributor:

Why a linked list here (and other similar places)? Seems like we know the expected size of the resulting list, so we could allocate an ArrayList with the size.
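For example (a sketch; it assumes futureMap.size() matches the expected result count):

// Presize the list since the number of per-topic results is known up front.
List<ReadShareGroupStateResponseData.ReadStateResult> readStateResult =
    new ArrayList<>(futureMap.size());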

Contributor:

@smjn do you plan on fixing this here or in a follow up PR? Either is fine by me.

@junrao (Contributor) left a comment:

@smjn : Thanks for the PR. Left a few comments.

@@ -1689,8 +1694,8 @@ class KafkaApis(val requestChannel: RequestChannel,
(txnCoordinator.partitionFor(key), TRANSACTION_STATE_TOPIC_NAME)

case CoordinatorType.SHARE =>
// When share coordinator support is implemented in KIP-932, a proper check will go here
return (Errors.COORDINATOR_NOT_AVAILABLE, Node.noNode)
// None check already done above
Contributor:

Hmm, what check is the comment referring to?

Contributor Author (@smjn):

ShareCoordinator is an Optional. The line below the comment calls shareCoordinator.get directly, because the shareCoordinator.isEmpty check is already done on line 1703.

*/
@SuppressWarnings("NPathComplexity")
public CoordinatorResult<WriteShareGroupStateResponseData, CoordinatorRecord> writeState(
RequestContext context,
Contributor:

context seems unused?

if (stateEpochMap.containsKey(mapKey) && stateEpochMap.get(mapKey) > partitionData.stateEpoch()) {
return Optional.of(getWriteErrorResponse(Errors.FENCED_STATE_EPOCH, null, topicId, partitionId));
}
if (metadataImage != null && (metadataImage.topics().getTopic(topicId) == null ||
Contributor:

If metadataImage is null, we should return UNKNOWN_TOPIC_OR_PARTITION too, right?

} else {
// start offset is being updated - we should only
// consider new updates to batches
batchesToAdd = partitionData.stateBatches().stream()
Contributor:

Hmm, why do we only consider the new updates when the start offset changes? Consider the following example.

State in ShareCoordinator:
startOffset: 100
Batch1 {
firstOffset: 100
lastOffset: 109
deliverState: Acquired
deliverCount: 1
}
Batch2 {
firstOffset: 110
lastOffset: 119
deliverState: Acquired
deliverCount: 2
}
Batch3 {
firstOffset: 120
lastOffset: 129
deliverState: Acquired
deliverCount: 0
}

  1. Share leader acks batch 1 and sends the state of batch 1 to Share Coordinator.
  2. Share leader advances startOffset to 110.
  3. Share leader acks batch 3 and sends the new startOffset and the state of batch 3 to share coordinator.
  4. Share coordinator writes the snapshot with startOffset 110 and batch 3.

Now the deliver count for batch 2 is lost.

Contributor Author (@smjn):

Fixed, and added a test (testNonSequentialBatchUpdates) for this scenario.

private final Logger log;
private final Time time;
private final CoordinatorTimer<Void, CoordinatorRecord> timer;
private final ShareCoordinatorConfig config;
Contributor:

time, timer and config seem unused?

@smjn (Contributor Author), Sep 3, 2024:

time and timer are required by the CoordinatorShardBuilder runtime. Will remove the fields and keep the methods.

// should be complete as we used CompletableFuture::allOf to get a combined future from
// all futures in the map.
WriteShareGroupStateResponseData partitionData = responseFut.getNow(null);
shareCoordinatorMetrics.record(ShareCoordinatorMetrics.SHARE_COORDINATOR_WRITE_LATENCY_SENSOR_NAME, timeTaken);
Contributor:

timeTaken is the same for all topic partitions in the request. It doesn't seem right to record the same value per partition?

@smjn (Contributor Author) commented Sep 3, 2024:

@junrao Thanks for the review, incorporated comments.

@apoorvmittal10 apoorvmittal10 added the KIP-932 Queues for Kafka label Sep 6, 2024
@mumrah (Contributor) left a comment:

Few more comments, @smjn. I think we're close!

Comment on lines +299 to +302
request.topics().forEach(topicData -> {
Map<Integer, CompletableFuture<WriteShareGroupStateResponseData>> partitionFut =
futureMap.computeIfAbsent(topicData.topicId(), k -> new HashMap<>());
topicData.partitions().forEach(
Contributor:

Personally, I find the logic here a little hard to follow with the nesting and futures. I'd like us to consider refactoring this a bit in a future PR.



// be looping over the keys below and constructing new WriteShareGroupStateRequestData objects to pass
// onto the shard method.
Map<Uuid, Map<Integer, CompletableFuture<WriteShareGroupStateResponseData>>> futureMap = new HashMap<>();
long startTime = time.hiResClockMs();
Contributor:

nit: I'd name this something like "now" or "nowMs" to make it clear what it is. Sometimes "startTime" can refer to something in the far past (like the start of an event).


// time taken for write
shareCoordinatorMetrics.record(ShareCoordinatorMetrics.SHARE_COORDINATOR_WRITE_LATENCY_SENSOR_NAME,
time.hiResClockMs() - startTime);
Contributor:

nit: you have defined startTime above. I think we should use that here.

Contributor Author (@smjn):

This is being used to calculate the delta; we want to record how much time the writeState call took.


@mumrah (Contributor) left a comment:

LGTM

I left another comment about some additional testing, but that can come later. I'd like to land this PR so we can unblock some other work and start making incremental improvements.

Comment on lines +558 to +562
private static List<PersisterOffsetsStateBatch> combineStateBatches(
Collection<PersisterOffsetsStateBatch> currentBatch,
Collection<PersisterOffsetsStateBatch> newBatch,
long startOffset
) {
Contributor:

I think we can tighten up this signature. We know what types are expected, so we can specify them here. That will make the code below a bit more understandable I think.

I'd also like to see some unit tests for this method. It seems like a good candidate.
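A hypothetical JUnit sketch of such a test (it assumes the method is made package-visible for testing, that batch equality is keyed on the offset range, and a (firstOffset, lastOffset, deliveryState, deliveryCount) constructor; all of these are assumptions):

import java.util.ArrayList;
import java.util.List;
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;

public class CombineStateBatchesTest {
    @Test
    public void newBatchReplacesOverlappingCurrentBatch() {
        PersisterOffsetsStateBatch current = new PersisterOffsetsStateBatch(100, 109, (byte) 0, (short) 1);
        PersisterOffsetsStateBatch updated = new PersisterOffsetsStateBatch(100, 109, (byte) 2, (short) 2);

        // Mutable copies, since the implementation removes matching entries in place.
        List<PersisterOffsetsStateBatch> combined = ShareCoordinatorShard.combineStateBatches(
            new ArrayList<>(List.of(current)), new ArrayList<>(List.of(updated)), 100L);

        // An identical offset range in the new batch should replace the old entry.
        assertEquals(List.of(updated), combined);
    }
}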

@apoorvmittal10 (Collaborator) left a comment:

Thanks for the PR.

@mumrah merged commit 821c101 into apache:trunk on Sep 10, 2024
5 of 7 checks passed
mingyen066 pushed a commit to mingyen066/kafka that referenced this pull request Sep 10, 2024
Introduces the share coordinator. This coordinator is built on the new coordinator runtime framework. It 
is responsible for persistence of share-group state in a new internal topic named "__share_group_state".
The responsibility for being a share coordinator is distributed across the brokers in a cluster. 

Reviewers: David Arthur <mumrah@gmail.com>, Andrew Schofield <aschofield@confluent.io>, Apoorv Mittal <apoorvmittal10@gmail.com>
@junrao (Contributor) left a comment:

@smjn : Thanks for the updated PR. Left a few more followup comments.

@@ -313,6 +328,15 @@ class BrokerMetadataPublisher(
} catch {
case t: Throwable => fatalFaultHandler.handleFault("Error starting TransactionCoordinator", t)
}
if (config.shareGroupConfig.isShareGroupEnabled && shareCoordinator.isDefined) {
Contributor:

In BrokerServer, we create shareCoordinator only if config.shareGroupConfig.isShareGroupEnabled is true. So, it seems there is no need to check config.shareGroupConfig.isShareGroupEnabled here?

* @return CoordinatorResult(records, response)
*/
@SuppressWarnings("NPathComplexity")
public CoordinatorResult<WriteShareGroupStateResponseData, CoordinatorRecord> writeState(
Contributor:

This method doesn't directly write the state. It generates the state to be written by the caller. So, perhaps rename to generateRecordsAndResult?


private final TimelineHashMap<SharePartitionKey, ShareGroupOffset> shareStateMap; // coord key -> ShareGroupOffset
// leaderEpochMap can be updated by writeState call
// or if a newer leader makes a readState call.
private final TimelineHashMap<SharePartitionKey, Integer> leaderEpochMap;
Contributor:

Where is the synchronization on leaderEpochMap since it's read and written by different threads? Ditto for other TimelineHashMap below.

private final TimelineHashMap<SharePartitionKey, Integer> leaderEpochMap;
private final TimelineHashMap<SharePartitionKey, Integer> snapshotUpdateCount;
private final TimelineHashMap<SharePartitionKey, Integer> stateEpochMap;
private MetadataImage metadataImage;
Contributor:

There is no synchronization on this. Should this be volatile?

shareStateMap.put(mapKey, offsetRecord);
// if number of share updates is exceeded, then reset it
if (snapshotUpdateCount.containsKey(mapKey)) {
if (snapshotUpdateCount.get(mapKey) >= config.shareCoordinatorSnapshotUpdateRecordsPerSnapshot()) {
Contributor:

Hmm, not sure I understand this. Every time we have a new snapshot, we should always reset the count to 0 independent the current count, right?

List<CoordinatorRecord> validRecords = new LinkedList<>();

WriteShareGroupStateResponseData responseData = new WriteShareGroupStateResponseData();
for (CoordinatorRecord record : recordList) { // should be single record
Contributor:

If recordList always contains a single record, why does it need to be a list?


if (shareStateMap.containsKey(mapKey) && shouldIncSnapshotEpoch) {
ShareGroupOffset oldValue = shareStateMap.get(mapKey);
((ShareSnapshotValue) record.value().message()).setSnapshotEpoch(oldValue.snapshotEpoch() + 1); // increment the snapshot epoch
Contributor:

Do we need to roll the value to 0 when it exceeds 65535?

) {
currentBatch.removeAll(newBatch);
List<PersisterOffsetsStateBatch> batchesToAdd = new LinkedList<>(currentBatch);
batchesToAdd.addAll(newBatch);
Contributor:

It would be useful to verify if this is an issue. Suppose currentBatch is
batch1 {
firstOffset: 100
lastOffset: 109
deliverState: Available
deliverCount: 1
}
and newBatch is
batch2 {
firstOffset: 105
lastOffset: 105
deliverState: Acknowledge
deliverCount: 1
}

After the call to combineStateBatches(), the share coordinator will have both batches in its state and thus the share leader could have both batches too (after initializing from ReadShareGroupState). Now suppose that the share leader fetches the following batch and calls SharePartition.acquire().

fetchedBatch{
firstOffset: 100
lastOffset: 109
}

Both batch1 and batch2 will match the fetched batch. When calling acquireSubsetBatchRecords() on batch1, we will add the full batch to AcquiredRecords. When calling acquireSubsetBatchRecords() on batch2, we will skip since the only record in it has been acked. But AcquiredRecords is unchanged after this. This means that we will return the full batch as acquired records, which is incorrect since offset 105 shouldn't be acquired.
